Darwin Gödel Machine
Open-Ended Evolution of Self-Improving Agents

Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune
University of British Columbia, Vector Institute, Sakana AI, Canada CIFAR AI Chair
May 2025

Executive Summary

The Darwin Gödel Machine (DGM) introduces a practical framework for AI systems that autonomously modify their own code to enhance problem-solving capabilities. Unlike theoretical Gödel Machines that require formal mathematical proofs of improvement, the DGM validates modifications empirically using coding benchmarks—making self-improvement tractable in real-world settings.

The framework achieves remarkable results: on SWE-bench verified tasks, performance increased from 20.0% to 50.0% (a +150% relative improvement). On the Polyglot multilingual benchmark, performance improved from 14.2% to 30.7% (+116% relative improvement). These gains transfer robustly across different foundation models, benchmarks, and programming languages.

DGM combines three essential elements: self-referential improvement (agents modify their own Python codebases), open-ended exploration (maintaining an archive of all discovered agents as stepping stones), and iterative cycles alternating between self-modification and evaluation phases. The entire codebase is open-sourced for community verification and extension.

ELI5: AI That Rewrites Its Own Playbook

Imagine a chess player who, instead of just practicing more games, could actually rewrite the rules in their own brain about how to evaluate positions and choose moves. The Darwin Gödel Machine is like that—it's an AI that can look at its own code (its "playbook"), figure out what's not working well, and write better code for itself. The clever part is that it keeps a library of all the different versions of itself that worked reasonably well, so if one self-improvement turns out to be a dead end, it can branch off from a different version instead of being stuck.

Darwin Gödel Machine Architecture
Figure 1: Darwin Gödel Machine overview. The DGM iteratively builds a growing archive of agents by interleaving self-modification with downstream task evaluation. Each cycle selects an agent from the archive, generates modifications via an LLM, evaluates the modified agent on benchmarks, and archives successful variants.

Part 1: From Theoretical to Practical Self-Improvement

The concept of self-improving AI systems dates back to Jürgen Schmidhuber's Gödel Machines (2003)—theoretical systems that modify their own source code only after identifying a formal proof that the modification will improve performance. This requirement for mathematical proofs, while elegant, has severely limited practical applications since formal proofs of improvement are generally intractable in complex environments.

The DGM Breakthrough: Empirical Validation

The Darwin Gödel Machine replaces formal proofs with empirical validation: a proposed self-modification is kept only if the modified agent demonstrably performs well on coding benchmarks. This seemingly simple change makes practical self-improvement tractable.

The choice of coding benchmarks is deliberate: since agents modify their own Python code, improvements on coding tasks directly translate to better self-modification ability. This creates a virtuous cycle where better coding leads to better self-improvement leads to even better coding.

Part 2: The DGM Framework

The Darwin Gödel Machine operates through three interconnected components that together enable open-ended self-improvement:

Three Pillars of Self-Improvement

  1. Self-Referential Improvement: Agents directly modify their own Python codebases, with performance gains on coding tasks reflecting enhanced self-modification capabilities
  2. Open-Ended Exploration: The system maintains an archive of ALL discovered agents, enabling branching from diverse stepping stones rather than hill-climbing from a single best solution
  3. Iterative Cycles: The framework alternates between self-modification phases (where selected agents generate modified versions) and evaluation phases (testing on benchmarks)
SWE-bench Performance
Figure 2: Performance progression on SWE-bench Verified. The DGM (blue) continues improving while ablations without self-improvement (orange) or without open-ended exploration (green) plateau or regress. Self-improvement and open-ended exploration enable continued progress.

The Evolutionary Loop

Each DGM cycle follows a structured process:

Phase        | Action                     | Purpose
-------------|----------------------------|------------------------------------------------
Selection    | Choose agent from archive  | Pick promising starting point for modification
Analysis     | LLM examines agent code    | Identify potential improvements
Modification | LLM generates code changes | Create new agent variant
Evaluation   | Test on benchmarks         | Measure empirical performance
Archival     | Store if viable            | Expand stepping stone library
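The cycle above can be sketched in a few lines of Python. All names here (`Agent`, `propose_patch`, `evaluate`) are illustrative stand-ins for the real components, not the paper's actual API:

```python
import random

random.seed(0)  # make this toy demo deterministic

class Agent:
    """Illustrative agent record; the real DGM agent is a full Python codebase."""
    def __init__(self, code, score):
        self.code = code    # the agent's own source code
        self.score = score  # benchmark accuracy in [0, 1]

def propose_patch(code):
    # Stand-in for the LLM that analyzes and rewrites an agent's code.
    return code + "\n# new tool added by self-modification"

def evaluate(code):
    # Stand-in for empirical benchmark evaluation (e.g., a SWE-bench subset).
    return random.random()

def dgm_cycle(archive):
    # Selection: sample a parent, weighting higher-scoring agents more heavily.
    parent = random.choices(archive, weights=[a.score + 0.01 for a in archive])[0]
    # Modification: generate a changed variant of the parent's code.
    child_code = propose_patch(parent.code)
    # Evaluation: measure the child empirically rather than proving improvement.
    child = Agent(child_code, evaluate(child_code))
    # Archival: keep the variant as a potential stepping stone for later cycles.
    archive.append(child)
    return archive

archive = [Agent("# seed agent", 0.20)]
for _ in range(5):
    archive = dgm_cycle(archive)
```

Note that nothing is ever deleted from the archive: even low-scoring children remain available as parents for later cycles.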

Why Open-Ended Exploration Matters

Traditional optimization keeps only the current best solution and improves from there. The DGM instead maintains an archive of ALL viable agents. This matters because the path to a strong agent often passes through weaker intermediates: a modification that temporarily lowers performance can still be the stepping stone to a later breakthrough, and the archive lets the system branch from it rather than discard it.

Ablation studies confirm this: without open-ended exploration, poor modifications trap the system in degraded states with no recovery path.
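Archive-based parent selection can be illustrated with a small sketch. The weighting scheme below is our assumption for illustration (the paper's exact formula differs): heavily explored agents are down-weighted so that promising but less-visited stepping stones still receive attention.

```python
import random

def select_parent(archive, rng):
    # archive entries: (benchmark_score, number_of_children_already_spawned)
    # Down-weight agents that have already been expanded many times.
    weights = [score / (1 + children) for score, children in archive]
    return rng.choices(range(len(archive)), weights=weights)[0]

rng = random.Random(0)
archive = [(0.50, 10),  # best agent, but already heavily explored
           (0.30, 0)]   # weaker stepping stone, never expanded
picks = [select_parent(archive, rng) for _ in range(1000)]
# Greedy selection would always pick index 0; here the fresh agent dominates.
```

A purely greedy selector would never revisit the weaker agent, which is exactly the failure mode the ablation without the open-ended archive exhibits.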

Initial Agent Architecture

The seed agent starts with a minimal set of standard tool-based capabilities for reading, editing, and executing code, orchestrated by a frozen foundation model.

The DGM automatically discovers improvements to both the tools AND the workflow of how foundation models are utilized—emergent enhancements included granular file viewing, precise string-replacement editing, and multi-attempt solving strategies.
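The "precise string-replacement editing" the DGM discovered can be illustrated with a sketch like the one below. The function name and exact behavior are our illustration, not the discovered agent's actual code; the key idea is refusing ambiguous edits instead of rewriting whole files.

```python
def edit_file(path, old, new):
    """Replace exactly one occurrence of `old` in the file with `new`.

    Requiring a unique match makes LLM-driven edits far safer than
    regenerating an entire file: an ambiguous or missing target is an
    error rather than a silent corruption.
    """
    with open(path) as f:
        text = f.read()
    matches = text.count(old)
    if matches != 1:
        raise ValueError(f"expected exactly one match for {old!r}, found {matches}")
    with open(path, "w") as f:
        f.write(text.replace(old, new))
```

A tool like this gives the foundation model a narrow, verifiable action, which is plausibly why such edits outperform whole-file rewrites on long files.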

Part 3: Benchmark Results & Transfer Learning

Primary Benchmark Results

Benchmark                 | Initial | Final | Improvement
--------------------------|---------|-------|---------------
SWE-bench Verified        | 20.0%   | 50.0% | +150% relative
Polyglot (multi-language) | 14.2%   | 30.7% | +116% relative

SWE-bench tests real-world GitHub issue resolution. Polyglot evaluates coding across multiple programming languages.
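As a quick sanity check, the relative-improvement figures quoted above follow directly from the raw scores:

```python
# Relative improvement = (final - initial) / initial, as a percentage.
def relative_gain_pct(initial, final):
    return (final - initial) / initial * 100

swe_bench = relative_gain_pct(0.200, 0.500)  # SWE-bench Verified: 150.0
polyglot = relative_gain_pct(0.142, 0.307)   # Polyglot: ~116.2
```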

Ablation Studies: Both Components Essential

Two ablation baselines revealed that both self-improvement AND open-ended exploration are necessary:

Configuration              | Result                     | Explanation
---------------------------|----------------------------|---------------------------------------
Full DGM                   | Continuous improvement     | Both components work synergistically
Without Self-Improvement   | Early plateau              | Replicates prior ADAS-style approaches
Without Open-Ended Archive | Trapped in degraded states | No recovery from poor modifications
Archive Visualization
Figure 3: Archive of generated agents visualized as a tree. Node size represents benchmark performance. Many paths to innovation traverse through lower-performing intermediate nodes, demonstrating the value of open-ended exploration over greedy hill-climbing.

Transfer Learning: Remarkable Generalization

Discovered agents demonstrate robust transfer across foundation models, benchmarks, and programming languages beyond those used during the self-improvement run.

Cross-Domain Transfer Results

Transfer Learning Results
Figure 4: Transfer performance across different foundation models, benchmarks, and programming languages. The discovered agents generalize well beyond their training conditions.

Part 4: Safety Considerations & Limitations

Safety Measures Implemented

The researchers implemented multiple safeguards to mitigate risks from self-modifying AI, such as sandboxed execution of agent code and human oversight of the self-modification loop, with modifications confined to the agent's own Python codebase.

The authors explicitly acknowledge potential risks: "Modifications optimized solely for benchmark performance might inadvertently introduce vulnerabilities or behaviors misaligned with human intentions." This honest assessment of limitations is crucial for responsible AI development.
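One basic safeguard of this kind can be sketched as process isolation with a hard timeout, so that a broken or runaway self-modification cannot corrupt or hang the outer loop. This is an illustrative sketch, not the paper's exact setup:

```python
import subprocess
import sys

def run_candidate(agent_script_path, timeout_s=30):
    """Run a self-modified candidate agent in a separate process.

    Isolation means a crash only fails this candidate; the timeout means a
    hung candidate is treated as a failed modification rather than stalling
    the evolutionary loop.
    """
    try:
        result = subprocess.run(
            [sys.executable, agent_script_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # a hung candidate counts as a failure
```

In practice one would add stricter isolation (containers, restricted filesystem and network access), but the process boundary is the minimal version of the idea.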

Current Limitations

Limitation              | Impact                                              | Future Direction
------------------------|-----------------------------------------------------|----------------------------------------
Compute intensity       | Single run requires ~2 weeks plus significant API costs | More efficient search algorithms
Frozen foundation model | Cannot improve the underlying LLM itself            | Training-script modification
Fixed objectives        | Benchmark-specific optimization                     | Co-evolved evaluation criteria
Domain scope            | Limited to coding tasks                             | Expand to vision, creative writing, etc.

Discovered Enhancements

The DGM automatically discovered several workflow improvements of the kind human engineers might design by hand, including granular file viewing, precise string-replacement editing, and multi-attempt solving strategies.

Implications for AI Development

The DGM represents a significant step toward automating AI development itself. Key insights:

  1. Self-acceleration: Systems can iteratively enhance their own capabilities without manual tuning
  2. Empirical grounding: Real-world performance replaces intractable formal proofs
  3. Diversity preservation: Maintaining stepping stones prevents catastrophic forgetting of capabilities
  4. Transfer learning: Improvements generalize beyond the specific training conditions

This suggests pathways toward self-accelerating AI development that remains aligned with human values through careful boundary setting and oversight.

Conclusion

The Darwin Gödel Machine bridges the gap between theoretical self-improvement concepts and practical implementation. By replacing formal proofs with empirical validation and maintaining diverse agent archives, DGM achieves sustained, transferable gains: 20.0% to 50.0% on SWE-bench Verified and 14.2% to 30.7% on Polyglot, with improvements that carry over to other foundation models and programming languages.

The open-sourced codebase enables community verification and extension, establishing DGM as a foundation for future research in self-improving AI systems. While significant challenges remain—particularly around compute costs, safety guarantees, and broader domain applicability—DGM demonstrates that practical self-improvement is achievable with current technology.

Primary Sources

Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune — arXiv:2505.22954, May 2025

GitHub Repository: jennyzzt/dgm
Open-source implementation for community verification and extension

Related: ProFiT (Program Search for Financial Trading)
Applies DGM-inspired concepts to algorithmic trading strategy evolution