Scaling Agent Learning via Experience Synthesis
The DreamGym Framework

Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao,
Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li,
Jason Weston, Dat Huynh
Meta Superintelligence Labs, FAIR at Meta, University of Chicago, UC Berkeley
November 2025 | arXiv:2511.03773

Executive Summary

DreamGym introduces a breakthrough approach to training AI agents through experience synthesis. Rather than relying on expensive real-world interactions, DreamGym creates a reasoning-based "experience model" that simulates environment dynamics, enabling scalable reinforcement learning for autonomous agents at a fraction of the cost.

The framework addresses four critical challenges in agent training: costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity. By distilling environment dynamics into step-by-step reasoning, DreamGym achieves over 30% improvement on non-RL-ready tasks like WebArena while matching state-of-the-art performance on traditional benchmarks—using only synthetic experiences.

🎯 ELI5: The Core Concept

Imagine teaching a robot to navigate a shopping website. Traditionally, the robot would need to click through thousands of real web pages, which is slow and expensive. DreamGym creates a "mental simulator"—an AI that imagines what happens when the robot clicks different buttons. The robot practices in this imaginary world millions of times, learning quickly and cheaply. When it finally goes to the real website, it already knows what to do. It's like practicing driving in a video game before getting behind the wheel of a real car.

The Problem: Why Training AI Agents is So Hard

Training autonomous AI agents with reinforcement learning (RL) faces fundamental barriers that have limited practical adoption:

Four Critical Challenges

  1. Costly rollouts: Real-environment interactions are slow and expensive to collect at scale
  2. Limited task diversity: Each environment exposes only a fixed set of tasks, constraining exploration
  3. Unreliable reward signals: Noisy or hard-to-automate evaluation makes credit assignment difficult
  4. Infrastructure complexity: Realistic environments demand heavyweight deployments with manual resets

These limitations make building general-purpose, scalable RL systems an open challenge. For instance, WebArena—a realistic web navigation benchmark—is so costly to run at scale that even large research labs struggle to perform extensive RL training on it.

The DreamGym Solution: Experience Synthesis

DreamGym reframes the problem: instead of requiring perfect environment simulation, it focuses on synthesizing experiences that are sufficiently diverse, informative, and causally grounded to enable learning. The key insight is that agent training doesn't need perfectly realistic environments—it needs interaction data that teaches the right skills.

Three Core Components

1 Reasoning Experience Model: A scalable LLM-based model that operates in abstract textual state space. Rather than reproducing raw HTML or pixel data, it synthesizes clean, informative state representations. The model uses chain-of-thought reasoning to predict next states and rewards, maintaining causal consistency across multi-turn interactions.

2 Experience Replay Buffer: Initialized with offline real-world data and continuously enriched with synthetic trajectories. The buffer retrieves similar past experiences to guide current predictions, improving factuality and reducing hallucinations. It co-evolves with the agent policy to stay aligned.

3 Curriculum Task Generator: Identifies valuable tasks using reward entropy as a proxy for challenge level. Tasks with high entropy (50/50 success rate) provide maximum information gain. The generator produces progressively harder variations, creating an adaptive curriculum.
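
The entropy criterion is straightforward to operationalize. The following sketch (Python; illustrative only, with hypothetical task names and function signatures rather than the authors' code) scores each task by the binary entropy of its empirical success rate and keeps the highest-entropy tasks as seeds for new variations:

import math

def reward_entropy(successes: int, failures: int) -> float:
    """Binary entropy of a task's empirical success rate; peaks at a 50/50 split."""
    total = successes + failures
    if total == 0:
        return 0.0
    p = successes / total
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_seed_tasks(task_outcomes: dict, k: int = 8) -> list:
    """Return the k tasks with the highest reward entropy, to be varied into harder tasks."""
    scored = {task: reward_entropy(s, f) for task, (s, f) in task_outcomes.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Example: a task solved 3 of 6 times has maximal entropy and becomes a curriculum seed,
# while a task solved 6 of 6 times carries no learning signal.
seeds = select_seed_tasks({"buy_red_shoes": (3, 3), "open_gitlab_issue": (6, 0)}, k=1)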

How It Works: The Training Loop

  1. Agent takes action: The policy selects an action based on current state
  2. Experience model predicts: Using CoT reasoning, interaction history, and retrieved examples, the model predicts the next state and reward
  3. Trajectory collection: Multi-turn rollouts are collected entirely within DreamGym
  4. Policy update: Standard RL algorithms (PPO or GRPO) update the agent policy
  5. Curriculum expansion: High-entropy tasks are identified and varied to generate new challenging tasks
  6. Repeat: The cycle continues until convergence or budget is reached
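
Putting the loop in code, a minimal sketch might look like the following (Python; the experience model, replay buffer, task generator, and policy interfaces are hypothetical stand-ins for the paper's components, not its actual API):

def train_dreamgym(policy, experience_model, replay_buffer, task_generator,
                   iterations: int = 100, horizon: int = 15):
    """One possible shape of the DreamGym training loop described above."""
    tasks = task_generator.initial_tasks()
    for _ in range(iterations):
        trajectories = []
        for task in tasks:
            state, history = experience_model.reset(task), []
            for _ in range(horizon):
                action = policy.act(state)                          # 1. agent takes action
                examples = replay_buffer.retrieve(state, action)    # similar past transitions
                state_next, reward, done = experience_model.step(   # 2. CoT-based prediction
                    state, action, history=history, retrieved=examples)
                history.append((state, action, reward, state_next))
                state = state_next
                if done:
                    break
            trajectories.append(history)                            # 3. fully synthetic rollout
        replay_buffer.add(trajectories)                             # buffer co-evolves with policy
        policy.update(trajectories, algo="GRPO")                    # 4. PPO/GRPO policy update
        tasks = task_generator.expand(tasks, trajectories)          # 5. curriculum expansion
    return policy                                                   # 6. repeat until the budget is spent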

Training the Experience Model

The experience model is surprisingly data-efficient to train. For WebArena, only 4,800 offline trajectories were needed. For WebShop, 3,600 trajectories sufficed. Each transition is annotated with reasoning traces generated by a strong teacher LLM, and the experience model is then fine-tuned via supervised learning with a joint objective:

L_SFT = -log P(reasoning | context) - log P(next_state | reasoning, context)

This ensures the model learns both to generate faithful reasoning and to leverage that reasoning for consistent state prediction.
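
As a rough PyTorch-style sketch of this objective (illustrative; the tensor layout and masking conventions are assumptions, not the paper's implementation), the two terms are simply token-level cross-entropy over the reasoning span and the next-state span of a causal-LM target:

import torch
import torch.nn.functional as F

def joint_sft_loss(logits, labels, reasoning_mask, state_mask):
    """Sum of -log P(reasoning | context) and -log P(next_state | reasoning, context).

    logits: (batch, seq, vocab) causal-LM outputs over [context][reasoning][next_state]
    labels: (batch, seq) target token ids, with -100 on context positions
    reasoning_mask / state_mask: boolean masks selecting the two target spans
    """
    # Standard next-token shift for causal language modeling.
    flat_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    flat_labels = labels[:, 1:].reshape(-1)
    per_token = F.cross_entropy(flat_logits, flat_labels, ignore_index=-100, reduction="none")

    loss_reasoning = per_token[reasoning_mask[:, 1:].reshape(-1)].mean()
    loss_state = per_token[state_mask[:, 1:].reshape(-1)].mean()
    return loss_reasoning + loss_state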

Experimental Results: Dramatic Improvements

Non-RL-Ready Environments: WebArena

WebArena represents realistic web navigation across e-commerce, forums, GitLab, and content management systems. Its infrastructure makes large-scale RL training impractical—requiring expensive AWS servers with manual resets and suffering from unreliable evaluation functions.

WebArena Performance (Over 30% Improvement)

Model | Training Method | Success Rate | Real Data Used
Llama-3.2-3B | SFT (baseline) | 6.1% | 20K transitions
Llama-3.2-3B | Traditional GRPO | 7.3% | 80K transitions
Llama-3.2-3B | DreamGym | 13.3% | 0
Llama-3.1-8B | Traditional PPO | 4.8% | 80K transitions
Llama-3.1-8B | DreamGym | 10.9% | 0
Qwen-2.5-7B | DreamGym | 10.0% | 0

DreamGym provides the only viable approach for RL-based training on WebArena, delivering 82% to 127% relative improvement over baselines while using zero real environment interactions.

RL-Ready Environments: Matching SOTA with Pure Synthesis

On WebShop (e-commerce reasoning) and ALFWorld (embodied control), DreamGym demonstrates that synthetic training alone can match traditional RL methods:

Environment | Model | Traditional RL | DreamGym (0 real data) | DreamGym-S2R (5K real data)
WebShop | Llama-3.1-8B | 65.0% (GRPO) | 63.9% | 75.0%
WebShop | Qwen-2.5-7B | 68.1% (PPO) | 65.0% | 73.7%
ALFWorld | Llama-3.1-8B | 70.9% (GRPO) | 66.3% | 75.9%
ALFWorld | Qwen-2.5-7B | 81.1% (PPO) | 72.7% | 79.9%

Sim-to-Real Transfer: Best of Both Worlds

DreamGym-S2R (sim-to-real) combines synthetic pretraining with limited real-world fine-tuning. Agents first train entirely in DreamGym, acquiring broad knowledge across diverse curriculum tasks, then transfer to real environments for final polish.

Dramatic Efficiency Gains

With only 5K real interactions used for fine-tuning, DreamGym-S2R matches or exceeds traditional RL trained directly in the real environment on both WebShop and ALFWorld (e.g., 75.0% vs 65.0% for Llama-3.1-8B on WebShop), and every rollout before that stage is synthetic.

This provides a scalable warm-start strategy: bootstrap with cheap synthetic data, then fine-tune with minimal real interactions.

Why Does This Work? Theoretical Insights

DreamGym includes a theoretical analysis proving that policies trained in synthetic environments can achieve guaranteed improvement in real environments, under mild assumptions.

Policy Improvement Guarantee

The key insight: performance in the real environment depends on two learnable error terms:

  1. ε_R: how accurately the experience model predicts rewards
  2. ε_P: how accurately it predicts state transitions

Critically, these do NOT require perfect state reconstruction. The synthetic environment needs only to provide domain-consistent transitions and correct learning signals—not pixel-perfect simulation.

J_real(π') ≥ J_synthetic(π') - 2(ε_R/(1-γ) + 2γR_max*ε_P/(1-γ)²)
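
In LaTeX form, and taking ε_R and ε_P to be the worst-case reward-prediction and transition-prediction errors of the experience model (an assumption in the spirit of the standard simulation lemma; the paper's formal definitions may differ in details), the bound reads:

% Assumed error definitions (simulation-lemma style), not copied from the paper:
\epsilon_R = \sup_{s,a} \bigl| \hat{R}(s,a) - R(s,a) \bigr|, \qquad
\epsilon_P = \sup_{s,a} D_{\mathrm{TV}}\!\bigl( \hat{P}(\cdot \mid s,a),\, P(\cdot \mid s,a) \bigr)

% The bound stated above, typeset:
J_{\mathrm{real}}(\pi') \;\ge\; J_{\mathrm{syn}}(\pi')
  - 2\left( \frac{\epsilon_R}{1-\gamma} + \frac{2\gamma R_{\max}\,\epsilon_P}{(1-\gamma)^2} \right)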

This validates the design philosophy: focus on learning-relevant signals, not raw state fidelity.

Ablation Studies: What Matters Most?

Component Analysis

Configuration | WebShop Success % | WebArena Success %
Full DreamGym | 63.9 | 13.3
w/o Experience Replay | 59.2 (-4.7) | 9.7 (-3.6)
w/o Experience Reasoning | 55.8 (-8.1) | 7.3 (-6.0)
w/o Task Generation | 57.3 (-6.6) | 7.3 (-6.0)

Key Takeaways:

  1. Experience reasoning matters most: removing chain-of-thought prediction causes the largest drop (-8.1 on WebShop, -6.0 on WebArena)
  2. Curriculum task generation is the second-largest contributor on WebShop (-6.6) and ties with reasoning on WebArena (-6.0)
  3. Experience replay provides a smaller but consistent gain (-4.7 / -3.6)
  4. All three components contribute; none is redundant

Data Efficiency: How Much Offline Data is Needed?

Surprisingly little. The experience model achieves competitive performance with just 2,000-10,000 offline transitions.

This data efficiency comes from operating in abstract state space—rather than learning raw HTML or pixel dynamics, the model learns high-level semantic transitions.
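
To make the abstraction concrete, the snippet below contrasts a raw web observation with the kind of compact textual state the experience model operates on. It is purely illustrative; the exact state schema shown here is an assumption, not the paper's format:

# Raw observation: thousands of tokens of markup the agent would otherwise have to model.
raw_observation = "<html><body><div class='nav'>...</div><table id='results'><!-- ~3,000 tokens of product markup --></table></body></html>"

# Abstract textual state: the task-relevant facts and available actions, nothing else.
abstract_state = (
    "Page: search results for 'wireless mouse' (24 items, page 1 of 3).\n"
    "Available actions: sort_by_price, open_item(3), next_page.\n"
    "Cart: empty. Goal: buy a wireless mouse under $15; no item selected yet."
)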

Cross-Domain Transfer: How General Are Learned Policies?

DreamGym demonstrates remarkable generalization within similar domains:

Transfer Learning Results

This suggests DreamGym learns domain-agnostic behavioral priors within similar task families, rather than memorizing task-specific patterns.

Practical Engineering Implications

When Should You Use DreamGym?

Ideal Use Cases

  1. Non-RL-ready environments such as WebArena, where rollouts are costly, resets are manual, and reward signals are unreliable
  2. Domains where real interaction data is scarce or expensive but a few thousand offline trajectories are available to seed the experience model
  3. Warm-starting agents (DreamGym-S2R) before a limited budget of real-environment fine-tuning

Implementation Considerations

Computational Requirements: Experiments used 8 nodes with A100 GPUs and 4 nodes with H100 GPUs. However, the abstract state space makes experience model training surprisingly efficient—most compute goes to policy training, not environment simulation.

Hyperparameters: The framework includes a hyperparameter λ that bounds the proportion of synthetic tasks per iteration, balancing original task coverage with curriculum-driven exploration. Typical values: λ ∈ [0.2, 0.5].
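
As a rough illustration of how such a bound might be applied (the mixing rule below is an assumption; only the λ cap itself comes from the framework description), each iteration's task batch can keep all original tasks and admit generated tasks only up to the λ share:

import random

def build_task_batch(original_tasks, generated_tasks, lam: float = 0.3):
    """Keep all original tasks; add generated ones so they make up at most lam of the batch."""
    max_generated = int(lam * len(original_tasks) / (1 - lam))  # generated / total <= lam
    chosen = random.sample(generated_tasks, min(max_generated, len(generated_tasks)))
    return original_tasks + chosen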

Experience Model Choice: All main results use Llama-3.1-8B as the experience model backbone. Smaller models (3B) work but with reduced performance. Domain-specific pretraining (like WebDreamer for web tasks) helps at low data scales but converges to similar performance with more data.

Limitations and Future Directions

Known Limitations

  1. Experience models are trained per environment; a single universal model across domains remains future work
  2. Purely synthetic training still trails traditional RL on some RL-ready benchmarks (e.g., 72.7% vs 81.1% on ALFWorld with Qwen-2.5-7B)
  3. Training the experience model requires a seed set of offline trajectories plus reasoning annotations from a strong teacher LLM
  4. Abstract textual states omit low-level observation detail by design, so peak performance still benefits from limited real-environment fine-tuning (DreamGym-S2R)

Future Research Directions

  1. Universal experience models: Train a single model across multiple environments to enable cross-domain transfer and zero-shot adaptation
  2. Improved curriculum learning: Beyond reward entropy, incorporate novelty search, surprise, or intrinsic motivation for task generation
  3. Memory and error recovery: Integrate external memory systems to enable agents to recognize and correct mistakes, potentially breaking the constant hazard rate
  4. Multi-agent scenarios: Extend to competitive or cooperative settings where multiple agents interact
  5. Real-world deployment: Test on production systems beyond academic benchmarks

Conclusion

DreamGym represents a paradigm shift in agent training: moving from expensive real-environment sampling to scalable synthetic experience synthesis. By focusing on learning-relevant signals rather than perfect simulation, it achieves dramatic improvements in both cost and performance.

The framework's success validates a key insight: agent training doesn't require perfectly realistic environments, but rather interaction data that is sufficiently diverse, informative, and causally grounded. This opens the door to training general-purpose agents at scale, previously limited by infrastructure and cost constraints.

For practitioners, the message is clear: invest in experience models for high-cost or non-RL-ready environments. For researchers, the challenge is extending these ideas to universal world models that enable zero-shot adaptation across arbitrary domains. The era of scalable agent learning via experience synthesis has begun.

Primary Source

Chen et al.: "Scaling Agent Learning via Experience Synthesis"
arXiv:2511.03773 [cs.AI], November 2025

Authors: Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh

Affiliations: Meta Superintelligence Labs, FAIR at Meta, University of Chicago, UC Berkeley