Scaling Agent Learning via Experience Synthesis
The DreamGym Framework

Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao,
Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li,
Jason Weston, Dat Huynh
Meta Superintelligence Labs, FAIR at Meta, University of Chicago, UC Berkeley
November 2025 | arXiv:2511.03773

Executive Summary

DreamGym introduces a breakthrough approach to training AI agents through experience synthesis. Rather than relying on expensive real-world interactions, DreamGym creates a reasoning-based "experience model" that simulates environment dynamics, enabling scalable reinforcement learning for autonomous agents at a fraction of the cost.

The framework addresses four critical challenges in agent training: costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity. By distilling environment dynamics into step-by-step reasoning, DreamGym achieves over 30% improvement on non-RL-ready tasks like WebArena while matching state-of-the-art performance on traditional benchmarks—using only synthetic experiences.

🎯 ELI5: The Core Concept

Imagine teaching a robot to navigate a shopping website. Traditionally, the robot would need to click through thousands of real web pages, which is slow and expensive. DreamGym creates a "mental simulator"—an AI that imagines what happens when the robot clicks different buttons. The robot practices in this imaginary world millions of times, learning quickly and cheaply. When it finally goes to the real website, it already knows what to do. It's like practicing driving in a video game before getting behind the wheel of a real car.

The Problem: Why Training AI Agents is So Hard

Training autonomous AI agents with reinforcement learning (RL) faces fundamental barriers that have limited practical adoption:

Four Critical Challenges

  1. Costly rollouts: Real-environment interactions are slow and expensive to collect at scale
  2. Limited task diversity: Each environment exposes only a fixed set of tasks, constraining exploration
  3. Unreliable reward signals: Noisy or hard-to-automate evaluation makes credit assignment difficult
  4. Infrastructure complexity: Realistic environments demand heavyweight deployments with manual resets

These limitations make building general-purpose, scalable RL systems an open challenge. For instance, WebArena—a realistic web navigation benchmark—is so costly to run at scale that even large research labs struggle to perform extensive RL training on it.

The DreamGym Solution: Experience Synthesis

DreamGym reframes the problem: instead of requiring perfect environment simulation, it focuses on synthesizing experiences that are sufficiently diverse, informative, and causally grounded to enable learning. The key insight is that agent training doesn't need perfectly realistic environments—it needs interaction data that teaches the right skills.

Three Core Components

1 Reasoning Experience Model: A scalable LLM-based model that operates in abstract textual state space. Rather than reproducing raw HTML or pixel data, it synthesizes clean, informative state representations. The model uses chain-of-thought reasoning to predict next states and rewards, maintaining causal consistency across multi-turn interactions.

2 Experience Replay Buffer: Initialized with offline real-world data and continuously enriched with synthetic trajectories. The buffer retrieves similar past experiences to guide current predictions, improving factuality and reducing hallucinations. It co-evolves with the agent policy to stay aligned.

3 Curriculum Task Generator: Identifies valuable tasks using reward entropy as a proxy for challenge level. Tasks with high entropy (50/50 success rate) provide maximum information gain. The generator produces progressively harder variations, creating an adaptive curriculum.
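
The entropy criterion is straightforward to operationalize. The following sketch (Python; illustrative only, with hypothetical task names and function signatures rather than the authors' code) scores each task by the binary entropy of its empirical success rate and keeps the highest-entropy tasks as seeds for new variations:

import math

def reward_entropy(successes: int, failures: int) -> float:
    """Binary entropy of a task's empirical success rate; peaks at a 50/50 split."""
    total = successes + failures
    if total == 0:
        return 0.0
    p = successes / total
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_seed_tasks(task_outcomes: dict, k: int = 8) -> list:
    """Return the k tasks with the highest reward entropy, to be varied into harder tasks."""
    scored = {task: reward_entropy(s, f) for task, (s, f) in task_outcomes.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Example: a task solved 3 of 6 times has maximal entropy and becomes a curriculum seed,
# while a task solved 6 of 6 times carries no learning signal.
seeds = select_seed_tasks({"buy_red_shoes": (3, 3), "open_gitlab_issue": (6, 0)}, k=1)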

How It Works: The Training Loop

  1. Agent takes action: The policy selects an action based on current state
  2. Experience model predicts: Using CoT reasoning, interaction history, and retrieved examples, the model predicts the next state and reward
  3. Trajectory collection: Multi-turn rollouts are collected entirely within DreamGym
  4. Policy update: Standard RL algorithms (PPO or GRPO) update the agent policy
  5. Curriculum expansion: High-entropy tasks are identified and varied to generate new challenging tasks
  6. Repeat: The cycle continues until convergence or budget is reached
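
Putting the loop in code, a minimal sketch might look like the following (Python; the experience model, replay buffer, task generator, and policy interfaces are hypothetical stand-ins for the paper's components, not its actual API):

def train_dreamgym(policy, experience_model, replay_buffer, task_generator,
                   iterations: int = 100, horizon: int = 15):
    """One possible shape of the DreamGym training loop described above."""
    tasks = task_generator.initial_tasks()
    for _ in range(iterations):
        trajectories = []
        for task in tasks:
            state, history = experience_model.reset(task), []
            for _ in range(horizon):
                action = policy.act(state)                          # 1. agent takes action
                examples = replay_buffer.retrieve(state, action)    # similar past transitions
                state_next, reward, done = experience_model.step(   # 2. CoT-based prediction
                    state, action, history=history, retrieved=examples)
                history.append((state, action, reward, state_next))
                state = state_next
                if done:
                    break
            trajectories.append(history)                            # 3. fully synthetic rollout
        replay_buffer.add(trajectories)                             # buffer co-evolves with policy
        policy.update(trajectories, algo="GRPO")                    # 4. PPO/GRPO policy update
        tasks = task_generator.expand(tasks, trajectories)          # 5. curriculum expansion
    return policy                                                   # 6. repeat until the budget is spent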

Training the Experience Model

The experience model is surprisingly data-efficient to train. For WebArena, only 4,800 offline trajectories were needed. For WebShop, 3,600 trajectories sufficed. Each transition is annotated with reasoning traces generated by a strong teacher LLM, and the experience model is then fine-tuned via supervised learning with a joint objective:

L_SFT = -log P(reasoning | context) - log P(next_state | reasoning, context)

This ensures the model learns both to generate faithful reasoning and to leverage that reasoning for consistent state prediction.
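
As a rough PyTorch-style sketch of this objective (illustrative; the tensor layout and masking conventions are assumptions, not the paper's implementation), the two terms are simply token-level cross-entropy over the reasoning span and the next-state span of a causal-LM target:

import torch
import torch.nn.functional as F

def joint_sft_loss(logits, labels, reasoning_mask, state_mask):
    """Sum of -log P(reasoning | context) and -log P(next_state | reasoning, context).

    logits: (batch, seq, vocab) causal-LM outputs over [context][reasoning][next_state]
    labels: (batch, seq) target token ids, with -100 on context positions
    reasoning_mask / state_mask: boolean masks selecting the two target spans
    """
    # Standard next-token shift for causal language modeling.
    flat_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    flat_labels = labels[:, 1:].reshape(-1)
    per_token = F.cross_entropy(flat_logits, flat_labels, ignore_index=-100, reduction="none")

    loss_reasoning = per_token[reasoning_mask[:, 1:].reshape(-1)].mean()
    loss_state = per_token[state_mask[:, 1:].reshape(-1)].mean()
    return loss_reasoning + loss_state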

Experimental Results: Dramatic Improvements

Non-RL-Ready Environments: WebArena

WebArena represents realistic web navigation across e-commerce, forums, GitLab, and content management systems. Its infrastructure makes large-scale RL training impractical—requiring expensive AWS servers with manual resets and suffering from unreliable evaluation functions.

WebArena Performance (Over 30% Improvement)

Model | Training Method | Success Rate | Real Data Used
Llama-3.2-3B | SFT (baseline) | 6.1% | 20K transitions
Llama-3.2-3B | Traditional GRPO | 7.3% | 80K transitions
Llama-3.2-3B | DreamGym | 13.3% | 0
Llama-3.1-8B | Traditional PPO | 4.8% | 80K transitions
Llama-3.1-8B | DreamGym | 10.9% | 0
Qwen-2.5-7B | DreamGym | 10.0% | 0

DreamGym provides the only viable approach for RL-based training on WebArena, delivering 82% to 127% relative improvement over baselines while using zero real environment interactions.

RL-Ready Environments: Matching SOTA with Pure Synthesis

On WebShop (e-commerce reasoning) and ALFWorld (embodied control), DreamGym demonstrates that synthetic training alone can match traditional RL methods:

Environment | Model | Traditional RL | DreamGym (0 real data) | DreamGym-S2R (5K real data)
WebShop | Llama-3.1-8B | 65.0% (GRPO) | 63.9% | 75.0%
WebShop | Qwen-2.5-7B | 68.1% (PPO) | 65.0% | 73.7%
ALFWorld | Llama-3.1-8B | 70.9% (GRPO) | 66.3% | 75.9%
ALFWorld | Qwen-2.5-7B | 81.1% (PPO) | 72.7% | 79.9%

Sim-to-Real Transfer: Best of Both Worlds

DreamGym-S2R (sim-to-real) combines synthetic pretraining with limited real-world fine-tuning. Agents first train entirely in DreamGym, acquiring broad knowledge across diverse curriculum tasks, then transfer to real environments for final polish.

Dramatic Efficiency Gains

With only 5K real interactions used for fine-tuning, DreamGym-S2R matches or exceeds traditional RL trained directly in the real environment on both WebShop and ALFWorld (e.g., 75.0% vs 65.0% for Llama-3.1-8B on WebShop), and every rollout before that stage is synthetic.

This provides a scalable warm-start strategy: bootstrap with cheap synthetic data, then fine-tune with minimal real interactions.

Why Does This Work? Theoretical Insights

DreamGym includes a theoretical analysis proving that policies trained in synthetic environments can achieve guaranteed improvement in real environments, under mild assumptions.

Policy Improvement Guarantee

The key insight: performance in the real environment depends on two learnable error terms:

  1. ε_R: how accurately the experience model predicts rewards
  2. ε_P: how accurately it predicts state transitions

Critically, these do NOT require perfect state reconstruction. The synthetic environment needs only to provide domain-consistent transitions and correct learning signals—not pixel-perfect simulation.

J_real(π') ≥ J_synthetic(π') - 2(ε_R/(1-γ) + 2γR_max*ε_P/(1-γ)²)
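
In LaTeX form, and taking ε_R and ε_P to be the worst-case reward-prediction and transition-prediction errors of the experience model (an assumption in the spirit of the standard simulation lemma; the paper's formal definitions may differ in details), the bound reads:

% Assumed error definitions (simulation-lemma style), not copied from the paper:
\epsilon_R = \sup_{s,a} \bigl| \hat{R}(s,a) - R(s,a) \bigr|, \qquad
\epsilon_P = \sup_{s,a} D_{\mathrm{TV}}\!\bigl( \hat{P}(\cdot \mid s,a),\, P(\cdot \mid s,a) \bigr)

% The bound stated above, typeset:
J_{\mathrm{real}}(\pi') \;\ge\; J_{\mathrm{syn}}(\pi')
  - 2\left( \frac{\epsilon_R}{1-\gamma} + \frac{2\gamma R_{\max}\,\epsilon_P}{(1-\gamma)^2} \right)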

This validates the design philosophy: focus on learning-relevant signals, not raw state fidelity.

Ablation Studies: What Matters Most?

Component Analysis

Configuration | WebShop Success % | WebArena Success %
Full DreamGym | 63.9 | 13.3
w/o Experience Replay | 59.2 (-4.7) | 9.7 (-3.6)
w/o Experience Reasoning | 55.8 (-8.1) | 7.3 (-6.0)
w/o Task Generation | 57.3 (-6.6) | 7.3 (-6.0)

Key Takeaways:

  1. Experience reasoning matters most: removing chain-of-thought prediction causes the largest drop (-8.1 on WebShop, -6.0 on WebArena)
  2. Curriculum task generation is the second-largest contributor on WebShop (-6.6) and ties with reasoning on WebArena (-6.0)
  3. Experience replay provides a smaller but consistent gain (-4.7 / -3.6)
  4. All three components contribute; none is redundant

Data Efficiency: How Much Offline Data is Needed?

Surprisingly little. The experience model achieves competitive performance with just 2,000-10,000 offline transitions.

This data efficiency comes from operating in abstract state space—rather than learning raw HTML or pixel dynamics, the model learns high-level semantic transitions.
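
To make the abstraction concrete, the snippet below contrasts a raw web observation with the kind of compact textual state the experience model operates on. It is purely illustrative; the exact state schema shown here is an assumption, not the paper's format:

# Raw observation: thousands of tokens of markup the agent would otherwise have to model.
raw_observation = "<html><body><div class='nav'>...</div><table id='results'><!-- ~3,000 tokens of product markup --></table></body></html>"

# Abstract textual state: the task-relevant facts and available actions, nothing else.
abstract_state = (
    "Page: search results for 'wireless mouse' (24 items, page 1 of 3).\n"
    "Available actions: sort_by_price, open_item(3), next_page.\n"
    "Cart: empty. Goal: buy a wireless mouse under $15; no item selected yet."
)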

Cross-Domain Transfer: How General Are Learned Policies?

DreamGym demonstrates remarkable generalization within similar domains:

Transfer Learning Results

This suggests DreamGym learns domain-agnostic behavioral priors within similar task families, rather than memorizing task-specific patterns.

Practical Engineering Implications

When Should You Use DreamGym?

Ideal Use Cases

  1. Non-RL-ready environments such as WebArena, where rollouts are costly, resets are manual, and reward signals are unreliable
  2. Domains where real interaction data is scarce or expensive but a few thousand offline trajectories are available to seed the experience model
  3. Warm-starting agents (DreamGym-S2R) before a limited budget of real-environment fine-tuning

Implementation Considerations

Computational Requirements: Experiments used 8 nodes with A100 GPUs and 4 nodes with H100 GPUs. However, the abstract state space makes experience model training surprisingly efficient—most compute goes to policy training, not environment simulation.

Hyperparameters: The framework includes a hyperparameter λ that bounds the proportion of synthetic tasks per iteration, balancing original task coverage with curriculum-driven exploration. Typical values: λ ∈ [0.2, 0.5].
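
As a rough illustration of how such a bound might be applied (the mixing rule below is an assumption; only the λ cap itself comes from the framework description), each iteration's task batch can keep all original tasks and admit generated tasks only up to the λ share:

import random

def build_task_batch(original_tasks, generated_tasks, lam: float = 0.3):
    """Keep all original tasks; add generated ones so they make up at most lam of the batch."""
    max_generated = int(lam * len(original_tasks) / (1 - lam))  # generated / total <= lam
    chosen = random.sample(generated_tasks, min(max_generated, len(generated_tasks)))
    return original_tasks + chosen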

Experience Model Choice: All main results use Llama-3.1-8B as the experience model backbone. Smaller models (3B) work but with reduced performance. Domain-specific pretraining (like WebDreamer for web tasks) helps at low data scales but converges to similar performance with more data.

Limitations and Future Directions

Known Limitations

  1. Experience models are trained per environment; a single universal model across domains remains future work
  2. Purely synthetic training still trails traditional RL on some RL-ready benchmarks (e.g., 72.7% vs 81.1% on ALFWorld with Qwen-2.5-7B)
  3. Training the experience model requires a seed set of offline trajectories plus reasoning annotations from a strong teacher LLM
  4. Abstract textual states omit low-level observation detail by design, so peak performance still benefits from limited real-environment fine-tuning (DreamGym-S2R)

Future Research Directions

  1. Universal experience models: Train a single model across multiple environments to enable cross-domain transfer and zero-shot adaptation
  2. Improved curriculum learning: Beyond reward entropy, incorporate novelty search, surprise, or intrinsic motivation for task generation
  3. Memory and error recovery: Integrate external memory systems to enable agents to recognize and correct mistakes, potentially breaking the constant hazard rate
  4. Multi-agent scenarios: Extend to competitive or cooperative settings where multiple agents interact
  5. Real-world deployment: Test on production systems beyond academic benchmarks

Conclusion

DreamGym represents a paradigm shift in agent training: moving from expensive real-environment sampling to scalable synthetic experience synthesis. By focusing on learning-relevant signals rather than perfect simulation, it achieves dramatic improvements in both cost and performance.

The framework's success validates a key insight: agent training doesn't require perfectly realistic environments, but rather interaction data that is sufficiently diverse, informative, and causally grounded. This opens the door to training general-purpose agents at scale, previously limited by infrastructure and cost constraints.

For practitioners, the message is clear: invest in experience models for high-cost or non-RL-ready environments. For researchers, the challenge is extending these ideas to universal world models that enable zero-shot adaptation across arbitrary domains. The era of scalable agent learning via experience synthesis has begun.

Primary Source

Chen et al.: "Scaling Agent Learning via Experience Synthesis"
arXiv:2511.03773 [cs.AI], November 2025

Authors: Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh

Affiliations: Meta Superintelligence Labs, FAIR at Meta, University of Chicago, UC Berkeley