DreamGym introduces a breakthrough approach to training AI agents through synthetic experience synthesis. Rather than relying on expensive real-world interactions, DreamGym creates a reasoning-based "experience model" that simulates environment dynamics, enabling scalable reinforcement learning for autonomous agents at a fraction of the cost.
The framework addresses four critical challenges in agent training: costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity. By distilling environment dynamics into step-by-step reasoning, DreamGym achieves over 30% improvement on non-RL-ready tasks like WebArena while matching state-of-the-art performance on traditional benchmarks—using only synthetic experiences.
Imagine teaching a robot to navigate a shopping website. Traditionally, the robot would need to click through thousands of real web pages, which is slow and expensive. DreamGym creates a "mental simulator"—an AI that imagines what happens when the robot clicks different buttons. The robot practices in this imaginary world millions of times, learning quickly and cheaply. When it finally goes to the real website, it already knows what to do. It's like practicing driving in a video game before getting behind the wheel of a real car.
Training autonomous AI agents with reinforcement learning (RL) faces fundamental barriers that have limited practical adoption:

- Costly rollouts: every real interaction (loading web pages, resetting environments) is slow and expensive.
- Limited task diversity: benchmarks expose a fixed, narrow set of tasks, capping what agents can learn.
- Unreliable reward signals: noisy or sparse evaluation functions make credit assignment difficult.
- Infrastructure complexity: realistic environments require heavy, hard-to-scale serving and manual resets.
These limitations make building general-purpose, scalable RL systems an open challenge. For instance, WebArena—a realistic web navigation benchmark—is so costly to run at scale that even large research labs struggle to perform extensive RL training on it.
DreamGym reframes the problem: instead of requiring perfect environment simulation, it focuses on synthesizing experiences that are sufficiently diverse, informative, and causally grounded to enable learning. The key insight is that agent training doesn't need perfectly realistic environments—it needs interaction data that teaches the right skills.
1. Reasoning Experience Model: A scalable LLM-based model that operates in an abstract textual state space. Rather than reproducing raw HTML or pixel data, it synthesizes clean, informative state representations. The model uses chain-of-thought reasoning to predict next states and rewards, maintaining causal consistency across multi-turn interactions.
2. Experience Replay Buffer: Initialized with offline real-world data and continuously enriched with synthetic trajectories. The buffer retrieves similar past experiences to guide current predictions, improving factuality and reducing hallucinations. It co-evolves with the agent policy to stay aligned.
3. Curriculum Task Generator: Identifies valuable tasks using reward entropy as a proxy for challenge level. Tasks with high entropy (around a 50/50 success rate) provide maximum information gain. The generator produces progressively harder variations, creating an adaptive curriculum. (A sketch of how the three components interact follows below.)
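To make the interplay concrete, here is a minimal sketch of a DreamGym-style training loop. All objects and method names (`policy`, `experience_model`, `replay_buffer`, `task_generator`, and their methods) are hypothetical stand-ins for illustration, not the paper's API:

```python
# Minimal sketch of how the three components could be wired together.
# Every class and method here is a hypothetical stand-in, not the paper's code.

def train_in_dreamgym(policy, experience_model, replay_buffer, task_generator,
                      num_iterations=1000, rollouts_per_task=8):
    for _ in range(num_iterations):
        # 1. Curriculum: pick tasks whose historical reward entropy is high
        #    (roughly 50/50 success), i.e. maximally informative for the policy.
        tasks = task_generator.sample_tasks(replay_buffer)

        trajectories = []
        for task in tasks:
            for _ in range(rollouts_per_task):
                state = experience_model.reset(task)
                traj, done = [], False
                while not done:
                    action = policy.act(state)
                    # 2. Retrieve similar past transitions so the experience
                    #    model's chain-of-thought prediction stays grounded.
                    context = replay_buffer.retrieve(state, action)
                    next_state, reward, done = experience_model.step(
                        state, action, context=context)
                    traj.append((state, action, reward, next_state))
                    state = next_state
                trajectories.append(traj)

        # 3. Synthetic trajectories update the policy (e.g. via PPO/GRPO) and
        #    enrich the replay buffer, so buffer and policy co-evolve.
        policy.update(trajectories)
        replay_buffer.extend(trajectories)
```

Note that no real environment appears anywhere in this loop; all rollouts are generated by the reasoning experience model.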
The experience model is surprisingly data-efficient to train. For WebArena, only 4,800 offline trajectories were needed. For WebShop, 3,600 trajectories sufficed. Each transition is annotated with reasoning traces generated by a strong teacher LLM, then fine-tuned via supervised learning with a joint objective over both the reasoning trace and the predicted next state and reward.
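The paper's exact loss is not reproduced here; as a sketch, a joint objective of this kind could be written (with $s_t, a_t$ the current state and action, $\tau_t$ the teacher-annotated reasoning trace, and $s_{t+1}, r_t$ the next state and reward) as:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(s_t, a_t, \tau_t, s_{t+1}, r_t)}\Big[ -\log p_\theta(\tau_t \mid s_t, a_t) \;-\; \log p_\theta(s_{t+1}, r_t \mid s_t, a_t, \tau_t) \Big],
$$

i.e., token-level cross-entropy over the reasoning trace plus cross-entropy over the state and reward prediction conditioned on that reasoning.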
This ensures the model learns both to generate faithful reasoning and to leverage that reasoning for consistent state prediction.
WebArena represents realistic web navigation across e-commerce, forums, GitLab, and content management systems. Its infrastructure makes large-scale RL training impractical—requiring expensive AWS servers with manual resets and suffering from unreliable evaluation functions.
| Model | Training Method | Success Rate | Real Data Used |
|---|---|---|---|
| Llama-3.2-3B | SFT (baseline) | 6.1% | 20K transitions |
| Llama-3.2-3B | Traditional GRPO | 7.3% | 80K transitions |
| Llama-3.2-3B | DreamGym | 13.3% | 0 |
| Llama-3.1-8B | Traditional PPO | 4.8% | 80K transitions |
| Llama-3.1-8B | DreamGym | 10.9% | 0 |
| Qwen-2.5-7B | DreamGym | 10.0% | 0 |
DreamGym provides the only viable approach for RL-based training on WebArena, delivering 82% to 127% relative improvement over baselines while using zero real environment interactions.
On WebShop (e-commerce reasoning) and ALFWorld (embodied control), DreamGym demonstrates that synthetic training alone can match traditional RL methods:
| Environment | Model | Traditional RL | DreamGym (0 real data) | DreamGym-S2R (5K real data) |
|---|---|---|---|---|
| WebShop | Llama-3.1-8B | 65.0% (GRPO) | 63.9% | 75.0% |
| WebShop | Qwen-2.5-7B | 68.1% (PPO) | 65.0% | 73.7% |
| ALFWorld | Llama-3.1-8B | 70.9% (GRPO) | 66.3% | 75.9% |
| ALFWorld | Qwen-2.5-7B | 81.1% (PPO) | 72.7% | 79.9% |
DreamGym-S2R (sim-to-real) combines synthetic pretraining with limited real-world fine-tuning. Agents first train entirely in DreamGym, acquiring broad knowledge across diverse curriculum tasks, then transfer to real environments for final polish.
This provides a scalable warm-start strategy: bootstrap with cheap synthetic data, then fine-tune with minimal real interactions.
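As a rough illustration (not the paper's code), the recipe amounts to a two-phase pipeline; `synthetic_trainer` and `real_trainer` below stand in for whatever RL loop (e.g. PPO or GRPO) is used in each phase and are hypothetical names:

```python
# Hypothetical sketch of the DreamGym-S2R warm-start recipe.

def dreamgym_s2r(policy, synthetic_trainer, real_trainer, real_budget=5_000):
    # Phase 1: bootstrap entirely on synthetic rollouts from the reasoning
    # experience model, with zero real environment interactions.
    policy = synthetic_trainer(policy)

    # Phase 2: fine-tune with a small budget of real interactions
    # (roughly 5K transitions in the reported S2R results).
    policy = real_trainer(policy, max_transitions=real_budget)
    return policy
```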
DreamGym includes a theoretical analysis proving that policies trained in synthetic environments can achieve guaranteed improvement in real environments, under mild assumptions.
The key insight: performance in the real environment depends on two learnable error terms:

- the consistency error of the experience model's synthesized transitions (how closely its dynamics track the real environment), and
- the error of its reward predictions (how accurate the learning signal is).
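For intuition, error decompositions of this kind typically resemble the classical simulation lemma from model-based RL; the paper's exact statement and constants may differ, but a representative bound reads:

$$
\big|\, J_{\text{real}}(\pi) - J_{\text{syn}}(\pi) \,\big| \;\le\; \frac{\epsilon_r}{1-\gamma} \;+\; \frac{\gamma\, R_{\max}\, \epsilon_T}{(1-\gamma)^2},
$$

where $\epsilon_r$ bounds the reward-prediction error, $\epsilon_T$ bounds the transition-dynamics error, $\gamma$ is the discount factor, and $R_{\max}$ is the maximum reward. Shrinking both error terms during training lets policy improvement in the synthetic environment transfer to the real one.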
Critically, these do NOT require perfect state reconstruction. The synthetic environment needs only to provide domain-consistent transitions and correct learning signals—not pixel-perfect simulation.
This validates the design philosophy: focus on learning-relevant signals, not raw state fidelity.
| Configuration | WebShop Success % | WebArena Success % |
|---|---|---|
| Full DreamGym | 63.9 | 13.3 |
| w/o Experience Replay | 59.2 (-4.7) | 9.7 (-3.6) |
| w/o Experience Reasoning | 55.8 (-8.1) | 7.3 (-6.0) |
| w/o Task Generation | 57.3 (-6.6) | 7.3 (-6.0) |
Key Takeaways:

- Chain-of-thought experience reasoning is the most important component: removing it costs 8.1 points on WebShop and 6.0 points on WebArena.
- Curriculum task generation is nearly as critical, with a 6.6-point drop on WebShop and the same 6.0-point drop on WebArena.
- Experience replay provides a smaller but consistent boost by grounding predictions in past data.
- All three components contribute; the full system is needed for the best results.
How much real data does the experience model need? Surprisingly little: it achieves competitive performance with just 2,000-10,000 offline transitions.
This data efficiency comes from operating in abstract state space—rather than learning raw HTML or pixel dynamics, the model learns high-level semantic transitions.
DreamGym demonstrates remarkable generalization within similar domains: training on one task family transfers to related tasks in the same family.
This suggests DreamGym learns domain-agnostic behavioral priors within similar task families, rather than memorizing task-specific patterns.
Computational Requirements: Experiments used 8 nodes with A100 GPUs and 4 nodes with H100 GPUs. However, the abstract state space makes experience model training surprisingly efficient—most compute goes to policy training, not environment simulation.
Hyperparameters: The framework includes a hyperparameter λ that bounds the proportion of synthetic tasks per iteration, balancing original task coverage with curriculum-driven exploration. Typical values: λ ∈ [0.2, 0.5].
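A minimal sketch of how such a cap might be applied when assembling each iteration's task batch follows; the task representation (dicts with a `success_rate` field) and the selection heuristic are assumptions for illustration, not the paper's implementation:

```python
import math
import random

def reward_entropy(success_rate):
    """Binary entropy of a task's historical success rate; peaks at 0.5."""
    p = min(max(success_rate, 1e-6), 1 - 1e-6)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def build_task_batch(seed_tasks, generated_tasks, batch_size, lam=0.3):
    """Mix curriculum-generated tasks with original seed tasks.

    `lam` caps the fraction of synthetic (generated) tasks per iteration,
    mirroring the lambda hyperparameter described above (typical range 0.2-0.5).
    """
    num_generated = min(int(lam * batch_size), len(generated_tasks))
    # Prefer generated variations whose success rate is near 50%, i.e. the
    # highest reward entropy and thus the most informative for the policy.
    ranked = sorted(generated_tasks,
                    key=lambda t: reward_entropy(t["success_rate"]),
                    reverse=True)
    batch = ranked[:num_generated]
    batch += random.sample(seed_tasks, batch_size - num_generated)
    return batch
```

For example, `build_task_batch(seed_tasks, generated_tasks, batch_size=64, lam=0.3)` keeps roughly 70% original tasks while admitting the most informative generated variations.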
Experience Model Choice: All main results use Llama-3.1-8B as the experience model backbone. Smaller models (3B) work but with reduced performance. Domain-specific pretraining (like WebDreamer for web tasks) helps at low data scales but converges to similar performance with more data.
DreamGym represents a paradigm shift in agent training: moving from expensive real-environment sampling to scalable synthetic experience synthesis. By focusing on learning-relevant signals rather than perfect simulation, it achieves dramatic improvements in both cost and performance.
The framework's success validates a key insight: agent training doesn't require perfectly realistic environments, but rather interaction data that is sufficiently diverse, informative, and causally grounded. This opens the door to training general-purpose agents at scale, previously limited by infrastructure and cost constraints.
For practitioners, the message is clear: invest in experience models for high-cost or non-RL-ready environments. For researchers, the challenge is extending these ideas to universal world models that enable zero-shot adaptation across arbitrary domains. The era of scalable agent learning via experience synthesis has begun.
Chen et al.: "Scaling Agent Learning via Experience Synthesis"
arXiv:2511.03773 [cs.AI], November 2025
Authors: Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh
Affiliations: Meta Superintelligence Labs, FAIR at Meta, University of Chicago, UC Berkeley