ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le,
Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin,
Jiawei Han, Chen-Yu Lee, and Tomas Pfister
University of Illinois Urbana-Champaign, Google Cloud AI Research, Yale University, Google Cloud AI
Executive Summary
This research introduces ReasoningBank, a novel memory framework that enables AI agents to learn from accumulated experience by distilling generalizable reasoning strategies from both successful and failed interactions. Unlike existing approaches that store raw trajectories or only successful routines, ReasoningBank creates structured memory items capturing high-level reasoning patterns. Building on this foundation, the paper introduces Memory-aware Test-Time Scaling (MaTTS), which creates a powerful synergy between memory and computational scaling. Across web browsing and software engineering benchmarks, ReasoningBank demonstrates up to 34.2% relative improvement in effectiveness while reducing interaction steps by 16.0%, establishing memory-driven experience scaling as a new dimension for agent self-evolution.
Research Context & Motivation
As Large Language Model (LLM) agents are increasingly deployed in persistent, long-running roles encountering continuous streams of tasks, a critical limitation emerges: they fail to learn from accumulated interaction history. Current agents approach each task in isolation, repeating past errors and discarding valuable insights. This necessitates building memory-aware agent systems capable of self-evolution through experience accumulation.
Key Contributions
- ReasoningBank Framework: A novel memory system that distills generalizable reasoning strategies from both successful and failed experiences, moving beyond raw trajectories and success-only patterns
- Memory-aware Test-Time Scaling (MaTTS): Two complementary approaches (parallel and sequential scaling) that create synergy between memory quality and test-time computation
- Memory-Driven Experience Scaling: Establishing experience scaling as a new dimension for agents, where better memory guides more effective scaling, while diverse experiences forge stronger memories
- Comprehensive Empirical Validation: Extensive experiments demonstrating effectiveness, efficiency, and emergent behaviors across multiple challenging benchmarks
Methodology
ReasoningBank Architecture
MEMORY SCHEMA (three-component structure):
- Title: Concise identifier summarizing core strategy
- Description: One-sentence summary
- Content: Distilled reasoning steps, decision rationales, and operational insights
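The schema above can be sketched as a small data class; the field names follow the paper's {title, description, content} structure, while the `embedding` field and the example strategy text are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    """One ReasoningBank entry following the three-part schema."""
    title: str        # concise identifier for the core strategy
    description: str  # one-sentence summary
    content: str      # distilled reasoning steps and operational insights
    embedding: list[float] = field(default_factory=list)  # precomputed for retrieval

# Hypothetical example item distilled from a web-browsing trajectory.
item = MemoryItem(
    title="Verify pagination before reporting results",
    description="Check for 'Next Page' links so listings are read to completion.",
    content="Before summarizing a result list, look for pagination controls; "
            "iterate through all pages and only then aggregate findings.",
)
```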
Integration with Agents
- Memory Retrieval: Top-k similarity search using embedding-based retrieval (Gemini-embedding-001) to identify relevant past experiences
- Memory Construction: LLM-as-a-judge labels trajectories as success/failure; different extraction strategies applied for each type
- Memory Consolidation: New memory items incorporated into ReasoningBank, enabling continuous evolution
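The retrieval step reduces to top-k cosine similarity over stored embeddings. A minimal sketch, assuming toy 4-dimensional vectors in place of real Gemini-embedding-001 outputs (the function name and vectors are illustrative):

```python
import numpy as np

def retrieve_top_k(query_vec, memory_vecs, k=3):
    """Return indices of the k most similar memory embeddings (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity of each memory to the query
    return np.argsort(-sims)[:k].tolist()

# Toy embeddings standing in for Gemini-embedding-001 vectors.
memory = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])
hits = retrieve_top_k(query, memory, k=2)  # indices of the two closest memories
```

A production system would back this with a vector store rather than an in-memory array, but the ranking logic is the same.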
Memory-aware Test-Time Scaling (MaTTS)
PARALLEL SCALING: Generate multiple trajectories for the same query, using self-contrast across different outcomes to identify consistent patterns and filter out spurious solutions
SEQUENTIAL SCALING: Iteratively refine reasoning within a single trajectory through self-refinement, using intermediate notes as valuable signals for memory synthesis
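The parallel variant can be sketched as a best-of-N loop followed by a self-contrast step. Here `run_agent` is a stand-in for a real memory-conditioned LLM rollout; its name, its randomized outcomes, and the returned dictionary shape are illustrative assumptions, not the paper's implementation:

```python
import random

def run_agent(query, seed, memory):
    """Stand-in for one agent rollout; a real system would call an LLM agent."""
    random.seed(seed)
    steps = [f"step-{i}" for i in range(random.randint(2, 5))]
    success = random.random() > 0.4
    return {"steps": steps, "success": success}

def matts_parallel(query, memory, k=3):
    """Parallel MaTTS: k rollouts for one query, grouped for self-contrast."""
    trajectories = [run_agent(query, seed=s, memory=memory) for s in range(k)]
    successes = [t for t in trajectories if t["success"]]
    failures = [t for t in trajectories if not t["success"]]
    # Self-contrast: a real system would prompt the LLM with both groups to
    # distill patterns recurring in successes and mistakes recurring in failures.
    return {"n_success": len(successes), "n_failure": len(failures),
            "trajectories": trajectories}

result = matts_parallel("find the cheapest laptop", memory=[], k=3)
```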
Key Findings
ReasoningBank Outperforms Existing Memory Mechanisms
Across all tested models and benchmarks, ReasoningBank consistently outperforms both memory-free agents and existing memory baselines:
- WebArena: +4.6 to +8.3 percentage-point improvement over the memory-free baseline across different LLM backbones
- SWE-Bench-Verified: 38.8% vs 34.2% resolve rate (Gemini-2.5-flash)
- Mind2Web: Particularly strong gains in cross-domain generalization (+0.7 absolute task-level SR improvement)
Learning from Failures is Critical
Ablation studies demonstrate that incorporating failure trajectories significantly boosts performance. ReasoningBank improves from 46.5% (success-only) to 49.7% (with failures), while baselines like AWM actually degrade performance when failures are added (44.4% → 42.2%).
MaTTS Creates Powerful Synergy
Memory-aware test-time scaling establishes a virtuous cycle:
- Better memory → Better scaling: Best-of-N improves from 40.6% (no memory) to 52.4% (ReasoningBank) at k=3
- Better scaling → Better memory: Pass@1 improves from 49.7% to 50.8% with scaling for ReasoningBank, while baselines degrade
Performance Results
WebArena Benchmark Results
| Backbone | Method | Overall SR | Overall Steps | vs No Memory |
|---|---|---|---|---|
| Gemini-2.5-flash | No Memory | 40.5% | 9.7 | — |
| Gemini-2.5-flash | Synapse | 42.1% | 9.2 | +1.6pp |
| Gemini-2.5-flash | AWM | 44.1% | 9.0 | +3.6pp |
| Gemini-2.5-flash | ReasoningBank | 48.8% | 8.3 | +8.3pp |
| Gemini-2.5-pro | No Memory | 46.7% | 8.8 | — |
| Gemini-2.5-pro | ReasoningBank | 53.9% | 7.4 | +7.2pp |
| Claude-3.7-sonnet | No Memory | 41.7% | 8.0 | — |
| Claude-3.7-sonnet | ReasoningBank | 46.3% | 7.3 | +4.6pp |
SWE-Bench-Verified Results
| Backbone | Method | Resolve Rate | Avg Steps | Improvement |
|---|---|---|---|---|
| Gemini-2.5-flash | No Memory | 34.2% | 30.3 | — |
| Gemini-2.5-flash | Synapse | 35.4% | 30.7 | +1.2pp |
| Gemini-2.5-flash | ReasoningBank | 38.8% | 27.5 | +4.6pp |
| Gemini-2.5-pro | No Memory | 54.0% | 21.1 | — |
| Gemini-2.5-pro | ReasoningBank | 57.4% | 19.8 | +3.4pp |
MaTTS Scaling Results (WebArena-Shopping)
| Scaling Factor (k) | Parallel Scaling SR | Sequential Scaling SR | Baseline (No Memory) |
|---|---|---|---|
| k = 1 | 49.7% | 49.7% | 39.0% |
| k = 2 | 50.3% | 51.9% | 39.4% |
| k = 3 | 52.4% | 53.5% | 42.2% |
| k = 4 | 54.0% | 54.0% | 41.7% |
| k = 5 | 55.1% | 54.5% | 42.2% |
Emergent Behaviors & Strategic Evolution
A remarkable finding is that ReasoningBank exhibits emergent self-evolution, where memory items progressively mature through test-time learning:
- Stage 1 - Procedural/Execution: Simple action rules like "actively look for and click on 'Next Page' links"
- Stage 2 - Atomic Self-Reflection: Basic verification strategies like "re-check the element's current identifier"
- Stage 3 - Evolved Adaptive Checks: Systematic strategies like "leverage available search or filter functionalities, ensure completeness before reporting"
- Stage 4 - Generalized Complex Strategies: High-level reasoning like "regularly cross-referencing the current view with task requirements, reassess available options when data doesn't align with expectations"
This evolution resembles reinforcement learning dynamics where agents develop increasingly sophisticated strategies through experience accumulation.
Efficiency Analysis
Breakdown: Successful vs Failed Trajectories
| Domain | No Memory (Success) | ReasoningBank (Success) | Reduction |
|---|---|---|---|
| Shopping | 6.8 steps | 4.7 steps | -2.1 steps (30.9%) |
| Admin | 8.4 steps | 7.0 steps | -1.4 steps (16.7%) |
| Gitlab | 8.6 steps | 7.6 steps | -1.0 steps (11.6%) |
| Reddit | 6.1 steps | 5.0 steps | -1.1 steps (18.0%) |
Key Insight: ReasoningBank achieves particularly pronounced step reductions on successful cases (up to 30.9%), indicating that memory primarily helps agents reach solutions more efficiently by following effective reasoning paths, rather than simply truncating failed attempts.
Comparison with Existing Approaches
Memory Types Comparison
| Approach | Memory Content | Learns from Failures | Abstraction Level |
|---|---|---|---|
| Raw Trajectories (Synapse) | Complete action sequences | ❌ No | Low (specific actions) |
| Workflow Memory (AWM) | Common successful routines | ❌ No | Medium (procedural patterns) |
| ReasoningBank | Distilled reasoning strategies | ✅ Yes | High (generalizable principles) |
Implications for Production Systems
System Design Considerations
- Continuous Learning: Deploy agents in persistent roles where they accumulate experiences across multiple user sessions
- Memory Infrastructure: Implement efficient vector stores with embedding-based retrieval for real-time memory access
- Failure Recovery: Design systems that explicitly learn from failures, not just successes, to build comprehensive reasoning capabilities
- Test-Time Scaling: Allocate additional compute for critical tasks through MaTTS to achieve higher reliability
Performance Budgeting for MaTTS
Organizations can trade compute for reliability using scaling factor k:
- Standard reliability (50%): k = 1 (baseline)
- High reliability (52-53%): k = 2-3 (2-3× compute)
- Critical reliability (54-55%): k = 4-5 (4-5× compute)
Note: Returns begin to diminish around k = 5, suggesting a practical scaling range of 2-5× for most applications.
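Picking a budget then reduces to choosing the smallest k whose measured success rate meets the reliability target. A minimal sketch using the WebArena-Shopping parallel-scaling numbers reported above (the function name and lookup-table approach are illustrative, not from the paper):

```python
# Measured parallel-scaling SR per scaling factor k (WebArena-Shopping).
PARALLEL_SR = {1: 0.497, 2: 0.503, 3: 0.524, 4: 0.540, 5: 0.551}

def smallest_k_for(target_sr):
    """Return the smallest scaling factor whose measured SR meets the target."""
    for k in sorted(PARALLEL_SR):
        if PARALLEL_SR[k] >= target_sr:
            return k
    return None  # target not reachable within the measured range

k_needed = smallest_k_for(0.52)  # smallest k with SR >= 52% in the table
```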
Technical Architecture Insights
Memory Extraction Pipeline
- Trajectory Completion: Agent executes task and generates trajectory
- Self-Judgment: LLM-as-a-judge evaluates success/failure without ground truth
- Strategy Extraction:
- Success trajectories: Extract validated strategies and effective patterns
- Failure trajectories: Extract counterfactual signals and preventative lessons
- Structured Storage: Store as {title, description, content} with pre-computed embeddings
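The judge-then-branch logic of this pipeline can be sketched as follows; `judge_trajectory` is a stand-in for the actual LLM-as-a-judge call, and the trajectory dictionary keys (`task`, `goal_reached`, `actions`) are hypothetical names for illustration:

```python
def judge_trajectory(trajectory):
    """Stand-in judge: a real system prompts an LLM, without ground-truth labels."""
    return trajectory.get("goal_reached", False)

def extract_memory(trajectory):
    """Branch extraction on the judged outcome, per the pipeline above."""
    if judge_trajectory(trajectory):
        kind = "validated strategy"    # what worked and why
    else:
        kind = "preventative lesson"   # counterfactual: what to avoid next time
    # A real system would have an LLM distill the strategy text; here we just
    # assemble the {title, description, content} record.
    return {
        "title": f"{kind}: {trajectory['task'][:40]}",
        "description": f"Distilled {kind} from a judged trajectory.",
        "content": " -> ".join(trajectory["actions"]),
    }

mem = extract_memory({"task": "update repo settings", "goal_reached": False,
                      "actions": ["open settings", "edit field", "forgot to save"]})
```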
Self-Contrast in Parallel Scaling
Instead of relying on external quality judges, MaTTS guides the model to directly compare multiple trajectories for the same query, identifying:
- Patterns that consistently lead to success
- Mistakes that cause failure
- Contrastive signals that distinguish good from bad reasoning
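Assembling such a self-contrast query might look like the sketch below; the prompt wording and trajectory format are illustrative assumptions, not the paper's exact prompt:

```python
def build_contrast_prompt(query, trajectories):
    """Assemble a self-contrast prompt over multiple rollouts of one query."""
    parts = [f"Task: {query}", "Compare the attempts below."]
    for i, t in enumerate(trajectories, 1):
        outcome = "SUCCESS" if t["success"] else "FAILURE"
        parts.append(f"Attempt {i} [{outcome}]: " + "; ".join(t["steps"]))
    parts.append("Identify patterns shared by the successes, mistakes shared by "
                 "the failures, and signals that distinguish the two.")
    return "\n".join(parts)

prompt = build_contrast_prompt(
    "find order status",
    [{"success": True, "steps": ["search order id", "open order page"]},
     {"success": False, "steps": ["browse all orders", "give up"]}],
)
```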
Limitations & Future Directions
Current Limitations
- Simple Retrieval Mechanism: Uses basic embedding similarity; could benefit from more sophisticated multi-hop reasoning-based retrieval
- No Memory Composition: Treats memory items independently; doesn't explore compositional strategies
- LLM-as-Judge Dependency: Success/failure signals from self-evaluation may introduce noise in ambiguous cases
- Minimal Consolidation: Simply appends new memories; lacks sophisticated pruning or merging strategies
Future Research Opportunities
- Hierarchical Memory Architecture: Integrate episodic (per-task), working (within-session), and long-term (consolidated) memory tiers
- Compositional Memory: Enable combining multiple memory items into higher-level macro-strategies
- Advanced Retrieval: Implement reasoning-intensive controllers that decompose queries and plan multi-hop lookups
- Memory Management: Develop automatic pruning, merging, and decay policies for scalable long-term deployment
Conclusions
ReasoningBank represents a significant advance in building self-evolving AI agents through memory-driven experience accumulation. Key takeaways:
- Memory Content Matters: Distilling high-level reasoning strategies from both successes and failures outperforms storing raw trajectories or success-only patterns
- Memory Enables Scaling: MaTTS creates powerful synergy where better memory guides more effective test-time scaling, while scaling generates richer experiences for better memory
- Emergent Self-Evolution: Agents exhibit progressively sophisticated reasoning strategies over time, evolving from procedural actions to complex adaptive reasoning
- Practical Benefits: Consistent improvements in both effectiveness (up to 34.2% relative) and efficiency (16.0% fewer steps), with particularly strong gains in generalization scenarios
- New Scaling Dimension: Establishes memory-driven experience scaling as a complementary dimension to parameter scaling and test-time compute scaling
For practitioners building AI agent systems:
- Implement memory systems that capture reasoning strategies, not just action sequences
- Design for learning from failures, not just successes
- Consider test-time scaling with memory-aware approaches for critical tasks
- Architect for continuous self-evolution in persistent deployment scenarios
ReasoningBank opens pathways toward truly adaptive, lifelong-learning agents that improve through experience accumulation, bringing us closer to autonomous systems that naturally evolve and enhance their capabilities over time.
References
- Ouyang, S., Yan, J., Hsu, I-H., et al. "ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory." arXiv preprint arXiv:2509.25140v1 [cs.AI], September 2025.
- Benchmarks: WebArena (Zhou et al., 2024), Mind2Web (Deng et al., 2023), SWE-Bench-Verified (Jimenez et al., 2024)
- Baselines: Synapse (Zheng et al., 2024), Agent Workflow Memory/AWM (Wang et al., 2025d)
- Related Work: Test-Time Scaling (Snell et al., 2025), Self-Refinement (Madaan et al., 2023)
Report compiled for AI Agent Engineering Research Collection