ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le,
Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin,
Jiawei Han, Chen-Yu Lee, and Tomas Pfister
University of Illinois Urbana-Champaign, Google Cloud AI Research, Yale University, Google Cloud AI

Executive Summary

This research introduces ReasoningBank, a novel memory framework that enables AI agents to learn from accumulated experience by distilling generalizable reasoning strategies from both successful and failed interactions. Unlike existing approaches that store raw trajectories or only successful routines, ReasoningBank creates structured memory items capturing high-level reasoning patterns. Building on this foundation, the paper introduces Memory-aware Test-Time Scaling (MaTTS), which creates a powerful synergy between memory and computational scaling. Across web browsing and software engineering benchmarks, ReasoningBank demonstrates up to 34.2% relative improvement in effectiveness while reducing interaction steps by 16.0%, establishing memory-driven experience scaling as a new dimension for agent self-evolution.

Research Context & Motivation

As Large Language Model (LLM) agents are increasingly deployed in persistent, long-running roles encountering continuous streams of tasks, a critical limitation emerges: they fail to learn from accumulated interaction history. Current agents approach each task in isolation, repeating past errors and discarding valuable insights. This necessitates building memory-aware agent systems capable of self-evolution through experience accumulation.

Key Contributions

  1. ReasoningBank Framework: A novel memory system that distills generalizable reasoning strategies from both successful and failed experiences, moving beyond raw trajectories and success-only patterns
  2. Memory-aware Test-Time Scaling (MaTTS): Two complementary approaches (parallel and sequential scaling) that create synergy between memory quality and test-time computation
  3. Memory-Driven Experience Scaling: Establishing experience scaling as a new dimension for agents, where better memory guides more effective scaling, while diverse experiences forge stronger memories
  4. Comprehensive Empirical Validation: Extensive experiments demonstrating effectiveness, efficiency, and emergent behaviors across multiple challenging benchmarks

Methodology

ReasoningBank Architecture

MEMORY SCHEMA Each memory item is a three-component structure of title, description, and content, stored with a pre-computed embedding for retrieval
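A minimal sketch of one memory item, assuming the {title, description, content} schema described later in this report; the field comments and the `embedding` field are illustrative assumptions, not the paper's exact data model:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    """One distilled reasoning strategy (not a raw trajectory)."""
    title: str        # short, human-readable name of the strategy
    description: str  # one-line summary used during retrieval
    content: str      # the generalizable reasoning pattern itself
    embedding: list = field(default_factory=list)  # pre-computed vector for similarity search
```

Storing the distilled strategy rather than the full action sequence is what lets a single item transfer across superficially different tasks.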

Integration with Agents

  1. Memory Retrieval: Top-k similarity search using embedding-based retrieval (Gemini-embedding-001) to identify relevant past experiences
  2. Memory Construction: LLM-as-a-judge labels trajectories as success/failure; different extraction strategies applied for each type
  3. Memory Consolidation: New memory items incorporated into ReasoningBank, enabling continuous evolution
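The retrieval step above can be sketched with plain cosine similarity over pre-computed vectors; the report names Gemini-embedding-001 as the embedding model, which is abstracted away here (`bank` is assumed to hold (item, embedding) pairs):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_embedding, bank, k=3):
    """bank: list of (memory_item, embedding) pairs.
    Returns the k items most similar to the query embedding."""
    ranked = sorted(bank, key=lambda pair: cosine(query_embedding, pair[1]), reverse=True)
    return [item for item, _ in ranked[:k]]
```

Consolidation then amounts to appending newly distilled items (with their embeddings) to `bank`, so each solved task enlarges the pool available to the next one.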

Memory-aware Test-Time Scaling (MaTTS)

PARALLEL SCALING Generate multiple trajectories for the same query, using self-contrast across different outcomes to identify consistent patterns and filter spurious solutions

SEQUENTIAL SCALING Iteratively refine reasoning within a single trajectory through self-refinement, using intermediate notes as valuable signals for memory synthesis
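The two scaling modes can be sketched as follows; `agent`, `contrast`, and `refine` are hypothetical placeholder callables standing in for LLM calls, not interfaces from the paper:

```python
def parallel_scaling(agent, contrast, task, memories, k=3):
    """Roll out k trajectories for the same task, then let the model
    contrast them to distill consistent strategies (MaTTS, parallel)."""
    trajectories = [agent(task, memories) for _ in range(k)]
    return contrast(task, trajectories)  # -> distilled memory items

def sequential_scaling(agent, refine, task, memories, k=3):
    """Refine a single trajectory k times; the intermediate notes become
    extra signal for memory synthesis (MaTTS, sequential)."""
    trajectory, notes = agent(task, memories), []
    for _ in range(k - 1):
        trajectory, note = refine(task, trajectory)
        notes.append(note)
    return trajectory, notes
```

Both variants spend the same budget knob k, but on breadth (independent rollouts) versus depth (repeated refinement of one rollout).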

Key Findings

ReasoningBank Outperforms Existing Memory Mechanisms

Across all tested models and benchmarks, ReasoningBank consistently outperforms both memory-free agents and existing memory baselines such as Synapse and AWM (see the Performance Results tables below).

Learning from Failures is Critical

Ablation studies demonstrate that incorporating failure trajectories significantly boosts performance. ReasoningBank improves from 46.5% (success-only) to 49.7% (with failures), while baselines like AWM actually degrade performance when failures are added (44.4% → 42.2%).

MaTTS Creates Powerful Synergy

Memory-aware test-time scaling establishes a virtuous cycle: better memory steers exploration toward more effective scaling, while the diverse experiences generated by scaling forge stronger memories.

Performance Results

WebArena Benchmark Results

Method              Overall SR   Overall Steps   vs No Memory
Gemini-2.5-flash
  No Memory           40.5%          9.7             -
  Synapse             42.1%          9.2           +1.6pp
  AWM                 44.1%          9.0           +3.6pp
  ReasoningBank       48.8%          8.3           +8.3pp
Gemini-2.5-pro
  No Memory           46.7%          8.8             -
  ReasoningBank       53.9%          7.4           +7.2pp
Claude-3.7-sonnet
  No Memory           41.7%          8.0             -
  ReasoningBank       46.3%          7.3           +4.6pp

SWE-Bench-Verified Results

Method              Resolve Rate   Avg Steps   Improvement
Gemini-2.5-flash
  No Memory             34.2%        30.3          -
  Synapse               35.4%        30.7        +1.2pp
  ReasoningBank         38.8%        27.5        +4.6pp
Gemini-2.5-pro
  No Memory             54.0%        21.1          -
  ReasoningBank         57.4%        19.8        +3.4pp

MaTTS Scaling Results (WebArena-Shopping)

Scaling Factor (k)   Parallel Scaling SR   Sequential Scaling SR   Baseline (No Memory)
k = 1                      49.7%                 49.7%                  39.0%
k = 2                      50.3%                 51.9%                  39.4%
k = 3                      52.4%                 53.5%                  42.2%
k = 4                      54.0%                 54.0%                  41.7%
k = 5                      55.1%                 54.5%                  42.2%

Emergent Behaviors & Strategic Evolution

A remarkable finding is that ReasoningBank exhibits emergent self-evolution, where memory items progressively mature through test-time learning:

  1. Stage 1 - Procedural/Execution: Simple action rules like "actively look for and click on 'Next Page' links"
  2. Stage 2 - Atomic Self-Reflection: Basic verification strategies like "re-check the element's current identifier"
  3. Stage 3 - Evolved Adaptive Checks: Systematic strategies like "leverage available search or filter functionalities, ensure completeness before reporting"
  4. Stage 4 - Generalized Complex Strategies: High-level reasoning like "regularly cross-referencing the current view with task requirements, reassess available options when data doesn't align with expectations"

This evolution resembles reinforcement learning dynamics where agents develop increasingly sophisticated strategies through experience accumulation.

Efficiency Analysis

Breakdown: Successful vs Failed Trajectories

Domain     No Memory (Success)   ReasoningBank (Success)   Reduction
Shopping        6.8 steps               4.7 steps          -2.1 steps (30.9%)
Admin           8.4 steps               7.0 steps          -1.4 steps (16.7%)
Gitlab          8.6 steps               7.6 steps          -1.0 steps (11.6%)
Reddit          6.1 steps               5.0 steps          -1.1 steps (18.0%)

Key Insight: ReasoningBank achieves particularly pronounced step reductions on successful cases (up to 30.9%), indicating that memory primarily helps agents reach solutions more efficiently by following effective reasoning paths, rather than simply truncating failed attempts.

Comparison with Existing Approaches

Memory Types Comparison

Approach                     Memory Content                   Learns from Failures   Abstraction Level
Raw Trajectories (Synapse)   Complete action sequences        ❌ No                  Low (specific actions)
Workflow Memory (AWM)        Common successful routines       ❌ No                  Medium (procedural patterns)
ReasoningBank                Distilled reasoning strategies   ✅ Yes                 High (generalizable principles)

Implications for Production Systems

System Design Considerations

Performance Budgeting for MaTTS

Organizations can trade compute for reliability via the scaling factor k: larger k buys higher success rates at proportionally higher rollout cost (see the MaTTS scaling table above).

Note: Returns begin to diminish after k = 5, suggesting optimal scaling range of 2-5× for most applications.
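A toy budgeting helper for this trade-off, assuming a fixed per-rollout step cost; the cap at k = 5 reflects the diminishing returns noted above, and everything else is illustrative:

```python
def choose_scaling_factor(step_budget, avg_steps_per_rollout, k_max=5):
    """Largest scaling factor k whose total rollout steps fit the budget,
    capped at k_max since returns diminish beyond roughly k = 5."""
    affordable = step_budget // avg_steps_per_rollout
    return max(1, min(int(affordable), k_max))
```

In practice the cap and the per-rollout cost would be measured per benchmark rather than fixed constants.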

Technical Architecture Insights

Memory Extraction Pipeline

  1. Trajectory Completion: Agent executes task and generates trajectory
  2. Self-Judgment: LLM-as-a-judge evaluates success/failure without ground truth
  3. Strategy Extraction: Apply outcome-specific prompts, distilling transferable strategies from successes and cautionary lessons from failures
  4. Structured Storage: Store as {title, description, content} with pre-computed embeddings
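Steps 2-4 of the pipeline above can be sketched as a routing function; `judge`, `extract_success`, and `extract_failure` are hypothetical LLM-backed callables standing in for the prompts the paper uses:

```python
def extract_memories(trajectory, judge, extract_success, extract_failure):
    """Judge a finished trajectory without ground truth, apply the
    outcome-specific extraction strategy, and tag items for storage."""
    verdict = judge(trajectory)  # "success" or "failure"
    extract = extract_success if verdict == "success" else extract_failure
    return [dict(item, outcome=verdict) for item in extract(trajectory)]
```

The key design point is the branch: failures are not discarded but routed through their own extraction prompt, which is what enables learning from mistakes.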

Self-Contrast in Parallel Scaling

Instead of relying on external quality judges, MaTTS guides the model to directly compare multiple trajectories for the same query, identifying consistent patterns that recur across attempts and spurious steps to filter out.
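One way to realize this self-contrast is a single comparison prompt over all k trajectories; the wording below is an illustrative assumption, not the paper's actual prompt:

```python
def self_contrast_prompt(task, trajectories):
    """Build one prompt asking the model itself to compare k trajectories
    for the same query, separating consistent strategies from spurious steps."""
    numbered = "\n\n".join(
        f"Trajectory {i + 1}:\n{t}" for i, t in enumerate(trajectories)
    )
    return (
        f"Task: {task}\n\n{numbered}\n\n"
        "Compare the trajectories above. List (a) strategies that recur "
        "across successful attempts and (b) spurious steps to avoid."
    )
```

Because the comparison happens inside one context window, the model can exploit agreement across rollouts that no single-trajectory judge would see.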

Limitations & Future Directions

Current Limitations

Future Research Opportunities

Conclusions

ReasoningBank represents a significant advance in building self-evolving AI agents through memory-driven experience accumulation.

For practitioners building AI agent systems:

  1. Implement memory systems that capture reasoning strategies, not just action sequences
  2. Design for learning from failures, not just successes
  3. Consider test-time scaling with memory-aware approaches for critical tasks
  4. Architect for continuous self-evolution in persistent deployment scenarios

ReasoningBank opens pathways toward truly adaptive, lifelong-learning agents that improve through experience accumulation, bringing us closer to autonomous systems that naturally evolve and enhance their capabilities over time.


Report compiled for AI Agent Engineering Research Collection
