Agentic Context Engineering:
Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong,
Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li,
Urmish Thakker, James Zou, Kunle Olukotun
Stanford University, SambaNova Systems, UC Berkeley
October 2025

Executive Summary

This paper introduces ACE (Agentic Context Engineering), an approach to context adaptation that treats prompts as "evolving playbooks" rather than static templates. Unlike existing methods that compress knowledge into brief summaries, ACE accumulates detailed strategies and domain insights through cycles of generation, reflection, and curation, achieving a +10.6% average improvement on agent tasks and +17.1% on online adaptation challenges.

The key innovation is ACE's incremental delta update mechanism, which cuts adaptation latency by 82.3% versus GEPA and token cost by 83.6% versus Dynamic Cheatsheet. Most notably, ACE matches the performance of a top-ranked production GPT-4.1 agent while using smaller open-source models, demonstrating that effective context engineering can compensate for raw model scale.

🎯 ELI5: Living Documentation

Imagine if your instruction manual could learn from every mistake and success, automatically adding new tips and removing outdated advice. That's ACE—it treats AI prompts like living documents that evolve with experience. Instead of rewriting the entire manual each time (which loses details), ACE adds small "sticky notes" with lessons learned, eventually organizing them into a comprehensive playbook that gets better with every use.

Part 1: The Context Collapse Problem

Current approaches to context adaptation suffer from two critical failures that ACE addresses:

The Twin Failures of Context Optimization

  1. Brevity Bias: Optimization algorithms inherently favor shorter prompts, systematically removing domain-specific details that are crucial for specialized tasks. A prompt that starts with detailed financial calculation rules gradually degrades to generic "be accurate with numbers."
  2. Context Collapse: When LLMs rewrite entire contexts, they compress rich procedural knowledge into vague summaries. Specific error handling strategies like "retry API calls with exponential backoff starting at 1 second" become meaningless platitudes like "handle errors appropriately."

These failures become particularly acute in domain-specific applications. Financial reasoning tasks that require understanding of XBRL (eXtensible Business Reporting Language) standards see performance drop by 15-25% when contexts are "optimized" using traditional methods. The optimization process strips away the very details that make the context valuable.

Figure 1: Context Collapse - Monolithic LLM rewriting collapses detailed context into shorter, less informative summaries, causing sharp performance drops.

Why Traditional Approaches Fail

Traditional prompt optimization methods like MIPROv2 and GEPA treat contexts as monolithic objects to be rewritten wholesale. This approach has several fundamental flaws:

Problems with Monolithic Rewrites

  1. Wholesale rewrites risk dropping details the optimizer cannot tell are load-bearing, producing the context collapse described above.
  2. Optimization objectives reward brevity, so domain-specific rules are steadily traded for generic advice.
  3. Rewriting the full context on every update costs tokens and latency that grow with context length, overhead that ACE's incremental deltas avoid.

Part 2: The ACE Architecture

ACE introduces a three-component architecture that treats contexts as collections of discrete, versioned insights rather than monolithic documents:

Figure 2: The ACE Framework - Three specialized components (Generator, Reflector, and Curator) work together to evolve contexts through experience.
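
Read as a loop, one adaptation step ties the three components together roughly as follows (a minimal sketch; the interfaces and method names are assumptions, not the authors' code):

    # One ACE adaptation step, sketched with hypothetical component interfaces.
    def ace_step(task, context, generator, reflector, curator):
        # Generator: attempt the task, keeping the full reasoning trajectory.
        trace = generator.run(task, context)
        # Reflector: mine the trajectory for specific, actionable insights.
        insights = reflector.extract_insights(trace)
        # Curator: fold the insights into the context as incremental deltas.
        return curator.apply_delta(context, insights)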

Component 1: The Generator

The Generator executes tasks using the current context, producing detailed reasoning trajectories. Unlike traditional approaches that capture only final outputs, the Generator records intermediate reasoning steps, the tool calls it makes and their results, and the errors it hits along the way, together with how it recovered from them.

This rich execution trace provides the raw material for learning. A single task might generate dozens of potential insights about what works, what fails, and under what conditions.
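
The article doesn't pin down a trace format; a minimal sketch of the kind of record the Generator might emit (all field names here are assumptions) could look like:

    from dataclasses import dataclass, field

    # Hypothetical trace record: the text above says what gets captured,
    # not this exact schema.
    @dataclass
    class ExecutionTrace:
        task_id: str
        steps: list[str] = field(default_factory=list)        # reasoning steps
        tool_calls: list[dict] = field(default_factory=list)  # call, args, result
        errors: list[str] = field(default_factory=list)       # failures and recoveries
        final_output: str = ""
        success: bool = False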

Component 2: The Reflector

The Reflector analyzes execution traces to extract actionable insights. This isn't simple pattern matching; it's a structured analysis process that identifies several kinds of learning signal:

Reflection Categories

Error patterns: recurring failure modes and their likely root causes
Successful strategies: approaches that worked and the conditions under which they hold
Domain facts: task-specific rules and constraints surfaced during execution

The Reflector uses iterative refinement, generating multiple candidate insights and filtering them for specificity, actionability, and non-redundancy. Generic observations like "be careful with calculations" are rejected in favor of specific guidance like "XBRL percentage fields must be divided by 100 before display."
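
As an illustration of that filtering step (not the authors' implementation; the heuristics and thresholds below are stand-ins), a filter might reject candidates that are too generic, too short to be actionable, or near-duplicates of existing bullets:

    GENERIC_PHRASES = ("be careful", "be accurate", "handle errors appropriately")

    def keep_insight(candidate: str, existing: list[str]) -> bool:
        text = candidate.lower()
        if any(p in text for p in GENERIC_PHRASES):
            return False                     # reject generic platitudes
        if len(text.split()) < 6:
            return False                     # too short to be actionable
        if any(_overlap(text, e.lower()) > 0.8 for e in existing):
            return False                     # near-duplicate of an existing bullet
        return True

    def _overlap(a: str, b: str) -> float:
        # Word-overlap ratio as a cheap stand-in for semantic similarity.
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / max(1, min(len(wa), len(wb)))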

Component 3: The Curator

The Curator manages the growing collection of insights, implementing ACE's key innovation: incremental delta updates. Rather than rewriting the entire context, the Curator appends new insights as deltas, merges semantic duplicates, updates utility counters, and prunes deprecated entries:

Context_new = Context_old ∪ Δ(insights) - Deprecated_insights

Each insight is stored as a structured bullet with metadata:

Delta Entry Structure
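
The article names the helpful/harmful counters (see the utility guideline in Part 7) but not a full schema; a minimal sketch of an entry, together with the merge rule from the equation above, might be:

    from dataclasses import dataclass

    @dataclass
    class DeltaEntry:
        entry_id: str             # stable ID so later deltas can reference it
        content: str              # the insight itself, e.g. an XBRL handling rule
        helpful_count: int = 0    # times the entry contributed to a success
        harmful_count: int = 0    # times the entry contributed to a failure
        deprecated: bool = False  # marked for removal by the Curator

    def apply_delta(context: list[DeltaEntry],
                    delta: list[DeltaEntry]) -> list[DeltaEntry]:
        # Context_new = Context_old ∪ Δ(insights) - Deprecated_insights
        merged = {e.entry_id: e for e in context}
        for e in delta:
            merged[e.entry_id] = e  # add new entries, replace updated ones
        return [e for e in merged.values() if not e.deprecated]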

Part 3: Performance Analysis

ACE was evaluated on three challenging benchmark suites that test different aspects of agent capabilities:

Figure 3: Overall Performance Results - ACE consistently outperforms baseline methods across agent tasks and domain-specific reasoning benchmarks.
Benchmark   Task Type                         Baseline   ACE      Improvement (relative)
AppWorld    API understanding & interaction   41.2%      48.3%    +17.1%
FiNER       Financial reasoning with XBRL     62.8%      68.2%    +8.6%
Formula     Complex calculations              71.4%      77.6%    +8.7%

Efficiency Gains

Beyond accuracy improvements, ACE demonstrates remarkable efficiency advantages:

Metric               vs. GEPA   vs. Dynamic Cheatsheet   vs. MIPROv2
Adaptation Latency   -82.3%     -91.5%                   -76.2%
Token Cost           -71.4%     -83.6%                   -68.9%
Rollouts Required    -75.1%     -62.3%                   -70.8%

Part 4: Scaling Dynamics

One of ACE's most surprising findings is how context quality scales with experience:

The Compound Learning Effect

Unlike traditional methods that plateau quickly, ACE shows compound improvements over time, with gains continuing to accrue even after processing 1,000 tasks.

This compound effect occurs because insights build on each other. Early insights might identify common failure modes, middle-stage insights develop workarounds, and late-stage insights optimize these workarounds for efficiency.

Figure 4: ACE-Generated Context Example from AppWorld - Shows detailed, domain-specific insights and usable code accumulated as a comprehensive playbook.

Context Evolution Patterns

Analysis of context evolution reveals distinct phases:

Context Maturity Stages

Tasks 1-100: Basic pattern recognition, identifying obvious failure modes
Tasks 100-300: Strategy development, finding successful approaches
Tasks 300-600: Refinement, optimizing strategies and handling edge cases
Tasks 600-1000: Consolidation, merging related insights and removing redundancy
Tasks 1000+: Expertise, handling rare scenarios and optimizing for speed

Part 5: Ablation Studies

To understand which components contribute most to ACE's performance, systematic ablations were performed:

Configuration              AppWorld   FiNER    Impact
Full ACE                   48.3%      68.2%    Baseline
Without Reflector          44.1%      64.7%    -8.7%
Without Delta Updates      45.2%      65.3%    -6.4%
Without Utility Tracking   47.1%      67.4%    -2.5%
Single-Epoch Reflection    46.8%      66.9%    -3.1%

Critical Components

The ablation study reveals that the Reflector is the most critical component, contributing nearly 9% to overall performance. This suggests that the quality of insight extraction matters more than the mechanism of storage (delta updates) or tracking (utility counts).

Part 6: Comparison with Fine-Tuning

A natural question is whether ACE's benefits could be achieved through traditional fine-tuning. The paper presents a comparison that favors ACE on several axes:

ACE Advantages Over Fine-Tuning

No gradient updates: adaptation happens entirely in the context, avoiding the cost and complexity of training runs
Interpretability: the evolved context is human-readable and can be inspected, edited, or rolled back (see the version-control guideline in Part 7)
Deployment flexibility: the approach works with locally hosted open-source models, which matters for the privacy-constrained settings discussed below

Perhaps most importantly, ACE with Llama-3-70B matches GPT-4.1's performance on AppWorld, demonstrating that sophisticated context engineering can close the gap between model sizes. This has profound implications for organizations that can't afford cutting-edge models or have privacy requirements preventing API usage.

Figure 5: AppWorld Leaderboard (September 20, 2025) - ACE with Llama-3-70B achieves competitive performance with top production systems.

Part 7: Implementation Considerations

Production Deployment Guidelines

Based on the paper's findings, several best practices emerge for deploying ACE in production:

Implementation Checklist

  1. Start Small: Begin with 10-20 seed insights rather than empty context
  2. Batch Processing: Accumulate 10-20 task executions before reflection for efficiency
  3. Periodic Cleanup: Run de-duplication every 200-300 insights
  4. Monitor Utility: Remove insights with harmful_count > helpful_count * 2 (see the sketch after this list)
  5. Version Control: Track context evolution for debugging and rollback
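
As a rough sketch of guidelines 3 and 4 (reusing the hypothetical DeltaEntry structure from Part 2; the pruning threshold comes from the checklist, the de-duplication heuristic is an assumption):

    def prune_and_dedupe(context: list[DeltaEntry]) -> list[DeltaEntry]:
        # Guideline 4: drop insights that hurt more than twice as often
        # as they help.
        kept = [e for e in context if e.harmful_count <= e.helpful_count * 2]

        # Guideline 3: naive de-duplication on normalized content; a real
        # deployment would likely use embedding similarity instead.
        seen: set[str] = set()
        deduped = []
        for e in kept:
            key = " ".join(e.content.lower().split())
            if key not in seen:
                seen.add(key)
                deduped.append(e)
        return deduped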

Computational Requirements

ACE's resource requirements are surprisingly modest. It runs on the same infrastructure as standard LLM deployments, requiring only ~60% additional tokens during the learning phase and dropping to near-zero overhead once the context stabilizes.

Part 8: Limitations and Future Work

Current Limitations

Known Constraints

Context growth: the playbook keeps expanding, so ACE needs long-context models and pays a token premium until the context stabilizes
Feedback dependence: insight quality hinges on the Reflector having informative execution traces to analyze
Domain specificity: accumulated insights do not yet transfer across tasks, which motivates the cross-task direction below

Future Research Directions

The paper identifies several promising avenues for future work:

  1. Hierarchical Contexts: Multi-level organization with general → specific insights
  2. Cross-Task Transfer: Learning meta-strategies that apply across domains
  3. Active Learning: Deliberately seeking tasks that maximize learning rate
  4. Multi-Agent Contexts: Shared knowledge bases for agent teams

Conclusion

Agentic Context Engineering represents a paradigm shift in how we think about LLM adaptation. By treating contexts as evolving repositories of structured knowledge rather than static prompts, ACE achieves remarkable improvements in both performance and efficiency.

The key insights from this work are that incremental delta updates avoid the context collapse and brevity bias that plague monolithic rewrites; that separating generation, reflection, and curation produces higher-quality insights than any single step alone; and that a well-evolved context can let smaller open-source models match much larger production agents.

For practitioners, ACE offers a practical path to continuous improvement without the complexity and cost of fine-tuning. For researchers, it opens new questions about knowledge representation, transfer learning, and the fundamental nature of in-context learning.

The future of LLM applications may not lie in ever-larger models, but in ever-smarter contexts that accumulate and organize the lessons of experience.

Primary Source

Zhang, Q., Hu, C., Upasani, S., Ma, B., Hong, F., Kamanuru, V., Rainton, J., Wu, C., Ji, M., Li, H., Thakker, U., Zou, J., and Olukotun, K. "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models." Stanford University, SambaNova Systems, and UC Berkeley, October 2025. A framework for evolving contexts through generation, reflection, and curation cycles.