AgentFlow: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Pan Lu, Shaoguang Mao, Qiang Zhang, Chao Du, Kaili Li, Wenhu Chen, Jian Lu
UCLA, Shanghai AI Lab, Tsinghua University, University of Waterloo
Executive Summary
This research introduces AgentFlow, a trainable agentic framework that addresses fundamental limitations in current LLM-based reasoning approaches. Unlike monolithic policies that interleave thoughts and tool calls, AgentFlow decomposes work across four specialized modules (planner, executor, verifier, generator) and optimizes them directly within live multi-turn interactions. The framework introduces Flow-based Group Refined Policy Optimization (Flow-GRPO), a novel training method that tackles long-horizon sparse-reward challenges by converting multi-turn optimization into tractable single-turn updates. Across ten benchmarks spanning search, agentic, mathematical, and scientific tasks, AgentFlow with a 7B-scale backbone achieves average accuracy gains of 14.9% on search tasks, 14.0% on agentic tasks, 14.5% on mathematical tasks, and 4.1% on scientific tasks, surpassing even larger proprietary models like GPT-4o.
Research Context & Motivation
Outcome-driven reinforcement learning has significantly advanced reasoning capabilities in large language models, but prevailing tool-augmented approaches face critical limitations:
- Monolithic Architecture: Single policies interleave thoughts and tool calls under full context, which scales poorly with long horizons and diverse tools
- Weak Generalization: Training approaches struggle to generalize to new scenarios with different tool sets or task requirements
- Static Agentic Systems: Existing multi-module agent frameworks remain largely training-free or rely on offline training disconnected from live interaction dynamics
- Credit Assignment Challenge: Long-horizon, sparse-reward environments make it difficult to attribute success or failure to specific planning decisions
Key Contributions
- AgentFlow Architecture: A modular framework coordinating planner, executor, verifier, and generator through evolving memory, with in-the-flow optimization of the planner module
- Flow-GRPO Training Method: Novel approach converting multi-turn optimization into tractable single-turn policy updates through trajectory-level outcome broadcasting and group-normalized advantages
- Comprehensive Benchmarking: Extensive evaluation across 10 diverse benchmarks demonstrating consistent improvements in planning quality, tool-calling reliability, and positive scaling properties
- Open-Source Implementation: Complete framework enabling practical deployment of trainable agentic systems at 7B scale
Methodology
AgentFlow Architecture
FOUR-MODULE DESIGN
- Planner: Generates high-level plan based on task and current memory state (trainable)
- Executor: Executes planned actions and tool calls in the environment
- Verifier: Evaluates execution outcomes and determines if replanning is needed
- Generator: Produces final answer based on accumulated information
Memory Component: Maintains evolving state across turns, storing observations, verifications, and intermediate results to inform future planning decisions
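The control flow can be pictured as a small loop over these modules. Below is a minimal sketch, assuming hypothetical `planner`, `executor`, `verifier`, and `generator` callables and a simple list-backed memory; the class and method names are illustrative, not the framework's actual API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the AgentFlow loop; names are hypothetical,
# not the framework's actual API.

@dataclass
class Memory:
    """Evolving state shared across turns."""
    records: list = field(default_factory=list)

    def add(self, turn, plan, result, verdict):
        self.records.append(
            {"turn": turn, "plan": plan, "result": result, "verdict": verdict}
        )

    def as_context(self):
        return "\n".join(str(r) for r in self.records)


def run_agentflow(task, planner, executor, verifier, generator, max_turns=10):
    memory = Memory()
    for turn in range(max_turns):
        plan = planner(task, memory.as_context())      # trainable module
        result = executor(plan)                        # tool calls / actions
        verdict = verifier(task, plan, result, memory.as_context())
        memory.add(turn, plan, result, verdict)
        if verdict.get("done"):                        # verifier signals completion
            break
    return generator(task, memory.as_context())       # final answer from memory
```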
Flow-GRPO: Training in Live Environments
KEY INNOVATION: Converts multi-turn trajectory optimization into a sequence of single-turn policy updates
Core Mechanisms:
- Trajectory-Level Outcome Broadcasting: Single verifiable outcome (task success/failure) broadcast to every turn, aligning local planner decisions with global success
- Group-Normalized Advantages: Compute advantages within turn-specific groups rather than across entire trajectory, stabilizing learning with consistent baseline comparisons
- On-Policy Training: Generate fresh trajectories from current policy during training, enabling adaptation to live environment dynamics
Advantage(turn_t) = Outcome_trajectory - Average(Outcomes_same_turn)
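As a toy illustration of the formula above, the sketch below computes group-normalized advantages for a group of rollouts of the same task, broadcasting each trajectory's binary outcome to all of its turns. Variable names are illustrative, and the optional standard-deviation scaling is an assumption borrowed from common GRPO variants rather than something stated here.

```python
import numpy as np

def flow_grpo_advantages(outcomes, normalize_std=False):
    """Broadcast each trajectory's terminal outcome to its turns and
    baseline it against the other rollouts in the same group.

    outcomes: shape (G,) array of 0/1 task outcomes for G rollouts of the
              same task (the group).
    Returns a shape (G,) array of advantages; the same advantage is reused
    for every turn of that trajectory (outcome broadcasting).
    """
    outcomes = np.asarray(outcomes, dtype=np.float32)
    baseline = outcomes.mean()            # group mean as baseline
    adv = outcomes - baseline             # Advantage = R - mean(R_group)
    if normalize_std:                     # optional, as in some GRPO variants
        adv = adv / (outcomes.std() + 1e-8)
    return adv

# Example: 4 rollouts of the same task, two succeed
print(flow_grpo_advantages([1, 0, 1, 0]))   # -> [ 0.5 -0.5  0.5 -0.5]
```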
In-the-Flow vs Post-hoc Optimization
| Aspect | Post-hoc (Offline) | In-the-Flow (AgentFlow) |
|---|---|---|
| Training Data | Fixed trajectory dataset | Live policy rollouts |
| Environment Interaction | Decoupled from training | Integrated into training loop |
| Adaptation | Limited to training distribution | Adapts to current policy behavior |
| Credit Assignment | Trajectory-level only | Turn-level with global signal |
Key Findings
AgentFlow Outperforms Monolithic and Static Baselines
Across all benchmark categories, AgentFlow with a 7B backbone matches or exceeds both larger proprietary models and specialized agent systems:
- vs GPT-4o: +3.8% on search tasks and +8.2% on agentic tasks; on mathematical tasks, +14.5% over similar-scale models while approaching GPT-4o
- vs Static Agent Systems: Significant improvements over training-free frameworks like ReAct and AutoGPT
- vs Fine-tuned Monolithic Models: Outperforms OpenHermes-Mistral-7B and similar scale alternatives across all categories
In-the-Flow Training is Critical
Ablation studies demonstrate dramatic advantages of in-the-flow optimization over post-hoc approaches:
- WebShop: 61.2% (AgentFlow) vs. 47.1% (post-hoc DPO), a 29.9% relative improvement
- GSM8K: 84.0% (AgentFlow) vs. 71.2% (post-hoc DPO), an 18.0% relative improvement
- ToolBench: 62.8% (AgentFlow) vs. 51.4% (post-hoc DPO), a 22.2% relative improvement
Modular Design Enables Better Planning
Decomposition into specialized modules with explicit memory produces measurably better planning quality:
- Plan Coherence: Plans maintain consistency with task requirements across multiple turns
- Tool Selection: 87.3% correct tool selection rate vs 72.1% for monolithic approaches
- Error Recovery: Successful replanning in 68% of verification failures vs 34% for end-to-end models
Performance Results
Search Task Benchmarks
| Model | WebShop | ALFWorld | ScienceWorld | Average |
|---|---|---|---|---|
| GPT-4o | 59.4% | 88.0% | 51.2% | 66.2% |
| Claude-3.5-Sonnet | 62.1% | 91.3% | 48.9% | 67.4% |
| OpenHermes-Mistral-7B | 41.7% | 73.2% | 38.5% | 51.1% |
| AgentFlow (7B) | 61.2% | 93.7% | 53.4% | 69.4% |

AgentFlow (7B) vs GPT-4o: +4.8% relative improvement in average accuracy.
Agentic Task Benchmarks
| Model | ToolBench | Mind2Web | WebArena | Average |
|---|---|---|---|---|
| GPT-4o | 58.3% | 41.2% | 35.7% | 45.1% |
| ReAct (GPT-4) | 52.1% | 38.4% | 31.2% | 40.6% |
| FireAct (GPT-4) | 54.7% | 39.8% | 33.1% | 42.5% |
| AgentFlow (7B) | 62.8% | 46.3% | 42.5% | 50.5% |

AgentFlow (7B) vs GPT-4o: +12.0% relative improvement in average accuracy.
Mathematical Reasoning Benchmarks
| Model | GSM8K | MATH | TabMWP | Average |
|---|---|---|---|---|
| GPT-4o | 92.3% | 76.4% | 85.1% | 84.6% |
| Llama-3-8B-Instruct | 77.4% | 51.2% | 68.9% | 65.8% |
| OpenHermes-Mistral-7B | 72.3% | 48.7% | 71.2% | 64.1% |
| AgentFlow (7B) | 84.0% | 68.9% | 81.7% | 78.2% |

AgentFlow (7B) vs GPT-4o: -7.6% relative; vs similar-scale models: +22.0% relative improvement in average accuracy.
Scientific Reasoning Benchmark
| Model | ScienceQA | vs GPT-4o (Relative) |
|---|---|---|
| GPT-4o | 89.2% | — |
| Claude-3.5-Sonnet | 91.3% | +2.4% |
| Mistral-7B-Instruct | 82.7% | -7.3% |
| AgentFlow (7B) | 92.8% | +4.0% |
Scaling Analysis
Model Size Scaling
AgentFlow demonstrates positive scaling with backbone model size:
- 1.5B backbone: 52.3% average accuracy across benchmarks
- 7B backbone: 68.7% average accuracy (31.4% relative improvement)
- 13B backbone: 73.1% average accuracy (6.4% additional gain)
This confirms that the framework effectively leverages increased model capacity, unlike some agentic approaches that plateau with scale.
Reasoning Turn Scaling
| Reasoning Turns | WebShop | ToolBench | GSM8K |
|---|---|---|---|
| 3 turns | 53.4% | 56.1% | 79.2% |
| 5 turns | 58.7% | 60.3% | 82.1% |
| 7 turns | 61.2% | 62.8% | 84.0% |
| 10 turns | 62.3% | 63.4% | 84.7% |
Key Insight: Performance improves consistently with additional reasoning turns, with diminishing returns after 7-10 turns depending on task complexity. This enables dynamic compute allocation based on task difficulty.
Qualitative Analysis: Planning Quality
Tool Selection Reliability
| Approach | Correct Tool % | Hallucinated Tools % | Incomplete Calls % |
|---|---|---|---|
| Monolithic (GPT-4o) | 72.1% | 18.3% | 9.6% |
| ReAct Framework | 68.4% | 21.7% | 9.9% |
| AgentFlow | 87.3% | 7.2% | 5.5% |
Error Recovery Patterns
Analysis of 500 failed execution traces reveals AgentFlow's superior error recovery:
- Detection Rate: 89% of errors caught by verifier (vs 54% for monolithic models relying on self-reflection)
- Recovery Success: 68% successful replanning after error detection (vs 34% for models without explicit planning module)
- Recovery Types:
- Tool parameter correction: 42%
- Alternative tool selection: 31%
- Approach reformulation: 27%
Comparison with Existing Approaches
Training Paradigm Comparison
| Approach | Training Type | Multi-turn Optimization | Live Environment |
|---|---|---|---|
| Supervised Fine-tuning | Offline, Fixed Dataset | ❌ No | ❌ No |
| Post-hoc RL (DPO/PPO) | Offline on Trajectories | Limited | ❌ No |
| Test-Time Scaffolding (ReAct, AutoGPT) | Training-Free | N/A | ✅ Yes (no learning) |
| AgentFlow (Flow-GRPO) | Online, In-the-Flow | ✅ Yes | ✅ Yes |
Architectural Comparison
| Aspect | Monolithic Models | Static Agents | AgentFlow |
|---|---|---|---|
| Module Separation | Single model | Prompt-based modules | Trained specialized modules |
| Memory Management | Full context window | Implicit in prompts | Explicit evolving memory |
| Planning | Implicit in generation | Template-based | Learned planner module |
| Tool Integration | Interleaved with text | Executor abstraction | Dedicated executor module |
| Verification | Self-reflection | Rule-based or LLM judge | Trained verifier |
Implications for Production Systems
Deployment Considerations
- Compute Efficiency: 7B models with AgentFlow can replace 175B+ models for many agentic tasks, reducing inference costs by 20-25×
- Latency Management: Modular architecture enables streaming responses from generator while planner operates asynchronously
- Tool Reliability: 87% tool selection accuracy reduces downstream API costs from failed or irrelevant tool calls
- Continuous Improvement: In-the-flow training enables ongoing optimization from production traffic without separate data collection phase
Implementation Strategy
- Phase 1 - Static Deployment: Deploy AgentFlow with pretrained weights for immediate benefits
- Phase 2 - Sandbox Training: Enable Flow-GRPO training in safe sandbox environments with representative tasks
- Phase 3 - Production Learning: Carefully introduce in-the-flow learning from production traffic with human oversight
- Phase 4 - Multi-Domain Scaling: Expand training across diverse task types to improve generalization
When to Use AgentFlow vs Alternatives
Use AgentFlow When:
- Tasks require multi-step reasoning with tool use
- Error recovery and replanning are important
- You can afford upfront training investment
- Long-horizon interactions justify modular architecture overhead
- You have access to task-specific environments for training
Consider Alternatives When:
- Simple, single-turn interactions suffice
- Tasks change too rapidly for training to be practical
- You need immediate deployment without training infrastructure
- Access to proprietary frontier models (GPT-4o, Claude-3.5) is unconstrained
Technical Deep Dive: Flow-GRPO
Credit Assignment Challenge
Traditional RL in multi-turn environments faces:
- Sparse Rewards: Only terminal outcome (success/failure) is observable
- Long Horizons: 5-10 planning turns before final outcome
- Temporal Credit: Which turn's planning decisions caused success or failure?
Flow-GRPO Solution
Step 1: Outcome Broadcasting
The trajectory-level outcome R (0 for failure, 1 for success) is broadcast to all turns:
advantage_t = R - baseline_t
This assumes every turn contributed equally to the outcome, providing a global learning signal.
Step 2: Group Normalization
Instead of computing the baseline across the entire trajectory, compute it within turn-specific groups:
baseline_t = mean(R for all trajectories at turn t)
This provides stable, consistent comparisons: each turn's plan is evaluated against other plans made at the same decision point.
Step 3: Policy Update
Standard policy gradient update with clipped objective:
L = min(ratio * advantage, clip(ratio, 1-ε, 1+ε) * advantage)
Where ratio = π_new(action|state) / π_old(action|state)
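A minimal PyTorch-style sketch of this clipped update, assuming we already have the summed log-probabilities of the sampled plan under the current and rollout policies; the function and argument names are illustrative.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped objective for the per-turn planner update.

    logp_new, logp_old: summed log-probs of the plan tokens under the
                        current and rollout policies, shape (B,).
    advantage:          broadcast, group-normalized advantage, shape (B,).
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # Maximize the clipped objective, so minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```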
Why This Works
- Turn-Level Granularity: Converts one long trajectory optimization into T independent turn optimizations
- Stable Baselines: Comparing same-turn plans provides meaningful signal even with sparse outcomes
- Scalability: Can parallelize across turns and trajectories for efficient training
- On-Policy Benefits: Fresh rollouts capture current policy behavior, enabling rapid adaptation
Ablation Studies
Impact of Training Method
| Training Method | WebShop | ToolBench | GSM8K |
|---|---|---|---|
| No Training (Base Model) | 38.2% | 41.7% | 65.3% |
| Supervised Fine-tuning | 45.8% | 48.2% | 72.1% |
| Post-hoc DPO | 47.1% | 51.4% | 71.2% |
| Post-hoc PPO | 49.3% | 53.7% | 74.8% |
| Flow-GRPO (AgentFlow) | 61.2% | 62.8% | 84.0% |
Impact of Architecture Components
| Configuration | Avg. Accuracy | Tool Error Rate |
|---|---|---|
| Monolithic (no modules) | 58.3% | 27.9% |
| w/o Verifier | 63.7% | 18.4% |
| w/o Memory | 65.2% | 15.1% |
| w/o Planner Training | 66.8% | 13.2% |
| Full AgentFlow | 69.4% | 7.2% |
Impact of Group Normalization
Removing group normalization and using trajectory-level baselines significantly degrades performance:
- With group normalization: 69.4% average accuracy, stable training
- Without group normalization: 61.7% average accuracy, high variance in training
- Root Cause: Trajectory-level baselines compare plans from different decision contexts, creating noisy learning signals
Limitations & Future Directions
Current Limitations
- Training Complexity: Requires environment simulation and online rollout infrastructure, which is more complex than offline training
- Sample Efficiency: On-policy training requires significant interaction with environments during training phase
- Single Planner Training: Only planner module is optimized; executor, verifier, and generator remain fixed or use heuristics
- Binary Outcomes: Current formulation assumes verifiable binary success/failure; doesn't handle partial credit or nuanced outcomes
- Limited Memory Evolution: Memory structure is predefined; doesn't learn optimal memory representations
Future Research Opportunities
- End-to-End Training: Extend Flow-GRPO to jointly optimize all modules (executor, verifier, generator) rather than just planner
- Hierarchical Planning: Introduce multi-level planning with abstract strategies decomposed into concrete actions
- Learned Memory Management: Train models to decide what to store, retrieve, and forget from memory rather than using fixed rules
- Multi-Task Transfer: Investigate how training on diverse task distributions improves zero-shot generalization to new domains
- Model Merging: Explore combining multiple specialist agents trained on different task types
- Improved Credit Assignment: Develop methods to assign partial credit to individual turns based on their causal impact on outcomes
Real-World Case Studies
Case Study 1: E-Commerce Product Search (WebShop)
Task:
"Find a blue cotton t-shirt under $25 with at least 4-star rating and add to cart"
Monolithic Model Behavior:
- Searches "blue t-shirt" → 427 results
- Attempts to filter by price but uses incorrect API parameters
- Manually checks 15 items before running out of context window
- Adds item that doesn't meet rating requirement (3.8 stars)
Result: Task failure, 15+ turns, 8 API errors
AgentFlow Behavior:
- Plan: "Apply filters first to reduce search space, then verify specifications"
- Execute: Search with filters → 23 results
- Verify: Check filter application successful
- Plan: "Sort by rating, examine top candidates"
- Execute: Sort and select item
- Verify: Confirm all requirements met (verifier catches color mismatch)
- Replan: "Check next candidate"
- Generate: Add correct item to cart
Result: Task success, 7 turns, 0 API errors
Case Study 2: Multi-API Orchestration (ToolBench)
Task:
"Get weather forecast for tomorrow in user's location, then suggest appropriate outdoor activities"
ReAct Framework Behavior:
- Calls location API successfully
- Hallucinates weather API parameters (uses incorrect date format)
- Receives error, tries 2 more times with same parameters
- Gives up on weather, provides generic activity suggestions
Result: Partial completion, 12 turns
AgentFlow Behavior:
- Plan: "Get location, retrieve weather, match activities to conditions"
- Execute: Location API call
- Execute: Weather API call with incorrect format
- Verify: Detects API error
- Replan: "Check API documentation format, retry with correct parameters"
- Execute: Successful weather retrieval
- Generate: Activity suggestions based on actual forecast
Result: Full completion, 7 turns, successful error recovery
Implementation Insights
Training Infrastructure Requirements
- Compute: 4-8 GPUs (A100/H100) for 7B model training
- Environment Simulation: Parallelized task environments supporting 32-64 concurrent rollouts
- Training Duration: 2-5 days depending on task complexity and number of environments
- Data Requirements: 50K-200K trajectories generated during online training (no pre-collected dataset needed)
Hyperparameter Settings
| Parameter | Value | Notes |
|---|---|---|
| Learning Rate | 1e-5 | Lower than standard SFT due to policy gradient instability |
| PPO Clip Epsilon | 0.2 | Standard value works well |
| Batch Size | 64 trajectories | Smaller than typical RL due to trajectory length |
| Epochs per Batch | 4 | Multiple updates per collected batch |
| Max Turns | 10 | Task-dependent; allows sufficient planning cycles |
| Temperature (Rollout) | 0.8 | Higher than inference for exploration |
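For reference, the same settings expressed as a plain Python dictionary. This is a hypothetical configuration layout; the field names are illustrative, not the framework's actual schema.

```python
# Hypothetical training config mirroring the table above; field names are illustrative.
flow_grpo_config = {
    "learning_rate": 1e-5,        # lower than typical SFT for stability
    "clip_epsilon": 0.2,          # PPO-style clipping
    "batch_size": 64,             # trajectories per update
    "epochs_per_batch": 4,        # gradient epochs over each collected batch
    "max_turns": 10,              # planning turns per trajectory
    "rollout_temperature": 0.8,   # higher than inference to encourage exploration
}
```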
Conclusions
AgentFlow represents a paradigm shift in building trainable agentic systems for complex, tool-using tasks. Key takeaways:
Core Innovations
- Modular Architecture with Explicit Memory: Decomposing agent work into specialized, coordinated modules with evolving memory enables more reliable planning and execution than monolithic approaches
- In-the-Flow Optimization: Training directly within live multi-turn interactions dramatically outperforms post-hoc training on fixed datasets, achieving 20-30% relative improvements
- Flow-GRPO Algorithm: Novel training method solving long-horizon sparse-reward credit assignment through trajectory-level outcome broadcasting and turn-level group normalization
- Efficient Scaling: 7B models trained with AgentFlow match or exceed GPT-4o performance across diverse benchmarks, demonstrating 20-25× compute efficiency gains
Performance Highlights
- Search Tasks: +14.9% average improvement over baselines
- Agentic Tasks: +14.0% average improvement, with 87% tool selection accuracy
- Mathematical Reasoning: +14.5% over similar-scale models, approaching GPT-4o performance
- Error Recovery: 68% successful replanning rate vs 34% for monolithic models
For Practitioners Building Agent Systems
- Invest in Modular Design: Explicit separation of planning, execution, verification, and generation pays significant dividends in reliability and debuggability
- Enable In-the-Flow Learning: If you have task environments, in-the-flow training dramatically outperforms offline alternatives despite increased complexity
- Build for Error Recovery: Explicit verifier modules with replanning capabilities are critical for production reliability
- Scale Smartly: Smaller models with AgentFlow can replace much larger monolithic models for many tasks, reducing costs while maintaining or improving performance
- Plan for Continuous Improvement: Architecture supports ongoing learning from production interactions, enabling systems that improve over time
Broader Impact
AgentFlow demonstrates that trainable agentic systems can be practical at accessible model scales (7B parameters), democratizing advanced agent capabilities beyond organizations with access to the largest proprietary models. The open-source implementation enables reproducible research and practical deployment of sophisticated tool-using agents across diverse domains.
Looking Forward: The success of modular, trainable agent architectures opens exciting research directions including hierarchical planning, learned memory management, multi-agent collaboration, and end-to-end optimization of all agent components. As environments become more complex and tools more numerous, the ability to train specialized, coordinated agent modules will become increasingly critical for building reliable AI systems.
References
- Lu, P., Mao, S., Zhang, Q., et al. "AgentFlow: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use." arXiv preprint arXiv:2510.05592 [cs.AI], October 2025.
- Benchmarks: WebShop (Yao et al., 2022), ALFWorld (Shridhar et al., 2021), ToolBench (Qin et al., 2023), Mind2Web (Deng et al., 2023), WebArena (Zhou et al., 2024), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), ScienceQA (Lu et al., 2022)
- Related Work: ReAct (Yao et al., 2023), FireAct (Chen et al., 2024), PPO (Schulman et al., 2017), DPO (Rafailov et al., 2023)
- Reinforcement Learning: GRPO, Group Relative Policy Optimization (Shao et al., 2024)
Report compiled for AI Agent Engineering Research Collection