AgentFlow: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Pan Lu, Shaoguang Mao, Qiang Zhang, Chao Du, Kaili Li, Wenhu Chen, Jian Lu
UCLA, Shanghai AI Lab, Tsinghua University, Waterloo University

Executive Summary

This research introduces AgentFlow, a trainable agentic framework that addresses fundamental limitations in current LLM-based reasoning approaches. Unlike monolithic policies that interleave thoughts and tool calls, AgentFlow decomposes work across four specialized modules (planner, executor, verifier, generator) and optimizes them directly within live multi-turn interactions. The framework introduces Flow-based Group Refined Policy Optimization (Flow-GRPO), a novel training method that tackles long-horizon sparse-reward challenges by converting multi-turn optimization into tractable single-turn updates. Across ten benchmarks spanning search, agentic, mathematical, and scientific tasks, AgentFlow with a 7B-scale backbone achieves average accuracy gains of 14.9% on search tasks, 14.0% on agentic tasks, 14.5% on mathematical tasks, and 4.1% on scientific tasks, surpassing even larger proprietary models like GPT-4o.

Research Context & Motivation

Outcome-driven reinforcement learning has significantly advanced reasoning capabilities in large language models, but prevailing tool-augmented approaches face critical limitations:

Key Contributions

  1. AgentFlow Architecture: A modular framework coordinating planner, executor, verifier, and generator through evolving memory, with in-the-flow optimization of the planner module
  2. Flow-GRPO Training Method: Novel approach converting multi-turn optimization into tractable single-turn policy updates through trajectory-level outcome broadcasting and group-normalized advantages
  3. Comprehensive Benchmarking: Extensive evaluation across 10 diverse benchmarks demonstrating consistent gains in planning quality and tool-calling reliability, along with positive scaling properties
  4. Open-Source Implementation: Complete framework enabling practical deployment of trainable agentic systems at 7B scale

Methodology

AgentFlow Architecture

FOUR-MODULE DESIGN

Planner: Proposes the next sub-goal and selects which tool to call, conditioned on the task and the evolving memory; this is the module optimized in the flow

Executor: Issues the planner's tool calls and returns the raw observations

Verifier: Checks tool outputs and intermediate results, triggering replanning when requirements are not met

Generator: Composes the final answer once the verifier confirms the task is solved

Memory Component: Maintains evolving state across turns, storing observations, verifications, and intermediate results to inform future planning decisions
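
To make the module decomposition concrete, here is a minimal sketch of how the four modules and the evolving memory could be wired into a single episode loop. The function and field names (`run_agentflow`, `Memory.records`, the `planner`/`executor`/`verifier`/`generator` callables, and the `"solved"` verdict) are illustrative assumptions, not the framework's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Evolving state shared across turns (illustrative structure)."""
    records: list = field(default_factory=list)

    def add(self, turn: int, plan: str, observation: str, verdict: str) -> None:
        self.records.append(
            {"turn": turn, "plan": plan, "observation": observation, "verdict": verdict}
        )

def run_agentflow(task, planner, executor, verifier, generator, max_turns: int = 10):
    """One episode: plan -> execute -> verify each turn, then generate the answer."""
    memory = Memory()
    for turn in range(max_turns):
        plan = planner(task, memory)                          # planner proposes sub-goal + tool call
        observation = executor(plan)                          # executor runs the tool call
        verdict = verifier(task, plan, observation, memory)   # verifier checks the result
        memory.add(turn, plan, observation, verdict)          # memory informs the next plan
        if verdict == "solved":                               # stop early once verified complete
            break
    return generator(task, memory)                            # generator composes the final answer
```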

Flow-GRPO: Training in Live Environments

KEY INNOVATION: Flow-GRPO converts multi-turn trajectory optimization into a sequence of single-turn policy updates.

Core Mechanisms:

  1. Trajectory-Level Outcome Broadcasting: Single verifiable outcome (task success/failure) broadcast to every turn, aligning local planner decisions with global success
  2. Group-Normalized Advantages: Compute advantages within turn-specific groups rather than across entire trajectory, stabilizing learning with consistent baseline comparisons
  3. On-Policy Training: Generate fresh trajectories from the current policy during training, enabling adaptation to live environment dynamics

Combining mechanisms 1 and 2 yields the per-turn advantage (see the sketch below):

Advantage(turn_t) = Outcome_trajectory - Average(Outcomes_same_turn)
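
A minimal numpy sketch of mechanisms 1 and 2 under simplifying assumptions (a group of G rollouts of the same task, all truncated to the same number of turns, binary outcomes); the function name and array shapes are illustrative, not the released implementation.

```python
import numpy as np

def flow_grpo_advantages(outcomes: np.ndarray, num_turns: int) -> np.ndarray:
    """Broadcast each rollout's outcome to all of its turns, then subtract a
    turn-wise baseline computed across the rollout group.

    outcomes: shape (G,), the 0/1 outcome of each of G rollouts of the same task.
    Returns advantages of shape (G, num_turns)."""
    # 1. Trajectory-level outcome broadcasting: every turn inherits the final outcome.
    per_turn_outcome = np.repeat(outcomes[:, None], num_turns, axis=1)   # (G, T)
    # 2. Group-normalized baseline: each turn is compared against the other
    #    rollouts' outcomes at that same turn index.
    baseline = per_turn_outcome.mean(axis=0, keepdims=True)              # (1, T)
    return per_turn_outcome - baseline

# Example: 4 rollouts of the same task, 3 turns each, 2 successes.
adv = flow_grpo_advantages(np.array([1.0, 0.0, 1.0, 0.0]), num_turns=3)
print(adv)  # every turn of a successful rollout gets +0.5, of a failed one -0.5
```

Because the binary outcome is broadcast unchanged to every turn, the turn-wise baseline here equals the group's mean success rate; some GRPO-style methods additionally divide by the group's standard deviation, which the formula above omits.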

In-the-Flow vs Post-hoc Optimization

| Aspect | Post-hoc (Offline) | In-the-Flow (AgentFlow) |
| --- | --- | --- |
| Training Data | Fixed trajectory dataset | Live policy rollouts |
| Environment Interaction | Decoupled from training | Integrated into training loop |
| Adaptation | Limited to training distribution | Adapts to current policy behavior |
| Credit Assignment | Trajectory-level only | Turn-level with global signal |
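
To make the right-hand column of the table concrete, the skeleton below shows how rollout collection and policy updates share one loop, so training data always comes from the current planner acting in the live environment rather than from a fixed offline dataset. The `rollout`, `compute_advantages`, and `update_planner` callables and the trajectory dictionary layout are hypothetical placeholders.

```python
def train_in_the_flow(planner_policy, tasks, rollout, compute_advantages, update_planner,
                      iterations: int = 100, group_size: int = 8):
    """In-the-flow optimization: rollout collection and policy updates share one loop."""
    for _ in range(iterations):
        for task in tasks:
            # Live policy rollouts: trajectories come from the *current* planner
            # interacting with the environment, not from a fixed offline dataset.
            group = [rollout(planner_policy, task) for _ in range(group_size)]
            outcomes = [traj["success"] for traj in group]        # trajectory-level 0/1 outcomes
            advantages = compute_advantages(outcomes, group)      # e.g. Flow-GRPO style
            # Turn-level credit assignment with a global signal: every turn of a
            # trajectory is updated with that trajectory's broadcast, group-normalized advantage.
            for traj, traj_adv in zip(group, advantages):
                for turn, adv in zip(traj["turns"], traj_adv):
                    update_planner(planner_policy, turn["state"], turn["action"], adv)
    return planner_policy
```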

Key Findings

AgentFlow Outperforms Monolithic and Static Baselines

Across all benchmark categories, AgentFlow with a 7B backbone consistently exceeds both larger models and specialized systems:

In-the-Flow Training is Critical

Ablation studies demonstrate dramatic advantages of in-the-flow optimization over post-hoc approaches:

Modular Design Enables Better Planning

Decomposition into specialized modules with explicit memory produces measurably better planning quality:

Performance Results

Search Task Benchmarks

| Model | WebShop | ALFWorld | ScienceWorld | Average |
| --- | --- | --- | --- | --- |
| GPT-4o | 59.4% | 88.0% | 51.2% | 66.2% |
| Claude-3.5-Sonnet | 62.1% | 91.3% | 48.9% | 67.4% |
| OpenHermes-Mistral-7B | 41.7% | 73.2% | 38.5% | 51.1% |
| AgentFlow (7B) | 61.2% | 93.7% | 53.4% | 69.4% |

AgentFlow vs GPT-4o: +4.8% (relative)

Agentic Task Benchmarks

| Model | ToolBench | Mind2Web | WebArena | Average |
| --- | --- | --- | --- | --- |
| GPT-4o | 58.3% | 41.2% | 35.7% | 45.1% |
| ReAct (GPT-4) | 52.1% | 38.4% | 31.2% | 40.6% |
| FireAct (GPT-4) | 54.7% | 39.8% | 33.1% | 42.5% |
| AgentFlow (7B) | 62.8% | 46.3% | 42.5% | 50.5% |

AgentFlow vs GPT-4o: +12.0% (relative)

Mathematical Reasoning Benchmarks

| Model | GSM8K | MATH | TabMWP | Average |
| --- | --- | --- | --- | --- |
| GPT-4o | 92.3% | 76.4% | 85.1% | 84.6% |
| Llama-3-8B-Instruct | 77.4% | 51.2% | 68.9% | 65.8% |
| OpenHermes-Mistral-7B | 72.3% | 48.7% | 71.2% | 64.1% |
| AgentFlow (7B) | 84.0% | 68.9% | 81.7% | 78.2% |

AgentFlow vs GPT-4o: -7.6% (relative)
AgentFlow vs similar-scale baselines: +22.0% (relative)

Scientific Reasoning Benchmark

| Model | ScienceQA | vs GPT-4o (relative) |
| --- | --- | --- |
| GPT-4o | 89.2% | — |
| Claude-3.5-Sonnet | 91.3% | +2.4% |
| Mistral-7B-Instruct | 82.7% | -7.3% |
| AgentFlow (7B) | 92.8% | +4.0% |

Scaling Analysis

Model Size Scaling

AgentFlow demonstrates positive scaling with backbone model size:

This confirms that the framework effectively leverages increased model capacity, unlike some agentic approaches that plateau with scale.

Reasoning Turn Scaling

| Reasoning Turns | WebShop | ToolBench | GSM8K |
| --- | --- | --- | --- |
| 3 turns | 53.4% | 56.1% | 79.2% |
| 5 turns | 58.7% | 60.3% | 82.1% |
| 7 turns | 61.2% | 62.8% | 84.0% |
| 10 turns | 62.3% | 63.4% | 84.7% |

Key Insight: Performance improves consistently with additional reasoning turns, with diminishing returns after 7-10 turns depending on task complexity. This enables dynamic compute allocation based on task difficulty.
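
One simple way to act on this insight is to assign each task a turn budget up front and spend more turns only where they are likely to pay off. The difficulty labels and budget values below are illustrative assumptions, loosely mirroring the 3/7/10-turn settings in the table above.

```python
def turn_budget(estimated_difficulty: str) -> int:
    """Map an estimated task difficulty to a reasoning-turn budget (illustrative values,
    loosely following the 3/7/10-turn settings benchmarked above)."""
    budgets = {"easy": 3, "medium": 7, "hard": 10}
    return budgets.get(estimated_difficulty, 7)   # default to the knee of the curve

# Example: spend extra turns only on tasks judged hard.
for task, difficulty in [("single product lookup", "easy"), ("multi-constraint search", "hard")]:
    print(f"{task}: {turn_budget(difficulty)} turns")
```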

Qualitative Analysis: Planning Quality

Tool Selection Reliability

| Approach | Correct Tool % | Hallucinated Tools % | Incomplete Calls % |
| --- | --- | --- | --- |
| Monolithic (GPT-4o) | 72.1% | 18.3% | 9.6% |
| ReAct Framework | 68.4% | 21.7% | 9.9% |
| AgentFlow | 87.3% | 7.2% | 5.5% |
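
One plausible reason for the lower hallucinated-tool and incomplete-call rates is that a dedicated executor can validate each planned call against registered tool schemas before executing it. The sketch below illustrates that kind of check; it is not AgentFlow's actual mechanism, and the registry contents are made up.

```python
# Hypothetical tool registry: tool name -> required argument names.
TOOL_REGISTRY = {
    "search_products": {"query", "max_price"},
    "get_weather": {"location", "date"},
}

def validate_tool_call(name: str, args: dict) -> tuple[bool, str]:
    """Reject hallucinated tools and incomplete calls before they reach execution."""
    if name not in TOOL_REGISTRY:
        return False, f"hallucinated tool: {name!r}"
    missing = TOOL_REGISTRY[name] - set(args)
    if missing:
        return False, f"incomplete call: missing {sorted(missing)}"
    return True, "ok"

print(validate_tool_call("get_weather", {"location": "Berlin"}))       # incomplete call
print(validate_tool_call("lookup_forecast", {"location": "Berlin"}))   # hallucinated tool
```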

Error Recovery Patterns

Analysis of 500 failed execution traces reveals AgentFlow's superior error recovery:

Comparison with Existing Approaches

Training Paradigm Comparison

| Approach | Training Type | Multi-turn Optimization | Live Environment |
| --- | --- | --- | --- |
| Supervised Fine-tuning | Offline, Fixed Dataset | ❌ No | ❌ No |
| Post-hoc RL (DPO/PPO) | Offline on Trajectories | Limited | ❌ No |
| Test-Time Scaffolding (ReAct, AutoGPT) | Training-Free | N/A | ✅ Yes (no learning) |
| AgentFlow (Flow-GRPO) | Online, In-the-Flow | ✅ Yes | ✅ Yes |

Architectural Comparison

| Aspect | Monolithic Models | Static Agents | AgentFlow |
| --- | --- | --- | --- |
| Module Separation | Single model | Prompt-based modules | Trained specialized modules |
| Memory Management | Full context window | Implicit in prompts | Explicit evolving memory |
| Planning | Implicit in generation | Template-based | Learned planner module |
| Tool Integration | Interleaved with text | Executor abstraction | Dedicated executor module |
| Verification | Self-reflection | Rule-based or LLM judge | Trained verifier |

Implications for Production Systems

Deployment Considerations

Implementation Strategy

  1. Phase 1 - Static Deployment: Deploy AgentFlow with pretrained weights for immediate benefits
  2. Phase 2 - Sandbox Training: Enable Flow-GRPO training in safe sandbox environments with representative tasks
  3. Phase 3 - Production Learning: Carefully introduce in-the-flow learning from production traffic with human oversight
  4. Phase 4 - Multi-Domain Scaling: Expand training across diverse task types to improve generalization

When to Use AgentFlow vs Alternatives

Use AgentFlow When:

Consider Alternatives When:

Technical Deep Dive: Flow-GRPO

Credit Assignment Challenge

Traditional RL in multi-turn environments faces:

Flow-GRPO Solution

Step 1: Outcome Broadcasting

The trajectory-level outcome R (0 for failure, 1 for success) is broadcast to every turn:

advantage_t = R - baseline_t

This assumes all turns contributed equally to the outcome, providing a global learning signal.

Step 2: Group Normalization

Instead of computing the baseline across the entire trajectory, compute it within turn-specific groups:

baseline_t = mean(R for all trajectories at turn t)

This provides stable, consistent comparisons: each turn's plan is evaluated against other plans made at the same decision point.

Step 3: Policy Update

Standard policy gradient update with clipped objective:

L = min(ratio * advantage, clip(ratio, 1-ε, 1+ε) * advantage)

Where ratio = π_new(action|state) / π_old(action|state)
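
A numpy sketch of the clipped objective above for a batch of turn-level planner actions. The log-probabilities are assumed to be provided by the new and old policies; in practice this term would be computed with automatic differentiation, which the sketch omits.

```python
import numpy as np

def clipped_surrogate(log_prob_new: np.ndarray, log_prob_old: np.ndarray,
                      advantages: np.ndarray, epsilon: float = 0.2) -> float:
    """Clipped surrogate objective for turn-level planner updates.

    log_prob_new / log_prob_old: log-probabilities of the sampled planner actions
    under the current policy and the rollout-time policy (one entry per action).
    advantages: the broadcast, group-normalized advantages for those actions."""
    ratio = np.exp(log_prob_new - log_prob_old)            # pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the more pessimistic of the unclipped and clipped terms, then average.
    objective = np.minimum(ratio * advantages, clipped * advantages).mean()
    return objective  # maximize this (equivalently, minimize its negation as a loss)

# Tiny example: one positive-advantage and one negative-advantage turn.
print(clipped_surrogate(np.array([-0.9, -1.6]), np.array([-1.2, -1.4]), np.array([0.5, -0.5])))
```

The clipping keeps the updated planner close to the rollout-time policy, which matters here because each collected batch of trajectories is reused for several update epochs.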

Why This Works

Ablation Studies

Impact of Training Method

| Training Method | WebShop | ToolBench | GSM8K |
| --- | --- | --- | --- |
| No Training (Base Model) | 38.2% | 41.7% | 65.3% |
| Supervised Fine-tuning | 45.8% | 48.2% | 72.1% |
| Post-hoc DPO | 47.1% | 51.4% | 71.2% |
| Post-hoc PPO | 49.3% | 53.7% | 74.8% |
| Flow-GRPO (AgentFlow) | 61.2% | 62.8% | 84.0% |

Impact of Architecture Components

| Configuration | Avg Accuracy | Tool Error Rate |
| --- | --- | --- |
| Monolithic (no modules) | 58.3% | 27.9% |
| w/o Verifier | 63.7% | 18.4% |
| w/o Memory | 65.2% | 15.1% |
| w/o Planner Training | 66.8% | 13.2% |
| Full AgentFlow | 69.4% | 7.2% |

Impact of Group Normalization

Removing group normalization and using trajectory-level baselines significantly degrades performance:

Limitations & Future Directions

Current Limitations

Future Research Opportunities

Real-World Case Studies

Case Study 1: E-Commerce Product Search (WebShop)

Task:

"Find a blue cotton t-shirt under $25 with at least 4-star rating and add to cart"

Monolithic Model Behavior:

  1. Searches "blue t-shirt" → 427 results
  2. Attempts to filter by price but uses incorrect API parameters
  3. Manually checks 15 items before running out of context window
  4. Adds item that doesn't meet rating requirement (3.8 stars)

Result: Task failure, 15+ turns, 8 API errors

AgentFlow Behavior:

  1. Plan: "Apply filters first to reduce search space, then verify specifications"
  2. Execute: Search with filters → 23 results
  3. Verify: Check filter application successful
  4. Plan: "Sort by rating, examine top candidates"
  5. Execute: Sort and select item
  6. Verify: Confirm all requirements met (verifier catches color mismatch)
  7. Replan: "Check next candidate"
  8. Generate: Add correct item to cart

Result: Task success, 7 turns, 0 API errors

Case Study 2: Multi-API Orchestration (ToolBench)

Task:

"Get weather forecast for tomorrow in user's location, then suggest appropriate outdoor activities"

ReAct Framework Behavior:

  1. Calls location API successfully
  2. Hallucinates weather API parameters (uses incorrect date format)
  3. Receives error, tries 2 more times with same parameters
  4. Gives up on weather, provides generic activity suggestions

Result: Partial completion, 12 turns

AgentFlow Behavior:

  1. Plan: "Get location, retrieve weather, match activities to conditions"
  2. Execute: Location API call
  3. Execute: Weather API call with incorrect format
  4. Verify: Detects API error
  5. Replan: "Check API documentation format, retry with correct parameters"
  6. Execute: Successful weather retrieval
  7. Generate: Activity suggestions based on actual forecast

Result: Full completion, 7 turns, successful error recovery
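
The recovery pattern shared by both case studies (execute, verify, replan on a detected error, retry) can be written as a small control loop. The `execute`, `verify`, and `replan` callables and the error strings are illustrative assumptions, not the framework's actual interfaces.

```python
def execute_with_recovery(plan, execute, verify, replan, max_retries: int = 2):
    """Verifier-gated execution: on a detected error, replan with the verifier's
    diagnosis and retry, instead of repeating the same failing call."""
    reason = "no attempt made"
    for _ in range(max_retries + 1):
        observation = execute(plan)
        ok, reason = verify(plan, observation)
        if ok:
            return observation
        # Feed the diagnosis (e.g. "incorrect date format") back into planning.
        plan = replan(plan, reason)
    raise RuntimeError(f"exhausted retries; last issue: {reason}")
```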

Implementation Insights

Training Infrastructure Requirements

Hyperparameter Settings

| Parameter | Value | Notes |
| --- | --- | --- |
| Learning Rate | 1e-5 | Lower than standard SFT due to policy-gradient instability |
| PPO Clip Epsilon | 0.2 | Standard value works well |
| Batch Size | 64 trajectories | Smaller than typical RL due to trajectory length |
| Epochs per Batch | 4 | Multiple update epochs over each collected batch |
| Max Turns | 10 | Task-dependent; allows sufficient planning cycles |
| Temperature (Rollout) | 0.8 | Higher than inference for exploration |
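
For reproducibility, the settings above can be captured in a single config object; the field names below are illustrative, not the released code's configuration schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowGRPOTrainingConfig:
    """Training hyperparameters from the table above (illustrative field names)."""
    learning_rate: float = 1e-5       # lower than typical SFT, for policy-gradient stability
    clip_epsilon: float = 0.2         # standard PPO-style clipping range
    batch_size: int = 64              # trajectories collected per update batch
    epochs_per_batch: int = 4         # gradient epochs over each collected batch
    max_turns: int = 10               # per-episode planning-cycle budget (task-dependent)
    rollout_temperature: float = 0.8  # higher than inference temperature, for exploration

print(FlowGRPOTrainingConfig())
```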

Conclusions

AgentFlow represents a paradigm shift in building trainable agentic systems for complex, tool-using tasks. Key takeaways:

Core Innovations

Performance Highlights

For Practitioners Building Agent Systems

  1. Invest in Modular Design: Explicit separation of planning, execution, verification, and generation pays significant dividends in reliability and debuggability
  2. Enable In-the-Flow Learning: If you have task environments, in-the-flow training dramatically outperforms offline alternatives despite increased complexity
  3. Build for Error Recovery: Explicit verifier modules with replanning capabilities are critical for production reliability
  4. Scale Smartly: Smaller models with AgentFlow can replace much larger monolithic models for many tasks, reducing costs while maintaining or improving performance
  5. Plan for Continuous Improvement: The architecture supports ongoing learning from production interactions, enabling systems that improve over time

Broader Impact

AgentFlow demonstrates that trainable agentic systems can be practical at accessible model scales (7B parameters), democratizing advanced agent capabilities beyond organizations with access to the largest proprietary models. The open-source implementation enables reproducible research and practical deployment of sophisticated tool-using agents across diverse domains.

Looking Forward: The success of modular, trainable agent architectures opens exciting research directions including hierarchical planning, learned memory management, multi-agent collaboration, and end-to-end optimization of all agent components. As environments become more complex and tools more numerous, the ability to train specialized, coordinated agent modules will become increasingly critical for building reliable AI systems.

Report compiled for AI Agent Engineering Research Collection

For more resources, visit join.maxpool.dev