AgentFlow: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Pan Lu, Shaoguang Mao, Qiang Zhang, Chao Du, Kaili Li, Wenhu Chen, Jian Lu
UCLA, Shanghai AI Lab, Tsinghua University, University of Waterloo
Executive Summary
This research introduces AgentFlow, a trainable agentic framework that addresses fundamental limitations in current LLM-based reasoning approaches. Unlike monolithic policies that interleave thoughts and tool calls, AgentFlow decomposes work across four specialized modules (planner, executor, verifier, generator) and optimizes them directly within live multi-turn interactions. The framework introduces Flow-based Group Refined Policy Optimization (Flow-GRPO), a novel training method that tackles long-horizon sparse-reward challenges by converting multi-turn optimization into tractable single-turn updates. Across ten benchmarks spanning search, agentic, mathematical, and scientific tasks, AgentFlow with a 7B-scale backbone achieves average accuracy gains of 14.9% on search tasks, 14.0% on agentic tasks, 14.5% on mathematical tasks, and 4.1% on scientific tasks, surpassing even larger proprietary models like GPT-4o.
Research Context & Motivation
Outcome-driven reinforcement learning has significantly advanced reasoning capabilities in large language models, but prevailing tool-augmented approaches face critical limitations:
- Monolithic Architecture: Single policies interleave thoughts and tool calls under full context, which scales poorly with long horizons and diverse tools
- Weak Generalization: Training approaches struggle to generalize to new scenarios with different tool sets or task requirements
- Static Agentic Systems: Existing multi-module agent frameworks remain largely training-free or rely on offline training disconnected from live interaction dynamics
- Credit Assignment Challenge: Long-horizon, sparse-reward environments make it difficult to attribute success or failure to specific planning decisions
Key Contributions
- AgentFlow Architecture: A modular framework coordinating planner, executor, verifier, and generator through evolving memory, with in-the-flow optimization of the planner module
- Flow-GRPO Training Method: Novel approach converting multi-turn optimization into tractable single-turn policy updates through trajectory-level outcome broadcasting and group-normalized advantages
- Comprehensive Benchmarking: Extensive evaluation across 10 diverse benchmarks demonstrating consistent improvements in planning quality, tool-calling reliability, and positive scaling properties
- Open-Source Implementation: Complete framework enabling practical deployment of trainable agentic systems at 7B scale
Methodology
AgentFlow Architecture
FOUR-MODULE DESIGN
- Planner: Generates high-level plan based on task and current memory state (trainable)
- Executor: Executes planned actions and tool calls in the environment
- Verifier: Evaluates execution outcomes and determines if replanning is needed
- Generator: Produces final answer based on accumulated information
Memory Component: Maintains evolving state across turns, storing observations, verifications, and intermediate results to inform future planning decisions
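The control flow can be pictured as a small loop over these modules. Below is a minimal sketch, assuming hypothetical `planner`, `executor`, `verifier`, and `generator` callables and a simple list-backed memory; the class and method names are illustrative, not the framework's actual API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the AgentFlow loop; names are hypothetical,
# not the framework's actual API.

@dataclass
class Memory:
    """Evolving state shared across turns."""
    records: list = field(default_factory=list)

    def add(self, turn, plan, result, verdict):
        self.records.append(
            {"turn": turn, "plan": plan, "result": result, "verdict": verdict}
        )

    def as_context(self):
        return "\n".join(str(r) for r in self.records)


def run_agentflow(task, planner, executor, verifier, generator, max_turns=10):
    memory = Memory()
    for turn in range(max_turns):
        plan = planner(task, memory.as_context())      # trainable module
        result = executor(plan)                        # tool calls / actions
        verdict = verifier(task, plan, result, memory.as_context())
        memory.add(turn, plan, result, verdict)
        if verdict.get("done"):                        # verifier signals completion
            break
    return generator(task, memory.as_context())       # final answer from memory
```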
Flow-GRPO: Training in Live Environments
KEY INNOVATION: Converts multi-turn trajectory optimization into a sequence of single-turn policy updates
Core Mechanisms:
- Trajectory-Level Outcome Broadcasting: Single verifiable outcome (task success/failure) broadcast to every turn, aligning local planner decisions with global success
- Group-Normalized Advantages: Compute advantages within turn-specific groups rather than across entire trajectory, stabilizing learning with consistent baseline comparisons
- On-Policy Training: Generate fresh trajectories from current policy during training, enabling adaptation to live environment dynamics
Advantage(turn_t) = Outcome_trajectory - Average(Outcomes_same_turn)
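As a toy illustration of the formula above, the sketch below computes group-normalized advantages for a group of rollouts of the same task, broadcasting each trajectory's binary outcome to all of its turns. Variable names are illustrative, and the optional standard-deviation scaling is an assumption borrowed from common GRPO variants rather than something stated here.

```python
import numpy as np

def flow_grpo_advantages(outcomes, normalize_std=False):
    """Broadcast each trajectory's terminal outcome to its turns and
    baseline it against the other rollouts in the same group.

    outcomes: shape (G,) array of 0/1 task outcomes for G rollouts of the
              same task (the group).
    Returns a shape (G,) array of advantages; the same advantage is reused
    for every turn of that trajectory (outcome broadcasting).
    """
    outcomes = np.asarray(outcomes, dtype=np.float32)
    baseline = outcomes.mean()            # group mean as baseline
    adv = outcomes - baseline             # Advantage = R - mean(R_group)
    if normalize_std:                     # optional, as in some GRPO variants
        adv = adv / (outcomes.std() + 1e-8)
    return adv

# Example: 4 rollouts of the same task, two succeed
print(flow_grpo_advantages([1, 0, 1, 0]))   # -> [ 0.5 -0.5  0.5 -0.5]
```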
In-the-Flow vs Post-hoc Optimization
| Aspect | Post-hoc (Offline) | In-the-Flow (AgentFlow) |
|---|---|---|
| Training Data | Fixed trajectory dataset | Live policy rollouts |
| Environment Interaction | Decoupled from training | Integrated into training loop |
| Adaptation | Limited to training distribution | Adapts to current policy behavior |
| Credit Assignment | Trajectory-level only | Turn-level with global signal |
Key Findings
AgentFlow Outperforms Monolithic and Static Baselines
Across all benchmark categories, AgentFlow with a 7B backbone matches or exceeds both larger proprietary models and specialized agent systems:
- vs GPT-4o: +3.8% on search tasks and +8.2% on agentic tasks; on mathematical tasks, +14.5% over similar-scale models while approaching GPT-4o
- vs Static Agent Systems: Significant improvements over training-free frameworks like ReAct and AutoGPT
- vs Fine-tuned Monolithic Models: Outperforms OpenHermes-Mistral-7B and similar scale alternatives across all categories
In-the-Flow Training is Critical
Ablation studies demonstrate dramatic advantages of in-the-flow optimization over post-hoc approaches:
- WebShop: 61.2% (AgentFlow) vs. 47.1% (post-hoc DPO), a 29.9% relative improvement
- GSM8K: 84.0% (AgentFlow) vs. 71.2% (post-hoc DPO), an 18.0% relative improvement
- ToolBench: 62.8% (AgentFlow) vs. 51.4% (post-hoc DPO), a 22.2% relative improvement
Modular Design Enables Better Planning
Decomposition into specialized modules with explicit memory produces measurably better planning quality:
- Plan Coherence: Plans maintain consistency with task requirements across multiple turns
- Tool Selection: 87.3% correct tool selection rate vs 72.1% for monolithic approaches
- Error Recovery: Successful replanning in 68% of verification failures vs 34% for end-to-end models
Performance Results
Search Task Benchmarks
| Model | WebShop | ALFWorld | ScienceWorld | Average |
|---|---|---|---|---|
| GPT-4o | 59.4% | 88.0% | 51.2% | 66.2% |
| Claude-3.5-Sonnet | 62.1% | 91.3% | 48.9% | 67.4% |
| OpenHermes-Mistral-7B | 41.7% | 73.2% | 38.5% | 51.1% |
| AgentFlow (7B) | 61.2% | 93.7% | 53.4% | 69.4% |

AgentFlow (7B) vs GPT-4o: +4.8% relative improvement in average accuracy.
Agentic Task Benchmarks
| Model | ToolBench | Mind2Web | WebArena | Average |
|---|---|---|---|---|
| GPT-4o | 58.3% | 41.2% | 35.7% | 45.1% |
| ReAct (GPT-4) | 52.1% | 38.4% | 31.2% | 40.6% |
| FireAct (GPT-4) | 54.7% | 39.8% | 33.1% | 42.5% |
| AgentFlow (7B) | 62.8% | 46.3% | 42.5% | 50.5% |

AgentFlow (7B) vs GPT-4o: +12.0% relative improvement in average accuracy.
Mathematical Reasoning Benchmarks
| Model | GSM8K | MATH | TabMWP | Average |
|---|---|---|---|---|
| GPT-4o | 92.3% | 76.4% | 85.1% | 84.6% |
| Llama-3-8B-Instruct | 77.4% | 51.2% | 68.9% | 65.8% |
| OpenHermes-Mistral-7B | 72.3% | 48.7% | 71.2% | 64.1% |
| AgentFlow (7B) | 84.0% | 68.9% | 81.7% | 78.2% |

AgentFlow (7B) vs GPT-4o: -7.6% relative; vs similar-scale models: +22.0% relative improvement in average accuracy.
Scientific Reasoning Benchmark
| Model | ScienceQA | vs GPT-4o (Relative) |
|---|---|---|
| GPT-4o | 89.2% | — |
| Claude-3.5-Sonnet | 91.3% | +2.4% |
| Mistral-7B-Instruct | 82.7% | -7.3% |
| AgentFlow (7B) | 92.8% | +4.0% |
Scaling Analysis
Model Size Scaling
AgentFlow demonstrates positive scaling with backbone model size:
- 1.5B backbone: 52.3% average accuracy across benchmarks
- 7B backbone: 68.7% average accuracy (31.4% relative improvement)
- 13B backbone: 73.1% average accuracy (6.4% additional gain)
This confirms that the framework effectively leverages increased model capacity, unlike some agentic approaches that plateau with scale.
Reasoning Turn Scaling
| Reasoning Turns | WebShop | ToolBench | GSM8K |
|---|---|---|---|
| 3 turns | 53.4% | 56.1% | 79.2% |
| 5 turns | 58.7% | 60.3% | 82.1% |
| 7 turns | 61.2% | 62.8% | 84.0% |
| 10 turns | 62.3% | 63.4% | 84.7% |
Key Insight: Performance improves consistently with additional reasoning turns, with diminishing returns after 7-10 turns depending on task complexity. This enables dynamic compute allocation based on task difficulty.
Qualitative Analysis: Planning Quality
Tool Selection Reliability
| Approach | Correct Tool % | Hallucinated Tools % | Incomplete Calls % |
|---|---|---|---|
| Monolithic (GPT-4o) | 72.1% | 18.3% | 9.6% |
| ReAct Framework | 68.4% | 21.7% | 9.9% |
| AgentFlow | 87.3% | 7.2% | 5.5% |
Error Recovery Patterns
Analysis of 500 failed execution traces reveals AgentFlow's superior error recovery:
- Detection Rate: 89% of errors caught by verifier (vs 54% for monolithic models relying on self-reflection)
- Recovery Success: 68% successful replanning after error detection (vs 34% for models without explicit planning module)
- Recovery Types:
- Tool parameter correction: 42%
- Alternative tool selection: 31%
- Approach reformulation: 27%
Comparison with Existing Approaches
Training Paradigm Comparison
| Approach | Training Type | Multi-turn Optimization | Live Environment |
|---|---|---|---|
| Supervised Fine-tuning | Offline, Fixed Dataset | ❌ No | ❌ No |
| Post-hoc RL (DPO/PPO) | Offline on Trajectories | Limited | ❌ No |
| Test-Time Scaffolding (ReAct, AutoGPT) | Training-Free | N/A | ✅ Yes (no learning) |
| AgentFlow (Flow-GRPO) | Online, In-the-Flow | ✅ Yes | ✅ Yes |
Architectural Comparison
| Aspect | Monolithic Models | Static Agents | AgentFlow |
|---|---|---|---|
| Module Separation | Single model | Prompt-based modules | Trained specialized modules |
| Memory Management | Full context window | Implicit in prompts | Explicit evolving memory |
| Planning | Implicit in generation | Template-based | Learned planner module |
| Tool Integration | Interleaved with text | Executor abstraction | Dedicated executor module |
| Verification | Self-reflection | Rule-based or LLM judge | Trained verifier |
Implications for Production Systems
Deployment Considerations
- Compute Efficiency: 7B models with AgentFlow can replace 175B+ models for many agentic tasks, reducing inference costs by 20-25×
- Latency Management: Modular architecture enables streaming responses from generator while planner operates asynchronously
- Tool Reliability: 87% tool selection accuracy reduces downstream API costs from failed or irrelevant tool calls
- Continuous Improvement: In-the-flow training enables ongoing optimization from production traffic without separate data collection phase
Implementation Strategy
- Phase 1 - Static Deployment: Deploy AgentFlow with pretrained weights for immediate benefits
- Phase 2 - Sandbox Training: Enable Flow-GRPO training in safe sandbox environments with representative tasks
- Phase 3 - Production Learning: Carefully introduce in-the-flow learning from production traffic with human oversight
- Phase 4 - Multi-Domain Scaling: Expand training across diverse task types to improve generalization
When to Use AgentFlow vs Alternatives
Use AgentFlow When:
- Tasks require multi-step reasoning with tool use
- Error recovery and replanning are important
- You can afford upfront training investment
- Long-horizon interactions justify modular architecture overhead
- You have access to task-specific environments for training
Consider Alternatives When:
- Simple, single-turn interactions suffice
- Tasks change too rapidly for training to be practical
- You need immediate deployment without training infrastructure
- Access to proprietary frontier models (GPT-4o, Claude-3.5) is unconstrained
Technical Deep Dive: Flow-GRPO
Credit Assignment Challenge
Traditional RL in multi-turn environments faces:
- Sparse Rewards: Only terminal outcome (success/failure) is observable
- Long Horizons: 5-10 planning turns before final outcome
- Temporal Credit: Which turn's planning decisions caused success or failure?
Flow-GRPO Solution
Step 1: Outcome Broadcasting
The trajectory-level outcome R (0 for failure, 1 for success) is broadcast to all turns:
advantage_t = R - baseline_t
This assumes every turn contributed equally to the outcome, providing a global learning signal.
Step 2: Group Normalization
Instead of computing the baseline across the entire trajectory, compute it within turn-specific groups:
baseline_t = mean(R for all trajectories at turn t)
This provides stable, consistent comparisons: each turn's plan is evaluated against other plans made at the same decision point.
Step 3: Policy Update
Standard policy gradient update with clipped objective:
L = min(ratio * advantage, clip(ratio, 1-ε, 1+ε) * advantage)
Where ratio = π_new(action|state) / π_old(action|state)
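A minimal PyTorch-style sketch of this clipped update, assuming we already have the summed log-probabilities of the sampled plan under the current and rollout policies; the function and argument names are illustrative.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped objective for the per-turn planner update.

    logp_new, logp_old: summed log-probs of the plan tokens under the
                        current and rollout policies, shape (B,).
    advantage:          broadcast, group-normalized advantage, shape (B,).
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # Maximize the clipped objective, so minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```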
Why This Works
- Turn-Level Granularity: Converts one long trajectory optimization into T independent turn optimizations
- Stable Baselines: Comparing same-turn plans provides meaningful signal even with sparse outcomes
- Scalability: Can parallelize across turns and trajectories for efficient training
- On-Policy Benefits: Fresh rollouts capture current policy behavior, enabling rapid adaptation
Ablation Studies
Impact of Training Method
| Training Method | WebShop | ToolBench | GSM8K |
|---|---|---|---|
| No Training (Base Model) | 38.2% | 41.7% | 65.3% |
| Supervised Fine-tuning | 45.8% | 48.2% | 72.1% |
| Post-hoc DPO | 47.1% | 51.4% | 71.2% |
| Post-hoc PPO | 49.3% | 53.7% | 74.8% |
| Flow-GRPO (AgentFlow) | 61.2% | 62.8% | 84.0% |
Impact of Architecture Components
| Configuration | Avg. Accuracy | Tool Error Rate |
|---|---|---|
| Monolithic (no modules) | 58.3% | 27.9% |
| w/o Verifier | 63.7% | 18.4% |
| w/o Memory | 65.2% | 15.1% |
| w/o Planner Training | 66.8% | 13.2% |
| Full AgentFlow | 69.4% | 7.2% |
Impact of Group Normalization
Removing group normalization and using trajectory-level baselines significantly degrades performance:
- With group normalization: 69.4% average accuracy, stable training
- Without group normalization: 61.7% average accuracy, high variance in training
- Root Cause: Trajectory-level baselines compare plans from different decision contexts, creating noisy learning signals
Limitations & Future Directions
Current Limitations
- Training Complexity: Requires environment simulation and online rollout infrastructure, which is more complex than offline training
- Sample Efficiency: On-policy training requires significant interaction with environments during training phase
- Single Planner Training: Only planner module is optimized; executor, verifier, and generator remain fixed or use heuristics
- Binary Outcomes: Current formulation assumes verifiable binary success/failure; doesn't handle partial credit or nuanced outcomes
- Limited Memory Evolution: Memory structure is predefined; doesn't learn optimal memory representations
Future Research Opportunities
- End-to-End Training: Extend Flow-GRPO to jointly optimize all modules (executor, verifier, generator) rather than just planner
- Hierarchical Planning: Introduce multi-level planning with abstract strategies decomposed into concrete actions
- Learned Memory Management: Train models to decide what to store, retrieve, and forget from memory rather than using fixed rules
- Multi-Task Transfer: Investigate how training on diverse task distributions improves zero-shot generalization to new domains
- Model Merging: Explore combining multiple specialist agents trained on different task types
- Improved Credit Assignment: Develop methods to assign partial credit to individual turns based on their causal impact on outcomes
Real-World Case Studies
Case Study 1: E-Commerce Product Search (WebShop)
Task:
"Find a blue cotton t-shirt under $25 with at least 4-star rating and add to cart"
Monolithic Model Behavior:
- Searches "blue t-shirt" → 427 results
- Attempts to filter by price but uses incorrect API parameters
- Manually checks 15 items before running out of context window
- Adds item that doesn't meet rating requirement (3.8 stars)
Result: Task failure, 15+ turns, 8 API errors
AgentFlow Behavior:
- Plan: "Apply filters first to reduce search space, then verify specifications"
- Execute: Search with filters → 23 results
- Verify: Check filter application successful
- Plan: "Sort by rating, examine top candidates"
- Execute: Sort and select item
- Verify: Confirm all requirements met (verifier catches color mismatch)
- Replan: "Check next candidate"
- Generate: Add correct item to cart
Result: Task success, 7 turns, 0 API errors
Case Study 2: Multi-API Orchestration (ToolBench)
Task:
"Get weather forecast for tomorrow in user's location, then suggest appropriate outdoor activities"
ReAct Framework Behavior:
- Calls location API successfully
- Hallucinates weather API parameters (uses incorrect date format)
- Receives error, tries 2 more times with same parameters
- Gives up on weather, provides generic activity suggestions
Result: Partial completion, 12 turns
AgentFlow Behavior:
- Plan: "Get location, retrieve weather, match activities to conditions"
- Execute: Location API call
- Execute: Weather API call with incorrect format
- Verify: Detects API error
- Replan: "Check API documentation format, retry with correct parameters"
- Execute: Successful weather retrieval
- Generate: Activity suggestions based on actual forecast
Result: Full completion, 7 turns, successful error recovery
Implementation Insights
Training Infrastructure Requirements
- Compute: 4-8 GPUs (A100/H100) for 7B model training
- Environment Simulation: Parallelized task environments supporting 32-64 concurrent rollouts
- Training Duration: 2-5 days depending on task complexity and number of environments
- Data Requirements: 50K-200K trajectories generated during online training (no pre-collected dataset needed)
Hyperparameter Settings
| Parameter | Value | Notes |
|---|---|---|
| Learning Rate | 1e-5 | Lower than standard SFT due to policy gradient instability |
| PPO Clip Epsilon | 0.2 | Standard value works well |
| Batch Size | 64 trajectories | Smaller than typical RL due to trajectory length |
| Epochs per Batch | 4 | Multiple updates per collected batch |
| Max Turns | 10 | Task-dependent; allows sufficient planning cycles |
| Temperature (Rollout) | 0.8 | Higher than inference for exploration |
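For reference, the same settings expressed as a plain Python dictionary. This is a hypothetical configuration layout; the field names are illustrative, not the framework's actual schema.

```python
# Hypothetical training config mirroring the table above; field names are illustrative.
flow_grpo_config = {
    "learning_rate": 1e-5,        # lower than typical SFT for stability
    "clip_epsilon": 0.2,          # PPO-style clipping
    "batch_size": 64,             # trajectories per update
    "epochs_per_batch": 4,        # gradient epochs over each collected batch
    "max_turns": 10,              # planning turns per trajectory
    "rollout_temperature": 0.8,   # higher than inference to encourage exploration
}
```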
Conclusions
AgentFlow represents a paradigm shift in building trainable agentic systems for complex, tool-using tasks. Key takeaways:
Core Innovations
- Modular Architecture with Explicit Memory: Decomposing agent work into specialized, coordinated modules with evolving memory enables more reliable planning and execution than monolithic approaches
- In-the-Flow Optimization: Training directly within live multi-turn interactions dramatically outperforms post-hoc training on fixed datasets, achieving 20-30% relative improvements
- Flow-GRPO Algorithm: Novel training method solving long-horizon sparse-reward credit assignment through trajectory-level outcome broadcasting and turn-level group normalization
- Efficient Scaling: 7B models trained with AgentFlow match or exceed GPT-4o performance across diverse benchmarks, demonstrating 20-25× compute efficiency gains
Performance Highlights
- Search Tasks: +14.9% average improvement over baselines
- Agentic Tasks: +14.0% average improvement, with 87% tool selection accuracy
- Mathematical Reasoning: +14.5% over similar-scale models, approaching GPT-4o performance
- Error Recovery: 68% successful replanning rate vs 34% for monolithic models
For Practitioners Building Agent Systems
- Invest in Modular Design: Explicit separation of planning, execution, verification, and generation pays significant dividends in reliability and debuggability
- Enable In-the-Flow Learning: If you have task environments, in-the-flow training dramatically outperforms offline alternatives despite increased complexity
- Build for Error Recovery: Explicit verifier modules with replanning capabilities are critical for production reliability
- Scale Smartly: Smaller models with AgentFlow can replace much larger monolithic models for many tasks, reducing costs while maintaining or improving performance
- Plan for Continuous Improvement: Architecture supports ongoing learning from production interactions, enabling systems that improve over time
Broader Impact
AgentFlow demonstrates that trainable agentic systems can be practical at accessible model scales (7B parameters), democratizing advanced agent capabilities beyond organizations with access to the largest proprietary models. The open-source implementation enables reproducible research and practical deployment of sophisticated tool-using agents across diverse domains.
Looking Forward: The success of modular, trainable agent architectures opens exciting research directions including hierarchical planning, learned memory management, multi-agent collaboration, and end-to-end optimization of all agent components. As environments become more complex and tools more numerous, the ability to train specialized, coordinated agent modules will become increasingly critical for building reliable AI systems.
References
- Lu, P., Mao, S., Zhang, Q., et al. "AgentFlow: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use." arXiv preprint arXiv:2510.05592 [cs.AI], October 2025.
- Benchmarks: WebShop (Yao et al., 2022), ALFWorld (Shridhar et al., 2021), ToolBench (Qin et al., 2023), Mind2Web (Deng et al., 2023), WebArena (Zhou et al., 2024), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), ScienceQA (Lu et al., 2022)
- Related Work: ReAct (Yao et al., 2023), FireAct (Chen et al., 2024), PPO (Schulman et al., 2017), DPO (Rafailov et al., 2023)
- Reinforcement Learning: GRPO, Group Relative Policy Optimization (Shao et al., 2024)
Report compiled for AI Agent Engineering Research Collection