This landmark study establishes quantitative scaling principles for agent systems through controlled evaluation across 180 configurations spanning three LLM families (OpenAI, Google, Anthropic) and four benchmarks. The central finding challenges the "more agents is better" assumption: multi-agent systems demonstrate highly heterogeneous performance ranging from +81% improvement to -70% degradation depending entirely on task structure.
The research introduces a predictive mixed-effects model that achieves a cross-validated R² of 0.513 without dataset-specific parameters, identifying three dominant effects: a tool-coordination trade-off, a capability saturation ceiling at roughly 45% single-agent baseline accuracy, and architecture-dependent error amplification ranging from 4.4× (centralized) to 17.2× (independent). The model predicts the optimal architecture for 87% of held-out configurations.
This work moves agent deployment from heuristic "add more agents" guidance to principled, measurement-driven architecture selection based on task properties and coordination metrics.
Imagine you're moving furniture. For moving a couch, two people are better than one—you can coordinate and lift together. But for packing books into boxes? Adding helpers creates chaos: people bump into each other, grab the same books, and waste time coordinating. The same happens with AI agents. Financial analysis (like moving a couch) benefits from multiple agents dividing spreadsheets and verifying each other's math. But web browsing (like packing books) gets worse with more agents—they visit the same pages, contradict each other, and waste compute coordinating. This paper provides the math to predict which tasks are "couch moves" vs "book packing" before you build anything.
Prior claims like "More agents is all you need" lack empirical support. Practitioners face a critical gap: when does multi-agent coordination provide value versus simple single-agent approaches? This study addresses that gap through the largest controlled evaluation of agent architectures to date.
The field has conflated two distinct phenomena: multi-agent scaling on non-agentic benchmarks, where previous evaluations showed diminishing returns as base models improved, and scaling on truly agentic tasks, with tool use, environment interaction, and iterative refinement, which remain understudied.
The evaluation spans 180 controlled configurations with standardized tools, prompts, and metrics to isolate architectural effects.
The study formally defines an agent system as 𝒮=(A,E,C,Ω) comprising agents (A), shared environment (E), communication topology (C), and orchestration policy (Ω). Five architectures represent the spectrum from isolated to highly coordinated:
| Architecture | Agents | Communication | Complexity | Use Case |
|---|---|---|---|---|
| Single-Agent (SAS) | 1 | None | O(k) | Sequential reasoning, simple tasks |
| Independent MAS | n | None | O(nk) | Embarrassingly parallel tasks |
| Decentralized MAS | n | Peer-to-peer mesh | O(dnk) | Collaborative exploration |
| Centralized MAS | n+1 | Hub-and-spoke | O(rnk) | Structured verification tasks |
| Hybrid MAS | n+m | Hierarchical + peer | O((r+d)nk) | Complex multi-phase tasks |
A critical distinction emerges: Communication is message passing between agents; Coordination is strategic direction through task decomposition and progress monitoring. Independent MAS has neither. Decentralized MAS has communication but limited coordination. Centralized MAS has both, with an orchestrator managing the coordination layer.
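To make the communication/coordination distinction concrete, here is a minimal sketch of the five architectures as declarative specs. The representation and field names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchSpec:
    """Illustrative spec for one architecture (names are ours, not the paper's)."""
    name: str
    agents: str          # agent count in terms of n workers (and m sub-orchestrators)
    communication: bool  # agents exchange messages with peers
    coordination: bool   # an orchestrator decomposes tasks and monitors progress

ARCHITECTURES = [
    ArchSpec("Single-Agent (SAS)", "1",     communication=False, coordination=False),
    ArchSpec("Independent MAS",    "n",     communication=False, coordination=False),
    ArchSpec("Decentralized MAS",  "n",     communication=True,  coordination=False),
    ArchSpec("Centralized MAS",    "n + 1", communication=True,  coordination=True),
    ArchSpec("Hybrid MAS",         "n + m", communication=True,  coordination=True),
]
```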
The headline finding: MAS performance is not uniformly positive or negative. The mean improvement across all configurations is -3.5% (95% CI: [-18.6%, +25.7%]) with massive variance (σ=45.2%). Task structure determines everything.
| Benchmark | Best MAS Architecture | Δ vs SAS | Worst MAS Architecture | Δ vs SAS |
|---|---|---|---|---|
| Finance Agent | Centralized | +80.9% | Independent | -12.3% |
| WorkBench | Decentralized | +5.7% | Centralized | -1.2% |
| BrowseComp-Plus | Decentralized | +9.2% | Centralized | +0.2% |
| PlanCraft | Hybrid | -39% | Independent | -70% |
Finance Agent benefits from MAS because its workload decomposes into parallelizable subtasks (agents divide the spreadsheets among themselves) and a centralized orchestrator can verify each agent's arithmetic before errors propagate.
PlanCraft suffers because its plans are sequential and state-dependent: subtasks cannot be parallelized, so each added agent contributes coordination overhead and error amplification without any offsetting gain.
The research introduces a quantitative model predicting when MAS helps or hurts, achieving R²=0.513 on held-out data—explaining over half the variance without any dataset-specific parameters.
| Interaction | Coefficient (β) | p-value | Interpretation |
|---|---|---|---|
| Efficiency-Tools Trade-off | -0.330 | <0.001 | Tool-heavy tasks suffer from MAS overhead |
| Baseline Paradox | -0.408 | <0.001 | High SAS performance leaves little room for MAS gains |
| Overhead-Complexity | -0.141 | <0.001 | Coordination overhead scales non-linearly with task complexity |
| Error Propagation | -0.097 | 0.007 | Errors propagate more severely in tool-rich environments |
| Intelligence Quadratic | +0.256 | 0.010 | Accelerating returns at higher capability levels |
| Redundancy Benefit | +0.041 | 0.040 | Marginal positive effect with agent scaling |
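As a rough illustration of how these fixed effects combine, the sketch below scores a configuration as a linear combination of standardized interaction features using the published coefficients. The feature keys, standardization, and the omission of the intercept and random effects are all assumptions; the paper's exact model specification is not reproduced here.

```python
# Published fixed-effect coefficients (from the table above).
BETAS = {
    "efficiency_tools":    -0.330,  # tool-heavy tasks suffer from MAS overhead
    "baseline":            -0.408,  # high SAS performance leaves little headroom
    "overhead_complexity": -0.141,  # coordination overhead x task complexity
    "error_propagation":   -0.097,  # errors spread in tool-rich environments
    "intelligence_sq":     +0.256,  # accelerating returns at higher capability
    "redundancy":          +0.041,  # marginal benefit of extra agents
}

def predicted_gain(features: dict[str, float]) -> float:
    """Hypothetical linear score: sum of beta * standardized feature.

    `features` must be standardized interaction terms keyed like BETAS.
    The intercept and random effects are omitted, so only the sign and
    ordering of scores are meaningful in this sketch.
    """
    return sum(BETAS[k] * v for k, v in features.items())
```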
A critical decision boundary emerges at ~45% single-agent baseline accuracy. Below this threshold, coordination can provide substantial gains. Above it, the "capability saturation ceiling" kicks in—the single agent already solves most cases, leaving limited room for MAS improvement while still incurring coordination overhead.
This threshold is derived from the model's β ratios without dataset-specific parameters, making it generalizable across domains.
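One way to see how a threshold can fall out of the β ratios: under a simplified form of the model that is linear in baseline accuracy, the predicted MAS gain crosses zero where the headroom penalty offsets everything else. The constant below is back-solved from the reported ~45% boundary and is hypothetical, not a published parameter.

```python
BETA_BASELINE = -0.408       # published coefficient on SAS baseline accuracy
OTHER_TERMS = 0.408 * 0.45   # hypothetical: back-solved so gain(0.45) == 0

def expected_mas_gain(baseline_acc: float) -> float:
    """Simplified gain(b) = other_terms + beta_baseline * b (illustrative only)."""
    return OTHER_TERMS + BETA_BASELINE * baseline_acc

assert expected_mas_gain(0.30) > 0  # below the ceiling: coordination can pay off
assert expected_mas_gain(0.60) < 0  # above the ceiling: overhead dominates
```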
Multi-agent systems don't just add compute—they fundamentally change error dynamics. The study reveals dramatic differences in how architectures amplify or absorb errors.
| Architecture | Error Amplification | Coordination Overhead | Turns (vs SAS baseline) | Efficiency (Ec) |
|---|---|---|---|---|
| Single-Agent (SAS) | 1.0× (baseline) | 0% | 7.2 (1.0×) | 0.466 |
| Independent MAS | 17.2× | 58% | 11.4 (1.6×) | 0.234 |
| Decentralized MAS | 7.8× | 263% | 26.1 (3.6×) | 0.132 |
| Centralized MAS | 4.4× | 285% | 27.7 (3.8×) | 0.120 |
| Hybrid MAS | 5.1× | 515% | 44.3 (6.2×) | 0.074 |
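The coordination-overhead column follows directly from the turn counts: overhead is the fractional increase in turns over the SAS baseline. A quick check reproduces the table's percentages (the exact definition of the Ec efficiency metric is in the paper and not reconstructed here):

```python
SAS_TURNS = 7.2

TURNS = {
    "Independent MAS":   11.4,
    "Decentralized MAS": 26.1,
    "Centralized MAS":   27.7,
    "Hybrid MAS":        44.3,
}

for arch, turns in TURNS.items():
    overhead = (turns - SAS_TURNS) / SAS_TURNS
    print(f"{arch}: {overhead:.0%} coordination overhead")
# Overheads come out to roughly 58%, 263%, 285%, and 515%,
# matching the table up to rounding.
```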
Independent MAS shows 17.2× error amplification—the worst of any architecture—because it has no inter-agent verification mechanism. Agents make independent mistakes that compound without any correction layer. The study recommends avoiding Independent MAS entirely in production deployments.
In contrast, Centralized MAS achieves only 4.4× amplification because the orchestrator provides a validation bottleneck that catches errors before propagation.
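A minimal sketch of why the hub-and-spoke shape damps errors: every worker result passes through a single validation step before it can influence shared state. The function shape and validation hook are illustrative, not the paper's implementation.

```python
from typing import Callable

def centralized_round(
    workers: list[Callable[[str], str]],
    subtasks: list[str],
    validate: Callable[[str, str], bool],
) -> dict[str, str]:
    """One orchestrator round: fan out subtasks, accept only validated results.

    Invalid results are dropped (or could be re-queued), so a single agent's
    mistake never enters the shared state -- the "validation bottleneck".
    """
    accepted: dict[str, str] = {}
    for worker, task in zip(workers, subtasks):
        result = worker(task)
        if validate(task, result):   # orchestrator checks before propagation
            accepted[task] = result  # only verified work reaches shared state
    return accepted
```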
| Error Type | SAS Baseline | Δ under MAS | Architecture |
|---|---|---|---|
| Logical Contradiction | 12.3-18.7% | -36.4% (to 9.1%) | Centralized (best) |
| Numerical Drift | 20.9-24.1% | -12.4% (to 18.3%) | Centralized/Decentralized (best) |
| Context Omission | 15.8-25.2% | -66.8% (to 8.3%) | Centralized (best) |
| Coordination Failure | 0% (MAS-only error class) | +12.4% | Hybrid (worst) |
Success correlates logarithmically with message density: S = 0.73 + 0.28·ln(c), R²=0.68. Performance plateaus near c*=0.39 messages/turn. The optimal coordination band is 200%-300% overhead—enough communication to catch errors, not so much that protocol complexity introduces new failure modes.
Above 400% overhead, coordination failures emerge as a new error category unique to MAS, offsetting error reduction benefits.
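Plugging the fitted curve into code makes the plateau and the overhead band easy to inspect; the band check is an illustrative reading of the 200%-300% recommendation, not a published function.

```python
import math

def success_rate(msg_density: float) -> float:
    """Fitted curve from the paper: S = 0.73 + 0.28 * ln(c), R^2 = 0.68."""
    return 0.73 + 0.28 * math.log(msg_density)

def in_optimal_band(coordination_overhead: float) -> bool:
    """Illustrative check against the reported 200%-300% optimal band."""
    return 2.0 <= coordination_overhead <= 3.0

print(success_rate(0.39))  # ~0.47, near the reported plateau at c* = 0.39
```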
The model achieves 87% accuracy predicting optimal architectures on held-out configurations. Based on these results, a decision framework emerges (sketched in code below):
- Measure the single-agent baseline first. Above the ~45% ceiling, coordination overhead likely exceeds any MAS gain.
- If the task decomposes into parallelizable subtasks, prefer Centralized MAS with explicit orchestrator verification.
- Keep coordination overhead in the 200%-300% band; above 400%, coordination failures emerge as a new error class.
- Avoid Independent MAS in production.
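A compact encoding of that framework, with thresholds taken from the results above; the function shape and argument names are ours, not the paper's.

```python
def select_architecture(
    sas_baseline: float,    # single-agent accuracy on your task, in [0, 1]
    parallelizable: bool,   # does the task decompose into independent subtasks?
    sequential_state: bool, # do steps depend on earlier state (e.g., planning)?
) -> str:
    """Illustrative decision rule distilled from the study's findings."""
    if sas_baseline > 0.45:   # capability saturation ceiling
        return "Single-Agent (SAS)"
    if sequential_state:      # PlanCraft-like tasks degrade under MAS
        return "Single-Agent (SAS)"
    if parallelizable:        # Finance-Agent-like tasks: verify centrally
        return "Centralized MAS (orchestrator verification)"
    return "Decentralized MAS"  # exploration-style tasks, modest gains
```

Note that "Independent MAS" is never returned: the study found no configuration where it wins.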
Across all benchmarks, model families, and task types, Independent MAS universally underperformed. The lack of any verification mechanism leads to 17.2× error amplification with no compensating benefits. If you need multiple agents, invest in coordination infrastructure—the overhead pays for itself in error reduction.
This research transforms multi-agent system design from heuristic intuition to quantitative science. The core findings challenge prevailing assumptions: MAS performance is heterogeneous (-70% to +81%) rather than uniformly positive, a ~45% baseline ceiling bounds when coordination pays off, and architecture choice changes error amplification by a factor of four (4.4× centralized vs 17.2× independent).
The practical implication: before adding agents, measure your single-agent baseline. If it exceeds 45% accuracy on your task, coordination overhead likely exceeds benefits. If parallelizable subtasks exist, centralized MAS with explicit verification yields the best error-adjusted performance. The era of "more agents is better" is over—principled architecture selection based on task properties is the path forward.
Kim, Gu, Park, Schmidgall, et al. "Towards a Science of Scaling Agent Systems." December 2025.