Towards a Science of Scaling Agent Systems
Quantitative Principles for Multi-Agent Coordination

Yubin Kim, Ken Gu, Chanwoo Park, Samuel Schmidgall, Yuzhe Yang, Xuhai Xu, Yilun Du et al.
Google Research, Google DeepMind, MIT
December 2025

Executive Summary

This landmark study establishes quantitative scaling principles for agent systems through controlled evaluation across 180 configurations spanning three LLM families (OpenAI, Google, Anthropic) and four benchmarks. The central finding challenges the "more agents is better" assumption: multi-agent systems demonstrate highly heterogeneous performance ranging from +81% improvement to -70% degradation depending entirely on task structure.

The research introduces a predictive mixed-effects model achieving a cross-validated R² of 0.513 without dataset-specific parameters, identifying three dominant effects: the tool-coordination trade-off, a capability saturation ceiling at ~45% baseline accuracy, and architecture-dependent error amplification ranging from 4.4× (centralized) to 17.2× (independent). The model predicts the optimal architecture for 87% of held-out configurations.

This work moves agent deployment from heuristic "add more agents" guidance to principled, measurement-driven architecture selection based on task properties and coordination metrics.

🎯 ELI5: When Does Teamwork Help?

Imagine you're moving furniture. For moving a couch, two people are better than one—you can coordinate and lift together. But for packing books into boxes? Adding helpers creates chaos: people bump into each other, grab the same books, and waste time coordinating. The same happens with AI agents. Financial analysis (like moving a couch) benefits from multiple agents dividing spreadsheets and verifying each other's math. But web browsing (like packing books) gets worse with more agents—they visit the same pages, contradict each other, and waste compute coordinating. This paper provides the math to predict which tasks are "couch moves" vs "book packing" before you build anything.

Figure 1: Agent scaling across model intelligence (x-axis) and system architectures. Performance changes (e.g., +8.1%, -4.6%) show MAS improvement or degradation versus the single-agent baseline. Multi-agent benefits are highly task- and architecture-dependent.

Part 1: The Multi-Agent Scaling Question

Prior claims like "More agents is all you need" lack empirical support. Practitioners face a critical gap: when does multi-agent coordination provide value versus simple single-agent approaches? This study addresses that gap through the largest controlled evaluation of agent architectures to date.

The Core Problem

The field has conflated two distinct phenomena: scaling on non-agentic benchmarks and scaling on truly agentic tasks. Previous multi-agent evaluations on non-agentic benchmarks showed diminishing returns as base models improved, but truly agentic tasks, which demand tool use, environment interaction, and iterative refinement, remain understudied.

Study Design

The evaluation spans 180 controlled configurations with standardized tools, prompts, and metrics to isolate architectural effects: five architectures, four benchmarks (Finance Agent, WorkBench, BrowseComp-Plus, PlanCraft), and models from three LLM families (OpenAI, Google, Anthropic).

Part 2: Five Canonical Architectures

The study formally defines an agent system as 𝒮=(A,E,C,Ω) comprising agents (A), shared environment (E), communication topology (C), and orchestration policy (Ω). Five architectures represent the spectrum from isolated to highly coordinated:

| Architecture | Agents | Communication | Complexity | Use Case |
|---|---|---|---|---|
| Single-Agent (SAS) | 1 | None | O(k) | Sequential reasoning, simple tasks |
| Independent MAS | n | None | O(nk) | Embarrassingly parallel tasks |
| Decentralized MAS | n | Peer-to-peer mesh | O(dnk) | Collaborative exploration |
| Centralized MAS | n+1 | Hub-and-spoke | O(rnk) | Structured verification tasks |
| Hybrid MAS | n+m | Hierarchical + peer | O((r+d)nk) | Complex multi-phase tasks |
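
To make the formal definition concrete, here is a minimal Python sketch of the tuple 𝒮=(A,E,C,Ω) and the five topologies. The class names, fields, and the message_cost helper are illustrative assumptions, not the authors' implementation; the cost formulas simply instantiate the complexity column above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable

class Topology(Enum):
    """The five canonical communication topologies (C)."""
    SINGLE = auto()         # 1 agent, no communication
    INDEPENDENT = auto()    # n agents, no communication
    DECENTRALIZED = auto()  # n agents, peer-to-peer mesh
    CENTRALIZED = auto()    # n workers + 1 orchestrator, hub-and-spoke
    HYBRID = auto()         # n workers + m coordinators, hierarchical + peer

@dataclass
class AgentSystem:
    """S = (A, E, C, Omega): agents, environment, topology, orchestration."""
    agents: list                # A: the set of agents
    environment: object         # E: shared environment (tools, state)
    topology: Topology          # C: communication topology
    orchestrate: Callable       # Omega: policy routing tasks and messages

    def message_cost(self, k: int, d: int = 1, r: int = 1) -> int:
        """Per-turn message count implied by the complexity column above
        (k = messages per agent, d = peer degree, r = orchestrator rounds)."""
        n = len(self.agents)
        return {
            Topology.SINGLE: k,
            Topology.INDEPENDENT: n * k,
            Topology.DECENTRALIZED: d * n * k,
            Topology.CENTRALIZED: r * n * k,
            Topology.HYBRID: (r + d) * n * k,
        }[self.topology]

team = AgentSystem(agents=["analyst", "checker", "scout"], environment=None,
                   topology=Topology.CENTRALIZED, orchestrate=lambda task: task)
print(team.message_cost(k=4, r=2))  # 2 * 3 * 4 = 24 messages per turn
```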

Communication vs. Coordination

A critical distinction emerges: Communication is message passing between agents; Coordination is strategic direction through task decomposition and progress monitoring. Independent MAS has neither. Decentralized MAS has communication but limited coordination. Centralized MAS has both, with an orchestrator managing the coordination layer.

Part 3: Domain-Dependent Results

The headline finding: MAS performance is not uniformly positive or negative. The mean change across all configurations is -3.5% (95% CI: [-18.6%, +25.7%]) with enormous spread (σ = 45.2%). Task structure determines everything.

Figure 2: Comparative performance of single-agent vs. multi-agent systems across four benchmarks. Finance Agent shows strong MAS benefits while PlanCraft shows universal degradation.
| Benchmark | Best MAS Architecture | Best Δ vs SAS | Worst MAS Architecture | Worst Δ vs SAS |
|---|---|---|---|---|
| Finance Agent | Centralized | +80.9% | Independent | -12.3% |
| WorkBench | Decentralized | +5.7% | Centralized | -1.2% |
| BrowseComp-Plus | Decentralized | +9.2% | Centralized | +0.2% |
| PlanCraft | Hybrid | -39% | Independent | -70% |

Why Finance Agent Succeeds, PlanCraft Fails

Finance Agent benefits from MAS because the work decomposes into parallel subtasks (separate filings, spreadsheets, and calculations), the single-agent baseline sits well below the ~45% ceiling, and its dominant failure modes, numerical drift and logical contradiction, are exactly the errors an orchestrator's verification catches.

PlanCraft suffers because its steps are strictly sequential: each action depends on the previous one, so decomposition yields no parallelism, and every added agent contributes coordination overhead and new paths for error propagation.

Part 4: The Scaling Principle

The research introduces a quantitative model predicting when MAS helps or hurts, achieving R²=0.513 on held-out data—explaining over half the variance without any dataset-specific parameters.

Δ Performance = f(Intelligence, Tools, Agents, Baseline) + interaction terms

The complete model has 20 parameters, including 9 key interaction terms.
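
As a structural sketch of how such a model can be fit, the snippet below builds a design matrix of main effects plus interaction terms and solves it by ordinary least squares. The feature names, the synthetic data, and the reduced 8-column design (versus the paper's 20 parameters) are all assumptions for illustration, and plain OLS stands in for the paper's mixed-effects estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cfg = 180  # one row per evaluated configuration

# Standardized covariates; names are illustrative, not the paper's exact ones.
intelligence = rng.normal(size=n_cfg)  # base-model capability
tools = rng.normal(size=n_cfg)         # tool-use intensity of the task
agents = rng.normal(size=n_cfg)        # (scaled) number of agents
baseline = rng.normal(size=n_cfg)      # single-agent baseline accuracy

# Main effects plus a few interactions, mirroring the model's structure
# (the full model has 20 parameters; 8 shown here for brevity).
X = np.column_stack([
    np.ones(n_cfg),          # intercept
    intelligence, tools, agents, baseline,
    intelligence ** 2,       # "intelligence quadratic"
    agents * tools,          # "efficiency-tools trade-off"
    agents * baseline,       # "baseline paradox"
])
y = rng.normal(size=n_cfg)   # stand-in for observed delta-performance

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta.round(3))         # one coefficient per column of X
```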

Key Interaction Effects

| Interaction | Coefficient (β) | p-value | Interpretation |
|---|---|---|---|
| Efficiency-Tools Trade-off | -0.330 | <0.001 | Tool-heavy tasks suffer from MAS overhead |
| Baseline Paradox | -0.408 | <0.001 | High SAS performance leaves little room for MAS gains |
| Overhead-Complexity | -0.141 | <0.001 | Coordination overhead scales non-linearly with task complexity |
| Error Propagation | -0.097 | 0.007 | Errors propagate more severely in tool-rich environments |
| Intelligence Quadratic | +0.256 | 0.010 | Accelerating returns at higher capability levels |
| Redundancy Benefit | +0.041 | 0.040 | Marginal positive effect with agent scaling |
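
As a rough reading of the table, the published coefficients can be combined into a partial score on standardized features. The pairing of each β with its features is inferred from the interpretation column, and the intercept and main effects are omitted because this summary does not report them; treat the result as a sign-and-magnitude illustration only.

```python
# Interaction coefficients from the table above.
BETA = {
    "efficiency_tools":       -0.330,  # agents x tool intensity
    "baseline_paradox":       -0.408,  # agents x SAS baseline
    "overhead_complexity":    -0.141,  # overhead x task complexity
    "error_propagation":      -0.097,  # error rate x tool intensity
    "intelligence_quadratic": +0.256,  # capability squared
    "redundancy_benefit":     +0.041,  # agent count
}

def interaction_score(z: dict) -> float:
    """Sum of interaction terms on standardized features `z` (which feature
    each beta multiplies is inferred, not stated in the summary)."""
    return (BETA["efficiency_tools"] * z["agents"] * z["tools"]
            + BETA["baseline_paradox"] * z["agents"] * z["baseline"]
            + BETA["overhead_complexity"] * z["overhead"] * z["complexity"]
            + BETA["error_propagation"] * z["error_rate"] * z["tools"]
            + BETA["intelligence_quadratic"] * z["intelligence"] ** 2
            + BETA["redundancy_benefit"] * z["agents"])

# A tool-heavy task with a strong single-agent baseline: the two largest
# negative interactions dominate, so the MAS delta comes out negative.
z = dict(agents=1.0, tools=1.5, baseline=1.2, overhead=1.0,
         complexity=1.0, error_rate=0.5, intelligence=0.3)
print(round(interaction_score(z), 3))  # ~ -1.13: MAS likely hurts here
```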

The 45% Threshold

A critical decision boundary emerges at ~45% single-agent baseline accuracy. Below this threshold, coordination can provide substantial gains. Above it, the "capability saturation ceiling" kicks in—the single agent already solves most cases, leaving limited room for MAS improvement while still incurring coordination overhead.

This threshold is derived from the model's β ratios without dataset-specific parameters, making it generalizable across domains.

Figure 3: Cost-performance trade-offs across model families. Family-dependent coordination efficacy patterns emerge: OpenAI models show consistent MAS gains, while Anthropic models show higher variance and greater sensitivity to coordination overhead.

Part 5: Error Dynamics & Coordination Overhead

Multi-agent systems don't just add compute—they fundamentally change error dynamics. The study reveals dramatic differences in how architectures amplify or absorb errors.

Error Amplification by Architecture

| Architecture | Error Amplification | Coordination Overhead | Turns (vs SAS) | Efficiency (Ec) |
|---|---|---|---|---|
| Single-Agent (SAS) | 1.0× (baseline) | 0% | 7.2 (1.0×) | 0.466 |
| Independent MAS | 17.2× | 58% | 11.4 (1.6×) | 0.234 |
| Decentralized MAS | 7.8× | 263% | 26.1 (3.6×) | 0.132 |
| Centralized MAS | 4.4× | 285% | 27.7 (3.8×) | 0.120 |
| Hybrid MAS | 5.1× | 515% | 44.3 (6.2×) | 0.074 |

Why Independent MAS Fails Universally

Independent MAS shows 17.2× error amplification—the worst of any architecture—because it has no inter-agent verification mechanism. Agents make independent mistakes that compound without any correction layer. The study recommends avoiding Independent MAS entirely in production deployments.

In contrast, Centralized MAS achieves only 4.4× amplification because the orchestrator provides a validation bottleneck that catches errors before propagation.
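
A toy Monte Carlo (not the paper's methodology; the error and catch probabilities are invented) makes the mechanism concrete: uncorrected per-step errors compound across independent agents, while a verification layer that intercepts even half of the bad steps sharply lowers amplification. The toy reproduces the ordering, not the paper's measured magnitudes.

```python
import random

random.seed(0)

def failure_rate(n_agents=1, steps=8, p_err=0.05, p_catch=0.0, trials=20_000):
    """Fraction of trials where at least one agent ends with an uncorrected
    error; p_catch models a verification layer (e.g. an orchestrator)
    intercepting a bad step before it propagates."""
    failures = 0
    for _ in range(trials):
        failed = False
        for _ in range(n_agents):
            for _ in range(steps):
                if random.random() < p_err and random.random() >= p_catch:
                    failed = True
                    break  # one uncorrected error spoils this agent's work
            if failed:
                break
        failures += failed
    return failures / trials

sas = failure_rate()                              # 1 agent, no verification
indep = failure_rate(n_agents=3)                  # errors compound unchecked
central = failure_rate(n_agents=3, p_catch=0.5)   # half of bad steps caught

print(f"SAS          {sas:.1%}")
print(f"Independent  {indep:.1%}  ({indep / sas:.1f}x SAS)")
print(f"Centralized  {central:.1%}  ({central / sas:.1f}x SAS)")
```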

Error Categories by Architecture

| Error Type | SAS Baseline Rate | Change Under MAS | Architecture |
|---|---|---|---|
| Logical Contradiction | 12.3-18.7% | -36.4% (to 9.1%) | Centralized |
| Numerical Drift | 20.9-24.1% | -12.4% (to 18.3%) | Centralized/Decentralized |
| Context Omission | 15.8-25.2% | -66.8% (to 8.3%) | Centralized |
| Coordination Failure | 0% (N/A) | +12.4% | Hybrid (worst) |

The Optimal Coordination Band

Success correlates logarithmically with message density: S = 0.73 + 0.28·ln(c), R²=0.68. Performance plateaus near c*=0.39 messages/turn. The optimal coordination band is 200%-300% overhead—enough communication to catch errors, not so much that protocol complexity introduces new failure modes.

Above 400% overhead, coordination failures emerge as a new error category unique to MAS, offsetting error reduction benefits.
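
To see the diminishing returns directly, this snippet evaluates the reported fit at a few message densities (the sample densities are arbitrary); the marginal gain dS/dc = 0.28/c shrinks as density grows, which is the quantitative reason extra chatter stops paying off near c* = 0.39.

```python
import math

def success(c: float) -> float:
    """Reported fit of success vs. message density c (messages/turn):
    S = 0.73 + 0.28 * ln(c), R^2 = 0.68."""
    return 0.73 + 0.28 * math.log(c)

for c in (0.1, 0.2, 0.39, 0.6):
    # Marginal gain of one more message per turn decays as 0.28 / c.
    print(f"c = {c:.2f}  S = {success(c):.3f}  dS/dc = {0.28 / c:.2f}")
```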

Figure 4: Scaling the number of agents reveals model-dependent coordination limits. Different LLMs show distinct patterns: Gemini-2.0 Flash peaks at 3 agents, while Gemini-2.5 Pro continues to improve up to 5.

Part 6: Practical Guidelines

The model achieves 87% accuracy predicting optimal architectures on held-out configurations. Based on these results, the following decision framework emerges:

When to Use Single-Agent Systems

Stay with SAS when the single-agent baseline already exceeds the ~45% accuracy ceiling, when the task has strict sequential dependencies (as in PlanCraft), or when it is tool-heavy, where the efficiency-tools trade-off means coordination overhead outweighs any gain.

When to Use Centralized MAS

Choose Centralized MAS when the task decomposes into parallelizable subtasks that benefit from explicit verification, as in Finance Agent (+80.9%), and the single-agent baseline sits below the ~45% ceiling; the orchestrator's validation bottleneck holds error amplification to 4.4×, the lowest of any MAS.

When to Use Decentralized MAS

Choose Decentralized MAS for collaborative exploration where peer-to-peer exchange helps but a central bottleneck adds little, as on WorkBench (+5.7%) and BrowseComp-Plus (+9.2%); keep message density inside the optimal 200%-300% coordination band.

Universal Recommendation: Avoid Independent MAS

Across all benchmarks, model families, and task types, Independent MAS universally underperformed. The lack of any verification mechanism leads to 17.2× error amplification with no compensating benefits. If you need multiple agents, invest in coordination infrastructure—the overhead pays for itself in error reduction.
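
The framework above collapses into a first-pass selection rule. Below is a minimal sketch with the thresholds taken from this summary (the ~45% baseline ceiling and the sequential/parallel task split); the function name and boolean inputs are illustrative, and each input must be assessed against your own task.

```python
def choose_architecture(baseline_acc: float,
                        parallelizable: bool,
                        needs_verification: bool,
                        sequential_deps: bool) -> str:
    """First-pass architecture choice from the decision framework above.
    baseline_acc is the *measured* single-agent accuracy on the task."""
    if baseline_acc > 0.45 or sequential_deps:
        # Capability saturation ceiling, or a PlanCraft-style sequential
        # task: coordination overhead likely exceeds any benefit.
        return "single-agent (SAS)"
    if parallelizable and needs_verification:
        # Finance-Agent-style tasks: an orchestrator validates subresults.
        return "centralized MAS"
    if parallelizable:
        # Exploration-style tasks (WorkBench, BrowseComp-Plus).
        return "decentralized MAS"
    return "single-agent (SAS)"  # and never independent MAS

print(choose_architecture(0.31, True, True, False))  # -> centralized MAS
print(choose_architecture(0.62, True, True, False))  # -> single-agent (SAS)
```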

Figure 5: Agent heterogeneity effects on multi-agent performance on BrowseComp-Plus. Capability mixing yields different results across LLM families: some benefit from diverse agent capabilities, others perform better with homogeneous teams.

Conclusion

This research transforms multi-agent system design from heuristic intuition into quantitative science. The core findings challenge prevailing assumptions: multi-agent gains are sharply task-dependent, ranging from +80.9% (Finance Agent) to -70% (PlanCraft); a ~45% single-agent baseline marks a capability saturation ceiling beyond which coordination rarely pays; every MAS architecture amplifies errors (4.4× to 17.2×), with verification infrastructure the only effective damper; and a 20-parameter model predicts the optimal architecture for 87% of held-out configurations.

The practical implication: before adding agents, measure your single-agent baseline. If it exceeds 45% accuracy on your task, coordination overhead likely exceeds benefits. If parallelizable subtasks exist, centralized MAS with explicit verification yields the best error-adjusted performance. The era of "more agents is better" is over—principled architecture selection based on task properties is the path forward.

Primary Sources

Towards a Science of Scaling Agent Systems
Kim, Gu, Park, Schmidgall, et al., December 2025