This landmark study establishes quantitative scaling principles for agent systems through controlled evaluation across 180 configurations spanning three LLM families (OpenAI, Google, Anthropic) and four benchmarks. The central finding challenges the "more agents is better" assumption: multi-agent systems demonstrate highly heterogeneous performance ranging from +81% improvement to -70% degradation depending entirely on task structure.
The research introduces a predictive mixed-effects model that achieves a cross-validated R² of 0.513 without dataset-specific parameters, identifying three dominant effects: a tool-coordination trade-off, a capability saturation ceiling at roughly 45% single-agent baseline accuracy, and architecture-dependent error amplification ranging from 4.4× (centralized) to 17.2× (independent). The model predicts the optimal architecture for 87% of held-out configurations.
This work moves agent deployment from heuristic "add more agents" guidance to principled, measurement-driven architecture selection based on task properties and coordination metrics.
Imagine you're moving furniture. For moving a couch, two people are better than one—you can coordinate and lift together. But for packing books into boxes? Adding helpers creates chaos: people bump into each other, grab the same books, and waste time coordinating. The same happens with AI agents. Financial analysis (like moving a couch) benefits from multiple agents dividing spreadsheets and verifying each other's math. But web browsing (like packing books) gets worse with more agents—they visit the same pages, contradict each other, and waste compute coordinating. This paper provides the math to predict which tasks are "couch moves" vs "book packing" before you build anything.
Prior claims like "More agents is all you need" lack empirical support. Practitioners face a critical gap: when does multi-agent coordination provide value versus simple single-agent approaches? This study addresses that gap through the largest controlled evaluation of agent architectures to date.
The field has conflated two distinct phenomena: multi-agent scaling on non-agentic benchmarks, where previous evaluations showed diminishing returns as base models improved, and scaling on truly agentic tasks, with tool use, environment interaction, and iterative refinement, which remain understudied.
The evaluation spans 180 controlled configurations with standardized tools, prompts, and metrics to isolate architectural effects.
The study formally defines an agent system as 𝒮=(A,E,C,Ω) comprising agents (A), shared environment (E), communication topology (C), and orchestration policy (Ω). Five architectures represent the spectrum from isolated to highly coordinated:
| Architecture | Agents | Communication | Complexity | Use Case |
|---|---|---|---|---|
| Single-Agent (SAS) | 1 | None | O(k) | Sequential reasoning, simple tasks |
| Independent MAS | n | None | O(nk) | Embarrassingly parallel tasks |
| Decentralized MAS | n | Peer-to-peer mesh | O(dnk) | Collaborative exploration |
| Centralized MAS | n+1 | Hub-and-spoke | O(rnk) | Structured verification tasks |
| Hybrid MAS | n+m | Hierarchical + peer | O((r+d)nk) | Complex multi-phase tasks |
A critical distinction emerges: Communication is message passing between agents; Coordination is strategic direction through task decomposition and progress monitoring. Independent MAS has neither. Decentralized MAS has communication but limited coordination. Centralized MAS has both, with an orchestrator managing the coordination layer.
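To make the communication/coordination distinction concrete, here is a minimal sketch of the five architectures as declarative specs. The representation and field names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchSpec:
    """Illustrative spec for one architecture (names are ours, not the paper's)."""
    name: str
    agents: str          # agent count in terms of n workers (and m sub-orchestrators)
    communication: bool  # agents exchange messages with peers
    coordination: bool   # an orchestrator decomposes tasks and monitors progress

ARCHITECTURES = [
    ArchSpec("Single-Agent (SAS)", "1",     communication=False, coordination=False),
    ArchSpec("Independent MAS",    "n",     communication=False, coordination=False),
    ArchSpec("Decentralized MAS",  "n",     communication=True,  coordination=False),
    ArchSpec("Centralized MAS",    "n + 1", communication=True,  coordination=True),
    ArchSpec("Hybrid MAS",         "n + m", communication=True,  coordination=True),
]
```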
The headline finding: MAS performance is not uniformly positive or negative. The mean improvement across all configurations is -3.5% (95% CI: [-18.6%, +25.7%]) with massive variance (σ=45.2%). Task structure determines everything.
| Benchmark | Best MAS Architecture | Δ vs SAS | Worst MAS Architecture | Δ vs SAS |
|---|---|---|---|---|
| Finance Agent | Centralized | +80.9% | Independent | -12.3% |
| WorkBench | Decentralized | +5.7% | Centralized | -1.2% |
| BrowseComp-Plus | Decentralized | +9.2% | Centralized | +0.2% |
| PlanCraft | Hybrid | -39% | Independent | -70% |
Finance Agent benefits from MAS because its workload decomposes into parallelizable subtasks (agents divide the spreadsheets among themselves) and a centralized orchestrator can verify each agent's arithmetic before errors propagate.
PlanCraft suffers because its plans are sequential and state-dependent: subtasks cannot be parallelized, so each added agent contributes coordination overhead and error amplification without any offsetting gain.
The research introduces a quantitative model predicting when MAS helps or hurts, achieving R²=0.513 on held-out data—explaining over half the variance without any dataset-specific parameters.
| Interaction | Coefficient (β) | p-value | Interpretation |
|---|---|---|---|
| Efficiency-Tools Trade-off | -0.330 | <0.001 | Tool-heavy tasks suffer from MAS overhead |
| Baseline Paradox | -0.408 | <0.001 | High SAS performance leaves little room for MAS gains |
| Overhead-Complexity | -0.141 | <0.001 | Coordination overhead scales non-linearly with task complexity |
| Error Propagation | -0.097 | 0.007 | Errors propagate more severely in tool-rich environments |
| Intelligence Quadratic | +0.256 | 0.010 | Accelerating returns at higher capability levels |
| Redundancy Benefit | +0.041 | 0.040 | Marginal positive effect with agent scaling |
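As a rough illustration of how these fixed effects combine, the sketch below scores a configuration as a linear combination of standardized interaction features using the published coefficients. The feature keys, standardization, and the omission of the intercept and random effects are all assumptions; the paper's exact model specification is not reproduced here.

```python
# Published fixed-effect coefficients (from the table above).
BETAS = {
    "efficiency_tools":    -0.330,  # tool-heavy tasks suffer from MAS overhead
    "baseline":            -0.408,  # high SAS performance leaves little headroom
    "overhead_complexity": -0.141,  # coordination overhead x task complexity
    "error_propagation":   -0.097,  # errors spread in tool-rich environments
    "intelligence_sq":     +0.256,  # accelerating returns at higher capability
    "redundancy":          +0.041,  # marginal benefit of extra agents
}

def predicted_gain(features: dict[str, float]) -> float:
    """Hypothetical linear score: sum of beta * standardized feature.

    `features` must be standardized interaction terms keyed like BETAS.
    The intercept and random effects are omitted, so only the sign and
    ordering of scores are meaningful in this sketch.
    """
    return sum(BETAS[k] * v for k, v in features.items())
```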
A critical decision boundary emerges at ~45% single-agent baseline accuracy. Below this threshold, coordination can provide substantial gains. Above it, the "capability saturation ceiling" kicks in—the single agent already solves most cases, leaving limited room for MAS improvement while still incurring coordination overhead.
This threshold is derived from the model's β ratios without dataset-specific parameters, making it generalizable across domains.
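One way to see how a threshold can fall out of the β ratios: under a simplified form of the model that is linear in baseline accuracy, the predicted MAS gain crosses zero where the headroom penalty offsets everything else. The constant below is back-solved from the reported ~45% boundary and is hypothetical, not a published parameter.

```python
BETA_BASELINE = -0.408       # published coefficient on SAS baseline accuracy
OTHER_TERMS = 0.408 * 0.45   # hypothetical: back-solved so gain(0.45) == 0

def expected_mas_gain(baseline_acc: float) -> float:
    """Simplified gain(b) = other_terms + beta_baseline * b (illustrative only)."""
    return OTHER_TERMS + BETA_BASELINE * baseline_acc

assert expected_mas_gain(0.30) > 0  # below the ceiling: coordination can pay off
assert expected_mas_gain(0.60) < 0  # above the ceiling: overhead dominates
```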
Multi-agent systems don't just add compute—they fundamentally change error dynamics. The study reveals dramatic differences in how architectures amplify or absorb errors.
| Architecture | Error Amplification | Coordination Overhead | Turns (vs SAS baseline) | Efficiency (Ec) |
|---|---|---|---|---|
| Single-Agent (SAS) | 1.0× (baseline) | 0% | 7.2 (1.0×) | 0.466 |
| Independent MAS | 17.2× | 58% | 11.4 (1.6×) | 0.234 |
| Decentralized MAS | 7.8× | 263% | 26.1 (3.6×) | 0.132 |
| Centralized MAS | 4.4× | 285% | 27.7 (3.8×) | 0.120 |
| Hybrid MAS | 5.1× | 515% | 44.3 (6.2×) | 0.074 |
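The coordination-overhead column follows directly from the turn counts: overhead is the fractional increase in turns over the SAS baseline. A quick check reproduces the table's percentages (the exact definition of the Ec efficiency metric is in the paper and not reconstructed here):

```python
SAS_TURNS = 7.2

TURNS = {
    "Independent MAS":   11.4,
    "Decentralized MAS": 26.1,
    "Centralized MAS":   27.7,
    "Hybrid MAS":        44.3,
}

for arch, turns in TURNS.items():
    overhead = (turns - SAS_TURNS) / SAS_TURNS
    print(f"{arch}: {overhead:.0%} coordination overhead")
# Overheads come out to roughly 58%, 263%, 285%, and 515%,
# matching the table up to rounding.
```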
Independent MAS shows 17.2× error amplification—the worst of any architecture—because it has no inter-agent verification mechanism. Agents make independent mistakes that compound without any correction layer. The study recommends avoiding Independent MAS entirely in production deployments.
In contrast, Centralized MAS achieves only 4.4× amplification because the orchestrator provides a validation bottleneck that catches errors before propagation.
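A minimal sketch of why the hub-and-spoke shape damps errors: every worker result passes through a single validation step before it can influence shared state. The function shape and validation hook are illustrative, not the paper's implementation.

```python
from typing import Callable

def centralized_round(
    workers: list[Callable[[str], str]],
    subtasks: list[str],
    validate: Callable[[str, str], bool],
) -> dict[str, str]:
    """One orchestrator round: fan out subtasks, accept only validated results.

    Invalid results are dropped (or could be re-queued), so a single agent's
    mistake never enters the shared state -- the "validation bottleneck".
    """
    accepted: dict[str, str] = {}
    for worker, task in zip(workers, subtasks):
        result = worker(task)
        if validate(task, result):   # orchestrator checks before propagation
            accepted[task] = result  # only verified work reaches shared state
    return accepted
```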
| Error Type | SAS Baseline | Δ under MAS | Architecture |
|---|---|---|---|
| Logical Contradiction | 12.3-18.7% | -36.4% (to 9.1%) | Centralized (best) |
| Numerical Drift | 20.9-24.1% | -12.4% (to 18.3%) | Centralized/Decentralized (best) |
| Context Omission | 15.8-25.2% | -66.8% (to 8.3%) | Centralized (best) |
| Coordination Failure | 0% (MAS-only error class) | +12.4% | Hybrid (worst) |
Success correlates logarithmically with message density: S = 0.73 + 0.28·ln(c), R²=0.68. Performance plateaus near c*=0.39 messages/turn. The optimal coordination band is 200%-300% overhead—enough communication to catch errors, not so much that protocol complexity introduces new failure modes.
Above 400% overhead, coordination failures emerge as a new error category unique to MAS, offsetting error reduction benefits.
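Plugging the fitted curve into code makes the plateau and the overhead band easy to inspect; the band check is an illustrative reading of the 200%-300% recommendation, not a published function.

```python
import math

def success_rate(msg_density: float) -> float:
    """Fitted curve from the paper: S = 0.73 + 0.28 * ln(c), R^2 = 0.68."""
    return 0.73 + 0.28 * math.log(msg_density)

def in_optimal_band(coordination_overhead: float) -> bool:
    """Illustrative check against the reported 200%-300% optimal band."""
    return 2.0 <= coordination_overhead <= 3.0

print(success_rate(0.39))  # ~0.47, near the reported plateau at c* = 0.39
```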
The model achieves 87% accuracy predicting optimal architectures on held-out configurations. Based on these results, a decision framework emerges (sketched in code below):
- Measure the single-agent baseline first. Above the ~45% ceiling, coordination overhead likely exceeds any MAS gain.
- If the task decomposes into parallelizable subtasks, prefer Centralized MAS with explicit orchestrator verification.
- Keep coordination overhead in the 200%-300% band; above 400%, coordination failures emerge as a new error class.
- Avoid Independent MAS in production.
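A compact encoding of that framework, with thresholds taken from the results above; the function shape and argument names are ours, not the paper's.

```python
def select_architecture(
    sas_baseline: float,    # single-agent accuracy on your task, in [0, 1]
    parallelizable: bool,   # does the task decompose into independent subtasks?
    sequential_state: bool, # do steps depend on earlier state (e.g., planning)?
) -> str:
    """Illustrative decision rule distilled from the study's findings."""
    if sas_baseline > 0.45:   # capability saturation ceiling
        return "Single-Agent (SAS)"
    if sequential_state:      # PlanCraft-like tasks degrade under MAS
        return "Single-Agent (SAS)"
    if parallelizable:        # Finance-Agent-like tasks: verify centrally
        return "Centralized MAS (orchestrator verification)"
    return "Decentralized MAS"  # exploration-style tasks, modest gains
```

Note that "Independent MAS" is never returned: the study found no configuration where it wins.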
Across all benchmarks, model families, and task types, Independent MAS universally underperformed. The lack of any verification mechanism leads to 17.2× error amplification with no compensating benefits. If you need multiple agents, invest in coordination infrastructure—the overhead pays for itself in error reduction.
This research transforms multi-agent system design from heuristic intuition to quantitative science. The core findings challenge prevailing assumptions: MAS performance is heterogeneous (-70% to +81%) rather than uniformly positive, a ~45% baseline ceiling bounds when coordination pays off, and architecture choice changes error amplification by a factor of four (4.4× centralized vs 17.2× independent).
The practical implication: before adding agents, measure your single-agent baseline. If it exceeds 45% accuracy on your task, coordination overhead likely exceeds benefits. If parallelizable subtasks exist, centralized MAS with explicit verification yields the best error-adjusted performance. The era of "more agents is better" is over—principled architecture selection based on task properties is the path forward.
Kim, Gu, Park, Schmidgall, et al. "Towards a Science of Scaling Agent Systems." December 2025.