Measuring Agents in Production
An Empirical Study of 306 Practitioners

Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong et al.
UC Berkeley, Stanford, IBM Research, UIUC, Intesa Sanpaolo
December 2025

Executive Summary

This landmark empirical study surveys 306 practitioners and conducts 20 in-depth interviews across 26 application domains to understand how AI agents actually work in production. The findings challenge common assumptions: 68% of production agents execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation rather than automated benchmarks.

The core insight is that successful production teams deliberately trade capability for controllability. They accept limited autonomy in exchange for reliability, using constrained architectures with predefined workflows (80% of deployments) rather than open-ended autonomous agents. Despite the common claim that "95% of agent deployments fail," this research shows practitioners successfully deploy reliable systems serving real users through environmental and operational constraints.

🎯 ELI5: Production Agents vs Research Agents

Imagine the difference between a self-driving car in a research demo versus one actually transporting passengers. Research demos show cars navigating complex cities autonomously. But real taxi services? They run on fixed routes, have human backup drivers, and pull over if anything seems off. Production AI agents work the same way—they're deliberately "dumbed down" to be predictable and safe, not impressive and autonomous. Teams discovered that a reliable agent doing 5 simple steps is worth more than an impressive one that fails unpredictably on step 47.

Figure 1: Why organizations build agents—73% cite productivity and efficiency gains as the primary motivation, far outpacing automation (54.1%) or cost reduction (43.2%).

Part 1: The Reality Gap in Agent Research

Agent research has produced remarkable demonstrations—systems that browse the web, write code, and complete multi-step tasks autonomously. Yet the gap between research benchmarks and production deployment remains vast. This study provides the first large-scale empirical evidence of what actually works when AI agents serve real users.

The Autonomy-Reliability Tradeoff

The most striking finding is that production agents are deliberately constrained: 68% execute at most 10 steps before handing control back to a human, and 80% run predefined workflows rather than open-ended plans.

Teams have learned that controllability beats capability. A reliable 5-step agent delivers more value than an unreliable 50-step one.
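
As a concrete illustration, here is a minimal Python sketch of a constrained agent loop: a hard step budget and a human checkpoint before any risky action. It illustrates the architecture the study describes rather than any surveyed team's code; `plan_next_step` and `approve` are hypothetical placeholders.

```python
from dataclasses import dataclass

MAX_STEPS = 10  # hard budget: the loop never plans past this many steps

@dataclass
class Step:
    action: str
    needs_approval: bool = False  # e.g. anything that writes or sends data

def plan_next_step(task: str, history: list) -> "Step | None":
    """Placeholder for a model call that proposes the next step, or None when done."""
    return None

def run_constrained_agent(task: str, approve) -> list:
    """Run at most MAX_STEPS steps, pausing for a human before risky actions."""
    history: list[Step] = []
    for _ in range(MAX_STEPS):
        step = plan_next_step(task, history)
        if step is None:                       # the model signals completion
            break
        if step.needs_approval and not approve(step):
            break                              # escalate to a human instead of acting
        history.append(step)                   # a real system would execute the step here
    return history

# Usage: approve() could ping an operator; here every risky step is rejected.
run_constrained_agent("summarize this quarter's invoices", approve=lambda step: False)
```

The budget and the approval hook carry the safety story: the agent cannot wander past MAX_STEPS, and anything consequential waits for a person.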

Figure 2: Distribution across 26 application domains—finance (39.1%), technology (24.6%), and corporate services (23.2%) lead adoption, but agents span healthcare, legal, retail, and beyond.

Who's Actually Using Agents?

Contrary to the research focus on coding assistants, production agents serve diverse industries. Among surveyed deployments, 92.5% serve human end-users rather than other systems, and 66% tolerate response times of minutes or longer, suggesting complex, high-value tasks rather than real-time interactions.

| Domain | Percentage | Key Use Cases |
|---|---|---|
| Finance | 39.1% | Document processing, compliance, risk analysis |
| Technology | 24.6% | DevOps, code review, infrastructure management |
| Corporate Services | 23.2% | HR workflows, procurement, internal tools |
| Healthcare | 7.2% | Clinical documentation, research synthesis |
| Other | 5.9% | Legal, retail, manufacturing, education |

Part 2: Technical Patterns That Work

The study reveals surprisingly simple technical approaches dominate production. The sophisticated techniques emphasized in research—fine-tuning, complex planning, multi-agent systems—are rarely used in successful deployments.

Figure 3: Model selection in production—70% use off-the-shelf models without weight tuning. Prompting dominates over fine-tuning.
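
As a concrete illustration of "prompting instead of weight tuning", the sketch below steers an off-the-shelf model with a domain-specific system prompt and a few-shot example. The `complete` function is a stand-in for whichever hosted-model API a team uses; the prompt and example content are illustrative, not taken from the study.

```python
SYSTEM_PROMPT = (
    "You extract structured fields from finance documents. "
    "Answer with JSON only. If a field is missing, use null."
)

FEW_SHOT = [
    ("Invoice total is $1,200, due 2024-06-01.",
     '{"total": 1200, "due": "2024-06-01"}'),
]

def build_messages(document: str) -> list[dict]:
    """Assemble a chat-style message list: system prompt, examples, then the task."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for doc, answer in FEW_SHOT:  # few-shot examples stand in for fine-tuning
        messages.append({"role": "user", "content": doc})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": document})
    return messages

def complete(messages: list[dict]) -> str:
    """Placeholder for a hosted-model call; no weights are trained anywhere."""
    return "{}"

complete(build_messages("Invoice total is $980, due 2024-07-15."))
```

In this pattern, changing the agent's behavior means editing text and redeploying a prompt, not running a training job.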

What Production Teams Actually Use

The Framework Paradox

Despite the proliferation of agent frameworks (LangChain, AutoGPT, CrewAI), 85% of successful production systems are custom-built. Interview participants cited three reasons:

As one practitioner noted: "We tried [framework X] but couldn't debug failures. We rebuilt in plain Python in a week and never looked back."
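
The quote describes rebuilding in plain Python; below is a minimal sketch of what that can look like: an explicit tool registry and a workflow runner where every call is logged and can be stepped through in a debugger. The tools and the plan here are hypothetical placeholders.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

# Hypothetical tools; in production each would wrap a real internal API.
TOOLS = {
    "lookup_customer": lambda args: {"name": "ACME", "tier": "gold"},
    "draft_reply": lambda args: {"text": f"Hello {args['name']}"},
}

def run_workflow(plan: list[dict]) -> list[dict]:
    """Execute a predefined workflow; every call is logged and inspectable."""
    results = []
    for call in plan:
        log.info("calling %s with %s", call["tool"], json.dumps(call["args"]))
        results.append(TOOLS[call["tool"]](call["args"]))
    return results

run_workflow([
    {"tool": "lookup_customer", "args": {"id": 42}},
    {"tool": "draft_reply", "args": {"name": "ACME"}},
])
```

There is nothing clever here, which is the point: when something fails, the stack trace points at a specific line rather than at a framework internal.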

Figure 4: Framework adoption patterns—custom implementations dominate (85%), with LangChain leading among framework users. Most teams avoid framework abstractions for production reliability.

Part 3: The Evaluation Crisis

Perhaps the most concerning finding: production agent evaluation remains immature. Teams struggle to measure whether their agents actually work, relying heavily on human judgment rather than automated metrics.

Figure 5: Evaluation practices in production—74% rely primarily on human-in-the-loop evaluation. Only 25% use any formal benchmarks.

The Benchmark Gap

Research emphasizes benchmark performance (SWE-bench, WebArena, etc.), but production teams find these largely irrelevant:

"Production tasks are highly domain-specific; public benchmarks are rarely applicable."

| Evaluation Method | Usage Rate | Notes |
|---|---|---|
| Human-in-the-loop | 74% | Primary method for most teams |
| LLM-as-judge | 52% | Always with human verification |
| A/B testing | ~50% | Online evaluation preferred |
| Custom benchmarks | 25% | Built from internal data |
| Public benchmarks | <10% | Used only in early development |
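
The top two rows of the table combine naturally: score every output with an LLM judge, and route a random sample to human reviewers so judge scores are never trusted on their own. A minimal sketch, with `llm_judge` as a hypothetical placeholder for a judge-model call:

```python
import random

def llm_judge(task: str, output: str) -> float:
    """Hypothetical judge-model call returning a quality score in [0, 1]."""
    return 0.5

def evaluate(records: list[dict], human_sample_rate: float = 0.1) -> dict:
    """Score every record with the judge; queue a random sample for humans."""
    scores, human_queue = [], []
    for rec in records:
        scores.append(llm_judge(rec["task"], rec["output"]))
        if random.random() < human_sample_rate:  # human-in-the-loop spot check
            human_queue.append(rec)
    mean = sum(scores) / max(len(scores), 1)
    return {"judge_mean": mean, "human_queue": human_queue}

evaluate([{"task": "summarize contract", "output": "..."}])
```

The records themselves double as a custom benchmark: a growing set of internal tasks and known-good outputs, rather than a public leaderboard.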

Part 4: Top Challenges and Solutions

The study identifies reliability as the dominant challenge (37.9% cite it as their top technical focus), followed by evaluation difficulties. Interestingly, latency, which research often emphasizes, is cited as a critical blocker by only 14.8% of teams.

Figure 6: Impact of latency on deployment decisions—only 14.8% cite it as a critical blocker. Most production use cases tolerate minutes-long response times.

How Teams Achieve Reliability

Rather than solving reliability through better models, successful teams constrain the problem: they cap step counts, encode predefined workflows instead of open-ended plans, and put human checkpoints in front of consequential actions.

The Engineering Pattern: Trade Capability for Controllability

The study reveals a consistent engineering philosophy: successful production agents deliberately sacrifice capability for reliability. Teams that tried to build highly autonomous agents failed; teams that built constrained, predictable systems succeeded. This mirrors patterns in other high-stakes engineering domains—aviation, medical devices, nuclear systems—where reliability trumps capability.
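
One concrete way to constrain the problem is to treat every model output as untrusted: validate it against an expected schema, retry within a fixed budget, and escalate to a human instead of failing silently. A minimal sketch under those assumptions, with `generate` standing in for a model call:

```python
import json

def generate(prompt: str) -> str:
    """Placeholder for a model call that should return JSON."""
    return '{"status": "ok"}'

def is_valid(raw: str, required: set) -> bool:
    """Check that the output parses as JSON and contains the required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required <= data.keys()

def reliable_call(prompt: str, required: set, retries: int = 2) -> dict:
    """Bounded retries, then escalation: the agent never acts on unvalidated output."""
    for _ in range(retries + 1):
        raw = generate(prompt)
        if is_valid(raw, required):
            return json.loads(raw)
    return {"escalate_to_human": True, "prompt": prompt}

reliable_call("extract the invoice total", required={"status"})
```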

Part 5: Implications for Agent Development

This research suggests the agent field may be over-indexing on autonomy benchmarks while under-investing in reliability engineering. The path to production isn't more capable agents—it's more controllable ones.

Recommendations for Practitioners

| Research Focus | Production Reality | Gap |
|---|---|---|
| Long autonomous trajectories | ≤10 steps with human checkpoints | Major |
| Fine-tuned specialist models | Prompted off-the-shelf models | Major |
| Agent frameworks | Custom implementations | Major |
| Automated evaluation | Human-in-the-loop verification | Major |
| Public benchmarks | Domain-specific custom benchmarks | Major |

Conclusion

This empirical study of 306 practitioners provides the first large-scale evidence of what works in production AI agents. The findings are humbling for the research community: the sophisticated techniques emphasized in papers—complex planning, fine-tuning, multi-agent systems—are largely absent from successful deployments.

Instead, production success comes from engineering discipline: bounded step counts, predefined workflows, human oversight at critical decisions, and evaluation built on internal data rather than public benchmarks.

The path to production isn't building more capable agents—it's building more reliable ones. Teams that accept constraints and design for human oversight succeed where those chasing autonomy fail.

Primary Sources

Measuring Agents in Production
Pan, Arabzadeh, Cogo, Zhu, Xiong et al., December 2025