Measuring Agents in Production
An Empirical Study of 306 Practitioners

Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong et al.
UC Berkeley, Stanford, IBM Research, UIUC, Intesa Sanpaolo
December 2025

Executive Summary

This landmark empirical study surveys 306 practitioners and conducts 20 in-depth interviews across 26 application domains to understand how AI agents actually work in production. The findings challenge common assumptions: 68% of production agents execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation rather than automated benchmarks.

The core insight is that successful production teams deliberately trade capability for controllability. They accept limited autonomy in exchange for reliability, using constrained architectures with predefined workflows (80% of deployments) rather than open-ended autonomous agents. Despite the common claim that "95% of agent deployments fail," this research shows practitioners successfully deploy reliable systems serving real users through environmental and operational constraints.

🎯 ELI5: Production Agents vs Research Agents

Imagine the difference between a self-driving car in a research demo versus one actually transporting passengers. Research demos show cars navigating complex cities autonomously. But real taxi services? They run on fixed routes, have human backup drivers, and pull over if anything seems off. Production AI agents work the same way—they're deliberately "dumbed down" to be predictable and safe, not impressive and autonomous. Teams discovered that a reliable agent doing 5 simple steps is worth more than an impressive one that fails unpredictably on step 47.

Figure 1: Why organizations build agents—73% cite productivity and efficiency gains as the primary motivation, far outpacing automation (54.1%) or cost reduction (43.2%).

Part 1: The Reality Gap in Agent Research

Agent research has produced remarkable demonstrations—systems that browse the web, write code, and complete multi-step tasks autonomously. Yet the gap between research benchmarks and production deployment remains vast. This study provides the first large-scale empirical evidence of what actually works when AI agents serve real users.

The Autonomy-Reliability Tradeoff

The most striking finding is that production agents are deliberately constrained: 68% execute at most 10 steps before handing control back to a human, and 80% run predefined workflows rather than open-ended plans.

Teams have learned that controllability beats capability. A reliable 5-step agent delivers more value than an unreliable 50-step one.
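
As a concrete illustration, here is a minimal Python sketch of a constrained agent loop: a hard step budget and a human checkpoint before any risky action. It illustrates the architecture the study describes rather than any surveyed team's code; `plan_next_step` and `approve` are hypothetical placeholders.

```python
from dataclasses import dataclass

MAX_STEPS = 10  # hard budget: the loop never plans past this many steps

@dataclass
class Step:
    action: str
    needs_approval: bool = False  # e.g. anything that writes or sends data

def plan_next_step(task: str, history: list) -> "Step | None":
    """Placeholder for a model call that proposes the next step, or None when done."""
    return None

def run_constrained_agent(task: str, approve) -> list:
    """Run at most MAX_STEPS steps, pausing for a human before risky actions."""
    history: list[Step] = []
    for _ in range(MAX_STEPS):
        step = plan_next_step(task, history)
        if step is None:                       # the model signals completion
            break
        if step.needs_approval and not approve(step):
            break                              # escalate to a human instead of acting
        history.append(step)                   # a real system would execute the step here
    return history

# Usage: approve() could ping an operator; here every risky step is rejected.
run_constrained_agent("summarize this quarter's invoices", approve=lambda step: False)
```

The budget and the approval hook carry the safety story: the agent cannot wander past MAX_STEPS, and anything consequential waits for a person.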

Figure 2: Distribution across 26 application domains—finance (39.1%), technology (24.6%), and corporate services (23.2%) lead adoption, but agents span healthcare, legal, retail, and beyond.

Who's Actually Using Agents?

Contrary to the research focus on coding assistants, production agents serve diverse industries. Among surveyed deployments, 92.5% serve human end-users rather than other systems, and 66% tolerate response times of minutes or longer, suggesting complex, high-value tasks rather than real-time interactions.

| Domain | Percentage | Key Use Cases |
|---|---|---|
| Finance | 39.1% | Document processing, compliance, risk analysis |
| Technology | 24.6% | DevOps, code review, infrastructure management |
| Corporate Services | 23.2% | HR workflows, procurement, internal tools |
| Healthcare | 7.2% | Clinical documentation, research synthesis |
| Other | 5.9% | Legal, retail, manufacturing, education |

Part 2: Technical Patterns That Work

The study reveals surprisingly simple technical approaches dominate production. The sophisticated techniques emphasized in research—fine-tuning, complex planning, multi-agent systems—are rarely used in successful deployments.

Figure 3: Model selection in production—70% use off-the-shelf models without weight tuning. Prompting dominates over fine-tuning.
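
As a concrete illustration of "prompting instead of weight tuning", the sketch below steers an off-the-shelf model with a domain-specific system prompt and a few-shot example. The `complete` function is a stand-in for whichever hosted-model API a team uses; the prompt and example content are illustrative, not taken from the study.

```python
SYSTEM_PROMPT = (
    "You extract structured fields from finance documents. "
    "Answer with JSON only. If a field is missing, use null."
)

FEW_SHOT = [
    ("Invoice total is $1,200, due 2024-06-01.",
     '{"total": 1200, "due": "2024-06-01"}'),
]

def build_messages(document: str) -> list[dict]:
    """Assemble a chat-style message list: system prompt, examples, then the task."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for doc, answer in FEW_SHOT:  # few-shot examples stand in for fine-tuning
        messages.append({"role": "user", "content": doc})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": document})
    return messages

def complete(messages: list[dict]) -> str:
    """Placeholder for a hosted-model call; no weights are trained anywhere."""
    return "{}"

complete(build_messages("Invoice total is $980, due 2024-07-15."))
```

In this pattern, changing the agent's behavior means editing text and redeploying a prompt, not running a training job.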

What Production Teams Actually Use

The Framework Paradox

Despite the proliferation of agent frameworks (LangChain, AutoGPT, CrewAI), 85% of successful production systems are custom-built. Interview participants cited three reasons:

As one practitioner noted: "We tried [framework X] but couldn't debug failures. We rebuilt in plain Python in a week and never looked back."
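
The quote describes rebuilding in plain Python; below is a minimal sketch of what that can look like: an explicit tool registry and a workflow runner where every call is logged and can be stepped through in a debugger. The tools and the plan here are hypothetical placeholders.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

# Hypothetical tools; in production each would wrap a real internal API.
TOOLS = {
    "lookup_customer": lambda args: {"name": "ACME", "tier": "gold"},
    "draft_reply": lambda args: {"text": f"Hello {args['name']}"},
}

def run_workflow(plan: list[dict]) -> list[dict]:
    """Execute a predefined workflow; every call is logged and inspectable."""
    results = []
    for call in plan:
        log.info("calling %s with %s", call["tool"], json.dumps(call["args"]))
        results.append(TOOLS[call["tool"]](call["args"]))
    return results

run_workflow([
    {"tool": "lookup_customer", "args": {"id": 42}},
    {"tool": "draft_reply", "args": {"name": "ACME"}},
])
```

There is nothing clever here, which is the point: when something fails, the stack trace points at a specific line rather than at a framework internal.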

Figure 4: Framework adoption patterns—custom implementations dominate (85%), with LangChain leading among framework users. Most teams avoid framework abstractions for production reliability.

Part 3: The Evaluation Crisis

Perhaps the most concerning finding: production agent evaluation remains immature. Teams struggle to measure whether their agents actually work, relying heavily on human judgment rather than automated metrics.

Figure 5: Evaluation practices in production—74% rely primarily on human-in-the-loop evaluation. Only 25% use any formal benchmarks.

The Benchmark Gap

Research emphasizes benchmark performance (SWE-bench, WebArena, etc.), but production teams find these largely irrelevant:

"Production tasks are highly domain-specific; public benchmarks are rarely applicable."

| Evaluation Method | Usage Rate | Notes |
|---|---|---|
| Human-in-the-loop | 74% | Primary method for most teams |
| LLM-as-judge | 52% | Always with human verification |
| A/B testing | ~50% | Online evaluation preferred |
| Custom benchmarks | 25% | Built from internal data |
| Public benchmarks | <10% | Used only in early development |
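
The top two rows of the table combine naturally: score every output with an LLM judge, and route a random sample to human reviewers so judge scores are never trusted on their own. A minimal sketch, with `llm_judge` as a hypothetical placeholder for a judge-model call:

```python
import random

def llm_judge(task: str, output: str) -> float:
    """Hypothetical judge-model call returning a quality score in [0, 1]."""
    return 0.5

def evaluate(records: list[dict], human_sample_rate: float = 0.1) -> dict:
    """Score every record with the judge; queue a random sample for humans."""
    scores, human_queue = [], []
    for rec in records:
        scores.append(llm_judge(rec["task"], rec["output"]))
        if random.random() < human_sample_rate:  # human-in-the-loop spot check
            human_queue.append(rec)
    mean = sum(scores) / max(len(scores), 1)
    return {"judge_mean": mean, "human_queue": human_queue}

evaluate([{"task": "summarize contract", "output": "..."}])
```

The records themselves double as a custom benchmark: a growing set of internal tasks and known-good outputs, rather than a public leaderboard.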

Part 4: Top Challenges and Solutions

The study identifies reliability as the dominant challenge (37.9% cite it as their top technical focus), followed by evaluation difficulties. Interestingly, latency, which research often emphasizes, is cited as a critical blocker by only 14.8% of teams.

Figure 6: Impact of latency on deployment decisions—only 14.8% cite it as a critical blocker. Most production use cases tolerate minutes-long response times.

How Teams Achieve Reliability

Rather than solving reliability through better models, successful teams constrain the problem: they cap step counts, encode predefined workflows instead of open-ended plans, and put human checkpoints in front of consequential actions.

The Engineering Pattern: Trade Capability for Controllability

The study reveals a consistent engineering philosophy: successful production agents deliberately sacrifice capability for reliability. Teams that tried to build highly autonomous agents failed; teams that built constrained, predictable systems succeeded. This mirrors patterns in other high-stakes engineering domains—aviation, medical devices, nuclear systems—where reliability trumps capability.
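
One concrete way to constrain the problem is to treat every model output as untrusted: validate it against an expected schema, retry within a fixed budget, and escalate to a human instead of failing silently. A minimal sketch under those assumptions, with `generate` standing in for a model call:

```python
import json

def generate(prompt: str) -> str:
    """Placeholder for a model call that should return JSON."""
    return '{"status": "ok"}'

def is_valid(raw: str, required: set) -> bool:
    """Check that the output parses as JSON and contains the required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required <= data.keys()

def reliable_call(prompt: str, required: set, retries: int = 2) -> dict:
    """Bounded retries, then escalation: the agent never acts on unvalidated output."""
    for _ in range(retries + 1):
        raw = generate(prompt)
        if is_valid(raw, required):
            return json.loads(raw)
    return {"escalate_to_human": True, "prompt": prompt}

reliable_call("extract the invoice total", required={"status"})
```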

Part 5: Implications for Agent Development

This research suggests the agent field may be over-indexing on autonomy benchmarks while under-investing in reliability engineering. The path to production isn't more capable agents—it's more controllable ones.

Recommendations for Practitioners

| Research Focus | Production Reality | Gap |
|---|---|---|
| Long autonomous trajectories | ≤10 steps with human checkpoints | Major |
| Fine-tuned specialist models | Prompted off-the-shelf models | Major |
| Agent frameworks | Custom implementations | Major |
| Automated evaluation | Human-in-the-loop verification | Major |
| Public benchmarks | Domain-specific custom benchmarks | Major |

Conclusion

This empirical study of 306 practitioners provides the first large-scale evidence of what works in production AI agents. The findings are humbling for the research community: the sophisticated techniques emphasized in papers—complex planning, fine-tuning, multi-agent systems—are largely absent from successful deployments.

Instead, production success comes from engineering discipline: bounded step counts, predefined workflows, human oversight at critical decisions, and evaluation built on internal data rather than public benchmarks.

The path to production isn't building more capable agents—it's building more reliable ones. Teams that accept constraints and design for human oversight succeed where those chasing autonomy fail.

Primary Sources

Measuring Agents in Production
Pan, Arabzadeh, Cogo, Zhu, Xiong et al., December 2025