This landmark empirical study surveys 306 practitioners and conducts 20 in-depth interviews across 26 application domains to understand how AI agents actually work in production. The findings challenge common assumptions: 68% of production agents execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation rather than automated benchmarks.
The core insight is that successful production teams deliberately trade capability for controllability. They accept limited autonomy in exchange for reliability, using constrained architectures with predefined workflows (80% of deployments) rather than open-ended autonomous agents. Despite the common claim that "95% of agent deployments fail," this research shows practitioners successfully deploy reliable systems serving real users through environmental and operational constraints.
Imagine the difference between a self-driving car in a research demo versus one actually transporting passengers. Research demos show cars navigating complex cities autonomously. But real taxi services? They run on fixed routes, have human backup drivers, and pull over if anything seems off. Production AI agents work the same way—they're deliberately "dumbed down" to be predictable and safe, not impressive and autonomous. Teams discovered that a reliable agent doing 5 simple steps is worth more than an impressive one that fails unpredictably on step 47.
Agent research has produced remarkable demonstrations—systems that browse the web, write code, and complete multi-step tasks autonomously. Yet the gap between research benchmarks and production deployment remains vast. This study provides the first large-scale empirical evidence of what actually works when AI agents serve real users.
The most striking finding is that production agents are deliberately constrained:

- 68% execute at most 10 steps before requiring human intervention
- 80% run predefined workflows rather than open-ended autonomous loops
- 70% prompt off-the-shelf models rather than tuning weights
- 74% rely primarily on human evaluation rather than automated benchmarks

Teams have learned that controllability beats capability: a reliable 5-step agent delivers more value than an unreliable 50-step one.
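A minimal sketch of what this kind of constraint looks like in code is below. The hard step budget echoes the 10-step finding, while `call_model`, `run_tool`, and `request_human_approval` are hypothetical stand-ins, not APIs described in the study.

```python
# Sketch of a deliberately constrained agent loop: a hard step budget plus a
# human checkpoint before any irreversible action. The callables passed in
# (call_model, run_tool, request_human_approval) are illustrative stand-ins.

MAX_STEPS = 10  # mirrors the finding that 68% of agents stop within 10 steps

def run_constrained_agent(task, call_model, run_tool, request_human_approval):
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        action = call_model(history)          # prompted off-the-shelf model
        if action["type"] == "finish":
            return action["answer"]
        if action.get("irreversible"):        # e.g. sending email, moving money
            if not request_human_approval(action):
                return "Escalated to a human operator."
        result = run_tool(action["name"], action["args"])
        history.append({"role": "tool", "content": result})
    # Budget exhausted: hand off instead of continuing autonomously.
    return "Step budget exhausted; escalated to a human operator."
```

The point is structural: the agent cannot run away on its own, and anything consequential pauses for a person.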
Contrary to research's focus on coding assistants, production agents serve diverse industries: 92.5% serve human end-users rather than other systems, and 66% tolerate response times of minutes or longer, suggesting complex, high-value tasks rather than real-time interactions.
| Domain | Percentage | Key Use Cases |
|---|---|---|
| Finance | 39.1% | Document processing, compliance, risk analysis |
| Technology | 24.6% | DevOps, code review, infrastructure management |
| Corporate Services | 23.2% | HR workflows, procurement, internal tools |
| Healthcare | 7.2% | Clinical documentation, research synthesis |
| Other | 5.9% | Legal, retail, manufacturing, education |
The study reveals surprisingly simple technical approaches dominate production. The sophisticated techniques emphasized in research—fine-tuning, complex planning, multi-agent systems—are rarely used in successful deployments.
Despite the proliferation of agent frameworks (LangChain, AutoGPT, CrewAI), 85% of successful production systems are custom-built. Interview participants repeatedly pointed to the difficulty of debugging framework abstractions.
As one practitioner noted: "We tried [framework X] but couldn't debug failures. We rebuilt in plain Python in a week and never looked back."
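A custom loop of the kind this practitioner describes can be remarkably small. The sketch below is hypothetical (the tool registry and the `llm_decide` callable are assumptions, not the team's code), but it illustrates why debugging stays tractable: every step is an ordinary function call that can be logged and stepped through.

```python
# Hypothetical "plain Python" agent core: an explicit tool registry and a loop
# with no framework abstractions, so every failure has an ordinary stack trace.

import json
import logging

logging.basicConfig(level=logging.INFO)

TOOLS = {
    "search_docs": lambda query: f"(stub) results for {query!r}",
    "create_ticket": lambda title: f"(stub) ticket created: {title}",
}

def agent_step(llm_decide, history):
    """One transparent step: ask the model, run the named tool, log both."""
    decision = llm_decide(history)            # returns {"tool": ..., "args": {...}}
    logging.info("model decision: %s", json.dumps(decision))
    tool = TOOLS[decision["tool"]]            # a KeyError here is easy to trace
    result = tool(**decision["args"])
    logging.info("tool result: %s", result)
    history.append({"decision": decision, "result": result})
    return history
```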
Perhaps the most concerning finding: production agent evaluation remains immature. Teams struggle to measure whether their agents actually work, relying heavily on human judgment rather than automated metrics.
Research emphasizes benchmark performance (SWE-bench, WebArena, etc.), but production teams find these largely irrelevant:
"Production tasks are highly domain-specific; public benchmarks are rarely applicable."
| Evaluation Method | Usage Rate | Notes |
|---|---|---|
| Human-in-the-loop | 74% | Primary method for most teams |
| LLM-as-judge | 52% | Always with human verification |
| A/B testing | ~50% | Online evaluation preferred |
| Custom benchmarks | 25% | Built from internal data |
| Public benchmarks | <10% | Used only in early development |
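To make the LLM-as-judge-with-human-verification pattern concrete, here is one possible shape for it. The `judge_model` callable, the 0-to-1 scoring prompt, and the 0.5 triage threshold are illustrative assumptions rather than details from the paper.

```python
# Sketch: LLM-as-judge scoring with every verdict routed through human review.
# judge_model is an assumed callable returning {"score": float, "reason": str}.

from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    pending: list = field(default_factory=list)

    def add(self, item):
        self.pending.append(item)

def evaluate_transcript(transcript, judge_model, review_queue):
    verdict = judge_model(
        f"Rate this agent transcript from 0 to 1 for task success:\n{transcript}"
    )
    # The LLM verdict is only a triage signal; a human confirms every case,
    # with low scores flagged for priority review.
    review_queue.add({
        "transcript": transcript,
        "llm_score": verdict["score"],
        "llm_reason": verdict["reason"],
        "priority": "high" if verdict["score"] < 0.5 else "normal",
    })
    return verdict["score"]
```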
The study identifies reliability as the dominant challenge (37.9% cite it as their top technical focus), followed by evaluation difficulties. Interestingly, latency, which research often emphasizes, is a critical blocker for only 14.8% of teams.
Rather than solving reliability through better models, successful teams constrain the problem: they bound the number of steps, predefine workflows, and insert human checkpoints before consequential actions.
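As a sketch of what such a constrained design can look like, the hypothetical workflow below replaces an open-ended loop with a fixed sequence of validated stages; the stage names and validators are invented for illustration.

```python
# Sketch of a predefined workflow: the model fills in individual stages, but the
# overall control flow is fixed ahead of time and each output is validated.
# Stage names and validators are illustrative assumptions.

WORKFLOW = [
    ("extract_fields", lambda out: isinstance(out, dict) and "amount" in out),
    ("check_policy",   lambda out: out in ("approve", "reject", "escalate")),
    ("draft_response", lambda out: isinstance(out, str) and len(out) < 2000),
]

def run_workflow(document, call_model):
    context = {"document": document}
    for stage, is_valid in WORKFLOW:
        output = call_model(stage, context)   # model handles one bounded subtask
        if not is_valid(output):
            # Fail closed: stop and escalate rather than improvising a recovery.
            return {"status": "escalated", "failed_stage": stage}
        context[stage] = output
    return {"status": "completed", **context}
```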
The study reveals a consistent engineering philosophy: successful production agents deliberately sacrifice capability for reliability. Teams that tried to build highly autonomous agents failed; teams that built constrained, predictable systems succeeded. This mirrors patterns in other high-stakes engineering domains—aviation, medical devices, nuclear systems—where reliability trumps capability.
This research suggests the agent field may be over-indexing on autonomy benchmarks while under-investing in reliability engineering. The path to production isn't more capable agents—it's more controllable ones.
| Research Focus | Production Reality | Gap |
|---|---|---|
| Long autonomous trajectories | ≤10 steps with human checkpoints | Major |
| Fine-tuned specialist models | Prompted off-the-shelf models | Major |
| Agent frameworks | Custom implementations | Major |
| Automated evaluation | Human-in-the-loop verification | Major |
| Public benchmarks | Domain-specific custom benchmarks | Major |
This empirical study of 306 practitioners provides the first large-scale evidence of what works in production AI agents. The findings are humbling for the research community: the sophisticated techniques emphasized in papers—complex planning, fine-tuning, multi-agent systems—are largely absent from successful deployments.
Instead, production success comes from engineering discipline: constrained, predefined workflows, human oversight at critical steps, custom implementations that teams can debug, and evaluation grounded in human judgment.
The path to production isn't building more capable agents—it's building more reliable ones. Teams that accept constraints and design for human oversight succeed where those chasing autonomy fail.
*Measuring Agents in Production*, Pan, Arabzadeh, Cogo, Zhu, Xiong et al., December 2025.