This analysis synthesizes two groundbreaking research papers to answer a critical question: when will AI agents become reliable enough for real-world deployment? By combining Toby Ord's half-life framework with METR's exponential growth analysis, we can now predict specific reliability thresholds for AI automation.
The key insight is that AI agents currently fail at a constant rate per unit time (following a half-life model similar to radioactive decay), but this failure rate is improving exponentially—doubling in capability every 7 months. This creates predictable thresholds: current models (2025) can handle 50-minute tasks at 50% reliability, but achieving 90% reliability requires limiting tasks to just 7 minutes. By 2030, we project AI will handle month-long tasks, fundamentally transforming software development and business operations.
Imagine AI agents as runners who face the same small risk of stumbling in every minute of a race. Right now, by the 50-minute mark, the odds that they are still on their feet have fallen to 50%. And here's the catch: to have a 90% chance of finishing, they can only run for about 7 minutes. The good news? Every 7 months, AI agents can run twice as far before their odds drop to the same level. So the distance that gives 50% success today will stretch to 100 minutes in 7 months, 200 minutes in 14 months, and so on.
Toby Ord's analysis introduces a powerful conceptual framework: AI agent performance follows survival analysis patterns similar to radioactive decay. Just as radioactive atoms have a constant probability of decay per unit time, AI agents have a constant probability of failure per minute of operation.
The half-life model reveals that AI systems experience what Ord calls a "constant hazard rate"—each minute of operation carries the same probability of failure, regardless of how long the agent has been running. This creates an exponential decay curve for success probability:
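$$S(t) = \left(\tfrac{1}{2}\right)^{t/T_{1/2}} = e^{-\lambda t}, \qquad \lambda = \frac{\ln 2}{T_{1/2}}$$

where $S(t)$ is the probability that the agent is still succeeding after $t$ minutes of operation and $T_{1/2}$ is the half-life, the duration at which success probability falls to 50%.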
This mathematical relationship has profound implications for reliability engineering. It means that small increases in task duration lead to dramatic decreases in reliability. More importantly, it provides a quantitative framework for understanding exactly how much we need to reduce task duration to achieve target reliability levels.
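As a minimal sketch of that calculation, assuming only the constant-hazard model above:

```python
import math

def max_task_duration(half_life_minutes: float, target_reliability: float) -> float:
    """Longest task duration (minutes) that still meets a target success
    probability under the constant-hazard model S(t) = 0.5 ** (t / half_life)."""
    # Solve 0.5 ** (t / half_life) = target for t.
    return half_life_minutes * math.log2(1.0 / target_reliability)

# With today's ~50-minute half-life, 90% reliability caps tasks at ~7.6 minutes.
print(max_task_duration(50, 0.90))   # ~7.6
print(max_task_duration(50, 0.99))   # ~0.7: 99% reliability allows under a minute
```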
The exponential decay model reveals sharp reliability cliffs that every AI engineer must understand. Using the current ~50-minute half-life as an illustration:

| Task Duration | Success Probability |
|---|---|
| 7 minutes | ~90% |
| 25 minutes | ~71% |
| 50 minutes | 50% |
| 100 minutes | 25% |
| 200 minutes | ~6% |
This explains why AI agents that seem "almost there" for long tasks still fail catastrophically: the jump from 50% to 90% success isn't a final polish. Under the half-life model, it requires cutting task duration by roughly a factor of seven.
The constant hazard rate suggests current AI systems fundamentally lack error recovery mechanisms. Unlike humans who can recognize mistakes and backtrack, AI agents compound errors forward. Once an agent makes a mistake, it rarely recovers—each subsequent action builds on the flawed foundation. This is why breaking tasks into smaller, independently verifiable chunks is so critical for current systems.
While Ord's half-life model explains why AI fails, METR's comprehensive study of 13 frontier models from 2019-2025 reveals how fast this is improving. Their research introduces a critical metric: the "50% task-completion time horizon"—the duration of tasks (measured in human time) that AI can complete with 50% reliability.
METR's methodology follows three steps:

1. Design benchmark tasks with known human completion times
2. Test both humans and AI agents on the same tasks
3. Analyze success rates versus task duration patterns
Source: METR Paper on arXiv
METR's analysis reveals remarkably consistent exponential growth: the 50% time horizon has doubled approximately every 7 months for the past six years. This isn't just incremental improvement; it's compound growth that fundamentally changes what's possible.
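A minimal sketch of what that compounding implies, assuming a ~50-minute horizon in mid-2025 and a steady 7-month doubling (both figures from the text):

```python
from datetime import date

H0_MINUTES = 50.0       # 50% time horizon in mid-2025 (from the text)
DOUBLING_MONTHS = 7.0   # observed doubling time (from the text)

def projected_horizon_minutes(target: date, anchor: date = date(2025, 7, 1)) -> float:
    """Project the 50% task-completion time horizon at a future date."""
    months_elapsed = (target.year - anchor.year) * 12 + (target.month - anchor.month)
    return H0_MINUTES * 2 ** (months_elapsed / DOUBLING_MONTHS)

for year in range(2025, 2031):
    horizon = projected_horizon_minutes(date(year, 7, 1))
    print(f"{year}: ~{horizon / 60:.1f} hours")
```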
METR's qualitative analysis also examines the key drivers of this capability improvement.
Interestingly, the improvements aren't coming from longer context windows or more parameters alone—they're emerging from better reasoning architectures and training methodologies. This suggests the trend may continue even as we approach scaling limits.
By combining both frameworks, we can now predict when specific use cases become viable. The key insight is that while the half-life (50% success duration) is growing exponentially, the reliability requirements for different applications create distinct adoption thresholds.
Combining both frameworks gives us a powerful predictive model:
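$$H(t) = H_0 \cdot 2^{(t - t_0)/D}, \qquad S(d, t) = \left(\tfrac{1}{2}\right)^{d/H(t)}$$

where $H(t)$ is the 50% time horizon at date $t$, $H_0$ is the horizon at a reference date $t_0$ (about 50 minutes in mid-2025), $D \approx 7$ months is the doubling time, and $S(d, t)$ is the expected success probability on a task that takes a skilled human $d$ minutes.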
This allows us to calculate exactly when any given task at any reliability threshold becomes feasible.
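Here is a minimal viability calculator under the same assumptions (constant hazard plus a 7-month doubling from a ~50-minute mid-2025 horizon; the function name is our own):

```python
import math

H0_MINUTES = 50.0       # 50% horizon in mid-2025 (from the text)
DOUBLING_MONTHS = 7.0   # doubling time (from the text)

def months_until_viable(task_minutes: float, target_reliability: float) -> float:
    """Months from mid-2025 until a task of the given human duration
    reaches the target success probability under the unified model."""
    # Required horizon: S = 0.5 ** (d / H)  =>  H = d / log2(1 / S)
    required_horizon = task_minutes / math.log2(1.0 / target_reliability)
    doublings_needed = math.log2(required_horizon / H0_MINUTES)
    return max(0.0, DOUBLING_MONTHS * doublings_needed)

# Bug fix automation (7 minutes at 90%) needs only a ~46-minute horizon,
# so it is viable today, matching the table below.
print(months_until_viable(7, 0.90))  # 0.0
```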
| Use Case | Required Reliability | Max Task Duration | Viable Date | Status |
|---|---|---|---|---|
| Code Review Assistant | 80% | 15 minutes | Now | Ready |
| Bug Fix Automation | 90% | 7 minutes | Now | Ready |
| Test Generation | 85% | 10 minutes | Now | Ready |
| Feature Implementation | 80% | 2 hours | 2026 | Soon |
| Full Sprint Automation | 90% | 1 day | 2028 | Future |
| Project Management | 99% | 1 week | 2030+ | Future |
The exponential nature of both curves creates what we call "reliability cliffs"—sharp transitions where tasks go from impossible to trivial. A task that's completely infeasible today might become 90% reliable just 14 months later. This creates unique challenges for organizations trying to plan their AI adoption strategies.
Consider a 4-hour task that currently has near-zero success rate. By 2027, when the 50% horizon reaches 3.5 hours, this task will suddenly achieve ~45% success. Just 7 months later, it will reach 67% success. This rapid transition from "impossible" to "reliable" will happen across thousands of business processes simultaneously.
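These figures follow directly from the decay formula; a quick check:

```python
# Success probability for a 4-hour task as the 50% horizon grows:
print(0.5 ** (4 / 3.5))  # ~0.45 when the horizon reaches 3.5 hours (~2027)
print(0.5 ** (4 / 7.0))  # ~0.67 seven months later, once the horizon doubles
```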
For 90% reliability with current frontier models, decompose all tasks into subtasks that can each be completed in under 7 minutes. This is the fundamental constraint that should guide all production AI system design today.
This constraint has profound implications for system architecture. Rather than giving an AI agent a complex task like "implement a new feature," successful systems break this into discrete, verifiable subtasks: "analyze requirements" (5 min), "design data model" (5 min), "implement model class" (7 min), "write unit tests" (5 min), etc.
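A sketch of this pattern follows. The `Subtask` structure and verifier hooks are our own illustration of the architecture, not any particular library's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    name: str
    budget_minutes: float           # keep each step under the ~7-minute threshold
    run: Callable[[], str]          # stand-in for an agent call producing an artifact
    verify: Callable[[str], bool]   # independent check before moving on

def run_pipeline(subtasks: list[Subtask]) -> bool:
    """Execute subtasks in order, verifying each result before proceeding,
    so that an early error cannot silently cascade into later steps."""
    for task in subtasks:
        if task.budget_minutes > 7:
            raise ValueError(f"'{task.name}' exceeds the 7-minute reliability budget")
        artifact = task.run()
        if not task.verify(artifact):
            print(f"Failed at '{task.name}'; stopping before errors compound.")
            return False
    return True
```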
| Pattern | Description | Reliability Improvement | Use Case |
|---|---|---|---|
| Task Decomposition | Break into <5 minute subtasks | 2-3× improvement | All complex tasks |
| Parallel Ensemble | Run 3 agents, take majority vote | 90% → 97% | Critical decisions |
| Human Checkpoints | Review every 15-30 minutes | Catches cascade failures | Long-running tasks |
| State Verification | Test after each subtask | Early error detection | Stateful operations |
| Rollback Capability | Checkpoint before changes | Recovery from failures | Production systems |
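The Parallel Ensemble row can be checked directly: with three agents each succeeding 90% of the time, a majority vote succeeds when at least two agree on the correct outcome. Independence is an assumption here; correlated failures would reduce the gain:

```python
from math import comb

def majority_vote_reliability(p: float, n: int = 3) -> float:
    """Probability that a majority of n independent agents succeed,
    each with individual success probability p."""
    k_needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_needed, n + 1))

print(majority_vote_reliability(0.90))  # ~0.972, the table's 90% -> 97%
```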
Understanding the half-life model helps identify several anti-patterns that doom AI projects to failure: assigning monolithic, hour-long tasks when 90% reliability demands minutes-long subtasks; running agents without intermediate verification, so a single early mistake cascades through everything that follows; and treating a 50%-success demo as "almost production-ready," when closing the gap to 90% requires an exponential reduction in task duration rather than a final polish.
Based on our unified model, organizations should plan their AI adoption strategy around the reliability thresholds summarized in the use-case table above: deploy the "Ready" use cases now, pilot the "Soon" ones, and track the "Future" ones as the horizon doubles.
The exponential growth creates narrow windows of competitive advantage. Organizations that adopt AI capabilities 6-12 months early in each wave gain significant advantages, but waiting too long means competitors achieve the same capabilities. The key is identifying when reliability crosses the threshold for your specific use cases.
For any critical business process:

1. Estimate how long the process takes a skilled human.
2. Decide the reliability level you actually require.
3. Use the unified model to compute when that combination becomes viable.
4. Until then, decompose the process into subtasks short enough to meet the target with today's models.
While our unified model provides valuable predictions, several limitations must be acknowledged: the constant-hazard assumption is an approximation that may not describe every task type, the 7-month doubling is an extrapolation from six years of benchmark data and may not persist as scaling limits approach, and benchmark tasks with known human completion times may understate the messiness of real-world work.
Several critical questions remain unanswered and represent active areas of research, most notably whether error-recovery mechanisms can be built that change the constant hazard rate itself, and how far beyond the benchmark regime the 7-month doubling will extend.
Our model suggests 2030 represents a critical inflection point where month-long tasks become automatable. This isn't just quantitative improvement; it's a qualitative shift in what's possible. Month-long tasks include complete feature lifecycles from specification through deployment, large-scale refactoring or migration efforts, and end-to-end research and analysis projects.
Organizations that haven't adapted their workflows by this point will face existential competitive pressure. The difference between companies leveraging month-long AI automation and those limited to day-long tasks will be similar to the current gap between digitized and paper-based businesses.
Interestingly, the half-life model provides a quantitative framework for thinking about artificial general intelligence (AGI). Human professionals can maintain performance over years-long projects. If the 7-month doubling continues, AI would match this around 2035-2040. However, this assumes no fundamental breakthroughs in error recovery or memory systems.
The convergence of exponential growth with predictable reliability thresholds creates a unique moment in history. We can now predict, with reasonable confidence, when specific cognitive tasks will become automatable. This isn't science fiction—it's engineering planning with quantifiable error bounds. Organizations, educational institutions, and governments that understand these dynamics can prepare proactively rather than react defensively.
The synthesis of Ord's half-life framework with METR's growth analysis provides the first quantitative model for predicting AI reliability thresholds. Current AI agents fail at a constant rate per minute (half-life model), but this rate improves exponentially, doubling every 7 months.
For practitioners today, the message is clear: design systems around 7-minute subtasks for 90% reliability. For strategic planners, the roadmap is equally clear: prepare now for the capabilities coming in 2-3 years, not the limitations of today.
The exponential curves create both opportunity and urgency. The organizations that understand these dynamics—that recognize we're not approaching a plateau but riding an exponential curve—will define the next decade of technological progress.
The question isn't whether AI will become reliable enough for your use case. It's whether you'll be ready when it does.
Toby Ord: "Is there a Half-Life for the Success Rates of AI Agents?"
Analysis of AI agent failure patterns using survival analysis and constant hazard rate models.
METR: "Measuring AI Ability to Complete Long Tasks"
Comprehensive study of 13 frontier models showing exponential growth in task completion capabilities.