When Will AI Become Reliable?
Half-Life Analysis Meets Exponential Growth

Synthesis of research from Toby Ord (Oxford) and METR (Model Evaluation & Threat Research)
October 2025

Executive Summary

This analysis synthesizes two groundbreaking research papers to answer a critical question: when will AI agents become reliable enough for real-world deployment? By combining Toby Ord's half-life framework with METR's exponential growth analysis, we can now predict specific reliability thresholds for AI automation.

The key insight is that AI agents currently fail at a roughly constant rate per unit time (a half-life model analogous to radioactive decay), while that half-life doubles roughly every 7 months. This creates predictable thresholds: current models (2025) can handle 50-minute tasks at 50% reliability, but achieving 90% reliability requires limiting tasks to just 7 minutes. By 2030, we project AI will handle month-long tasks, fundamentally transforming software development and business operations.

🎯 ELI5: The Core Concept

Imagine AI agents are like runners who, every minute they are on the track, face the same small chance of stumbling. Right now, they can run for about 50 minutes before the odds of still being upright fall to 50%. But here's the catch: to have a 90% chance of finishing, they can only run for about 7 minutes. The good news? Every 7 months, the distance they can cover at any given success rate doubles. So a run that succeeds half the time at 50 minutes today will stretch to 100 minutes in 7 months, 200 minutes in 14 months, and so on.

Part 1: The Half-Life Framework

Toby Ord's analysis introduces a powerful conceptual framework: AI agent performance follows survival analysis patterns similar to radioactive decay. Just as radioactive atoms have a constant probability of decay per unit time, AI agents have a constant probability of failure per minute of operation.

METR results on task length
Figure 1: METR's results showing exponential growth in the length of tasks AI agents can reliably complete. Every 7 months, frontier AI agents can solve tasks approximately twice as long.

The Mathematics of Failure

The half-life model reveals that AI systems experience what Ord calls a "constant hazard rate"—each minute of operation carries the same probability of failure, regardless of how long the agent has been running. This creates an exponential decay curve for success probability:

P(success) = e^(-λt) = 2^(-t / half_life)
where λ = ln(2) / half_life
and t = task duration

This mathematical relationship has profound implications for reliability engineering. It means that small increases in task duration lead to dramatic decreases in reliability. More importantly, it provides a quantitative framework for understanding exactly how much we need to reduce task duration to achieve target reliability levels.
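A minimal sketch of this relationship in Python (the 50-minute half-life and the function names are illustrative assumptions, not code from either paper):

```python
import math

def success_probability(task_minutes: float, half_life_minutes: float = 50.0) -> float:
    """P(success) = 2^(-t / half-life): a constant hazard rate per minute."""
    return 2 ** (-task_minutes / half_life_minutes)

def max_duration(target_reliability: float, half_life_minutes: float = 50.0) -> float:
    """Longest task (in minutes) that still meets the target success rate."""
    return half_life_minutes * math.log2(1 / target_reliability)

print(success_probability(50))         # 0.5 -> the half-life itself
print(round(max_duration(0.90), 1))    # 7.6 -> the "7-minute" 90% threshold
print(round(max_duration(0.99), 1))    # 0.7 -> 99% reliability needs sub-minute tasks
```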

Critical Reliability Thresholds

The exponential decay model reveals sharp reliability cliffs that every AI engineer must understand. With today's roughly 50-minute half-life:

  50% reliability: tasks up to ~50 minutes
  80% reliability: tasks up to ~16 minutes
  90% reliability: tasks up to ~7 minutes
  99% reliability: tasks under a minute (~0.7 minutes)

This explains why AI agents that seem "almost there" for long tasks still fail catastrophically: the gap between 50% and 90% success isn't closed by a small trim, it requires cutting task duration roughly sevenfold.

Why AI Lacks Error Recovery

The constant hazard rate suggests current AI systems fundamentally lack error recovery mechanisms. Unlike humans who can recognize mistakes and backtrack, AI agents compound errors forward. Once an agent makes a mistake, it rarely recovers—each subsequent action builds on the flawed foundation. This is why breaking tasks into smaller, independently verifiable chunks is so critical for current systems.
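Decomposition pays off only when paired with verification: under a pure constant-hazard model, the chunk probabilities multiply back to the one-shot number, so the entire gain comes from catching failed chunks and retrying them. A back-of-envelope sketch, assuming perfect failure detection and independent retries (both optimistic assumptions):

```python
import math

HALF_LIFE_MIN = 50.0  # assumed current ~50-minute half-life

def p_success(minutes: float) -> float:
    return 2 ** (-minutes / HALF_LIFE_MIN)

def one_shot(total_minutes: float) -> float:
    """Attempt the whole task in a single run."""
    return p_success(total_minutes)

def chunked(total_minutes: float, chunk_minutes: float, max_tries: int = 3) -> float:
    """Split into verified chunks; a failed chunk is detected and retried."""
    p_chunk = p_success(chunk_minutes)
    p_with_retries = 1 - (1 - p_chunk) ** max_tries
    return p_with_retries ** math.ceil(total_minutes / chunk_minutes)

print(round(one_shot(60), 3))      # ~0.435 for a 1-hour task in one attempt
print(round(chunked(60, 5), 3))    # ~0.996 with verified, retried 5-minute chunks
```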

Test suite composition
Figure 2: Test suite composition across different benchmarks, showing the distribution of software engineering, cybersecurity, reasoning, and ML tasks organized by human completion time.

Part 2: METR's Exponential Growth Analysis

While Ord's half-life model explains why AI fails, METR's comprehensive study of 13 frontier models from 2019-2025 reveals how fast this is improving. Their research introduces a critical metric: the "50% task-completion time horizon"—the duration of tasks (measured in human time) that AI can complete with 50% reliability.

METR's Three-Step Methodology

1️⃣ Task Creation: Design benchmark tasks with known human completion times

2️⃣ Human/AI Evaluation: Test both humans and AI agents on the same tasks

3️⃣ Statistical Modeling: Analyze success rates vs. task duration patterns

Source: METR Paper on arXiv

Figure 3: METR's three-step methodology for measuring AI agent time horizons, from task creation through human/AI evaluation to statistical modeling.
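The statistical-modeling step can be illustrated with a toy logistic fit of success against log task length, which is the general shape of METR's approach; the data points below are invented for demonstration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-attempt records: (human task length in minutes, agent success)
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240])
successes    = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])

X = np.log2(task_minutes).reshape(-1, 1)   # model success vs. log task length
clf = LogisticRegression().fit(X, successes)

# 50% horizon: the task length where the fitted curve crosses p = 0.5,
# i.e. where the logit w*x + b equals zero.
horizon_log2 = -clf.intercept_[0] / clf.coef_[0][0]
print(f"50% time horizon ≈ {2 ** horizon_log2:.0f} minutes")
```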

The Seven-Month Doubling Time

METR's analysis reveals remarkably consistent exponential growth: the 50% time horizon has doubled approximately every 7 months for the past 6 years. This isn't just incremental improvement—it's compound growth that fundamentally changes what's possible:

AI Capability Timeline

2019: ~3 minutes (Early GPT models, basic interactions)
2021: ~12 minutes (GPT-3, short coding tasks)
2023: ~25 minutes (ChatGPT/Claude, moderate complexity)
2025 (Current): ~50 minutes (Claude 3.7 Sonnet, complex tasks)
2027 (Projected): ~3.5 hours (Half-day automation)
2030 (Projected): ~1 month (Project-level automation)

What's Driving the Improvements?

METR's qualitative analysis identifies three key drivers of capability improvement:

Capability Drivers

  1. Enhanced Logical Reasoning: Better chain-of-thought processing, multi-step planning, and problem decomposition abilities
  2. Improved Tool Utilization: More sophisticated use of APIs, code execution environments, and external resources
  3. Greater Reliability: Better adaptation to unexpected outputs and self-correction mechanisms (though still limited)

Interestingly, the improvements aren't coming from longer context windows or more parameters alone—they're emerging from better reasoning architectures and training methodologies. This suggests the trend may continue even as we approach scaling limits.

Part 3: Synthesis - Predicting Reliability Thresholds

By combining both frameworks, we can now predict when specific use cases become viable. The key insight is that while the half-life (50% success duration) is growing exponentially, the reliability requirements for different applications create distinct adoption thresholds.

The Unified Model

Combining both frameworks gives us a powerful predictive model:

Task_Duration(year) = 50 min × 2^((year - 2025) × 12 / 7)
Reliability(duration, year) = 2^(-duration / Task_Duration(year))

This allows us to calculate exactly when any given task at any reliability threshold becomes feasible.
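In code, with the same illustrative constants (a 50-minute horizon in 2025 and a 7-month doubling), the model and its inverse look like this sketch:

```python
import math

def horizon_minutes(year: float) -> float:
    """50% time horizon: ~50 minutes in 2025, doubling every 7 months."""
    return 50 * 2 ** ((year - 2025) * 12 / 7)

def reliability(duration_minutes: float, year: float) -> float:
    """Predicted success probability for a task of the given length."""
    return 2 ** (-duration_minutes / horizon_minutes(year))

def viable_year(duration_minutes: float, required_reliability: float) -> float:
    """First year the model predicts the task clears the reliability bar."""
    needed_horizon = duration_minutes / math.log2(1 / required_reliability)
    return 2025 + (7 / 12) * math.log2(needed_horizon / 50)

print(round(reliability(60, 2025), 2))    # ~0.44: a 1-hour task today
print(round(viable_year(120, 0.80), 1))   # ~2026.7: the feature-implementation row below
```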

Reliability Projections by Use Case

| Use Case | Required Reliability | Max Task Duration | Viable Date | Status |
|---|---|---|---|---|
| Code Review Assistant | 80% | 15 minutes | Now | Ready |
| Bug Fix Automation | 90% | 7 minutes | Now | Ready |
| Test Generation | 85% | 10 minutes | Now | Ready |
| Feature Implementation | 80% | 2 hours | 2026 | Soon |
| Full Sprint Automation | 90% | 1 day | 2028 | Future |
| Project Management | 99% | 1 week | 2030+ | Future |

The Reliability Cliff Phenomenon

The exponential nature of both curves creates what we call "reliability cliffs"—sharp transitions where tasks go from impossible to trivial. A task that's completely infeasible today might become 90% reliable just 14 months later. This creates unique challenges for organizations trying to plan their AI adoption strategies.

Consider a 4-hour task that currently has a near-zero success rate. By 2027, when the 50% horizon reaches 3.5 hours on the timeline above, this task will suddenly achieve ~45% success. Just 7 months later, it will reach ~67% success. This rapid transition from "impossible" to "reliable" will happen across thousands of business processes simultaneously.

Part 4: Practical Engineering Implications

The Seven-Minute Rule

Current Best Practice (2025)

For 90% reliability with current frontier models, decompose all tasks into subtasks that can complete in under 7 minutes. This is the fundamental constraint that should guide all production AI system design today.

This constraint has profound implications for system architecture. Rather than giving an AI agent a complex task like "implement a new feature," successful systems break this into discrete, verifiable subtasks: "analyze requirements" (5 min), "design data model" (5 min), "implement model class" (7 min), "write unit tests" (5 min), etc.

Architecture Patterns for Current Reliability

| Pattern | Description | Reliability Improvement | Use Case |
|---|---|---|---|
| Task Decomposition | Break into <5 minute subtasks | 2-3× improvement | All complex tasks |
| Parallel Ensemble | Run 3 agents, take majority vote | 90% → 97% | Critical decisions |
| Human Checkpoints | Review every 15-30 minutes | Catches cascade failures | Long-running tasks |
| State Verification | Test after each subtask | Early error detection | Stateful operations |
| Rollback Capability | Checkpoint before changes | Recovery from failures | Production systems |
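The Parallel Ensemble row can be sanity-checked with the binomial majority-vote formula, assuming agent failures are independent (correlated errors in practice will erode the gain):

```python
from math import comb

def majority_vote(p: float, n: int = 3) -> float:
    """P(a strict majority of n independent agents succeed), n odd."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(round(majority_vote(0.90, 3), 3))   # 0.972: the "90% -> 97%" row above
```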

Common Anti-Patterns to Avoid

Understanding the half-life model helps identify several anti-patterns that doom AI projects to failure:

  1. The Marathon Task: Giving AI agents hours-long tasks without decomposition. With current models, a 3-hour task succeeds only ~8% of the time.
  2. The Context Stuffing: Believing that larger context windows solve reliability. The half-life model shows failure rate is time-based, not context-based.
  3. The Success Extrapolation: Assuming success rates hold as tasks get longer, e.g. that 80% on 10-minute tasks means roughly 80% on 20-minute tasks. Under constant-hazard decay, success compounds multiplicatively: 64% at 20 minutes and only ~41% at 40 minutes (see the sketch below).
  4. The Single Point of Failure: Running one agent for critical tasks. Use ensemble methods for important operations.
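A three-line check of the extrapolation trap, using the observed 10-minute success rate as the only input:

```python
p_10min = 0.80                      # measured success rate on 10-minute tasks
for minutes in (10, 20, 40, 80):
    # constant hazard: success compounds multiplicatively with duration
    print(minutes, "min ->", round(p_10min ** (minutes / 10), 2))
# 10 -> 0.8, 20 -> 0.64, 40 -> 0.41, 80 -> 0.17
```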

Part 5: Organizational Readiness Timeline

Based on our unified model, organizations should plan their AI adoption strategy around these reliability thresholds:

Investment Roadmap

Now (2025):
Invest in: Code review, test generation, documentation, bug triage
ROI: 20-40% developer productivity gains
Requirements: Human oversight, task decomposition

2026-2027:
Prepare for: Feature implementation, refactoring, multi-file changes
ROI: 2-3× developer velocity on routine tasks
Requirements: Robust testing infrastructure, rollback systems

2028-2030:
Plan for: Sprint automation, complex debugging, system design
ROI: 10× productivity on well-defined projects
Requirements: New development workflows, AI-first architecture

The Competitive Advantage Window

The exponential growth creates narrow windows of competitive advantage. Organizations that adopt AI capabilities 6-12 months early in each wave gain significant advantages, but waiting too long means competitors achieve the same capabilities. The key is identifying when reliability crosses the threshold for your specific use cases.

Strategic Planning Framework

For any critical business process:

  1. Measure the actual time skilled humans take to complete it
  2. Determine your required reliability threshold
  3. Use the model to predict when AI will achieve that threshold (as in the sketch after this list)
  4. Begin preparation 12-18 months before the threshold date
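For example, this sketch applies the four steps to a hypothetical process measured at 6 person-hours with a 95% reliability requirement (both numbers invented for illustration):

```python
import math

def viable_year(duration_minutes: float, required_reliability: float,
                horizon_2025_min: float = 50.0, doubling_months: float = 7.0) -> float:
    """Year when the projected 50% horizon is long enough to hit the target."""
    needed_horizon = duration_minutes / math.log2(1 / required_reliability)
    return 2025 + (doubling_months / 12) * math.log2(needed_horizon / horizon_2025_min)

threshold = viable_year(6 * 60, 0.95)   # steps 1-3: measured time, target, prediction
print(f"Threshold year ≈ {threshold:.1f}; begin preparation by {threshold - 1.5:.1f}")  # step 4
```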

Part 6: Limitations and Open Questions

Model Limitations

While our unified model provides valuable predictions, several limitations must be acknowledged:

Known Limitations

  1. The constant hazard rate is an approximation; real agents may fail faster as errors accumulate within a long run.
  2. METR's benchmark tasks are cleaner and more self-contained than messy real-world work, so measured horizons may overstate field performance.
  3. All projections assume the 7-month doubling continues unchanged, which is itself an open question (see below).
  4. Human baseline completion times vary widely between individuals, adding noise to the horizon estimates.

Open Research Questions

Several critical questions remain unanswered and represent active areas of research:

  1. Can we change the hazard rate itself? Current models have a constant failure rate per minute. Can architectural innovations produce decreasing hazard rates, where the per-minute failure risk falls the longer the agent runs successfully? (See the sketch after this list.)
  2. What happens at the scaling limit? Will the 7-month doubling continue, slow down, or hit a wall?
  3. How do we measure creative tasks? The model works well for well-defined tasks but struggles with open-ended creative work.
  4. Can memory systems break the half-life constraint? Could external memory or retrieval systems fundamentally change the failure dynamics?
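Question 1 can be made concrete with a Weibull survival curve, a standard generalization of the constant-hazard model; the parameterization here is my own illustration, not from either paper:

```python
import math

def survival(t_minutes: float, half_life: float = 50.0, k: float = 1.0) -> float:
    """Weibull survival: k = 1 reproduces the constant-hazard half-life model,
    k < 1 means the per-minute failure risk falls the longer the agent runs."""
    scale = half_life / math.log(2) ** (1 / k)   # calibrated so S(half_life) = 0.5
    return math.exp(-((t_minutes / scale) ** k))

for k in (1.0, 0.5):
    print(k, [round(survival(t, k=k), 2) for t in (50, 100, 200)])
# k=1.0: [0.5, 0.25, 0.06] -> success keeps halving with each doubling of length
# k=0.5: [0.5, 0.38, 0.25] -> long runs decay far more gently
```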

Part 7: Future Implications

The 2030 Inflection Point

Our model suggests 2030 represents a critical inflection point where month-long tasks become automatable. This isn't just quantitative improvement; it's a qualitative shift in what's possible. Month-long tasks include work like taking a product feature from specification to deployment, executing a major system migration, or completing a full security audit.

Organizations that haven't adapted their workflows by this point will face existential competitive pressure. The difference between companies leveraging month-long AI automation and those limited to day-long tasks will be similar to the current gap between digitized and paper-based businesses.

The Path to AGI?

Interestingly, the half-life model provides a quantitative framework for thinking about artificial general intelligence (AGI). Human professionals can maintain performance over years-long projects. If the 7-month doubling continues, AI would match this around 2035-2040. However, this assumes no fundamental breakthroughs in error recovery or memory systems.

What This Means for Society

The convergence of exponential growth with predictable reliability thresholds creates a unique moment in history. We can now predict, with reasonable confidence, when specific cognitive tasks will become automatable. This isn't science fiction—it's engineering planning with quantifiable error bounds. Organizations, educational institutions, and governments that understand these dynamics can prepare proactively rather than react defensively.

Conclusion

The synthesis of Ord's half-life framework with METR's growth analysis provides the first quantitative model for predicting AI reliability thresholds. Current AI agents fail at a constant rate per minute (half-life model), but this rate improves exponentially, doubling every 7 months.

For practitioners today, the message is clear: design systems around 7-minute subtasks for 90% reliability. For strategic planners, the roadmap is equally clear: prepare now for the capabilities coming in 2-3 years, not the limitations of today.

The exponential curves create both opportunity and urgency. The organizations that understand these dynamics—that recognize we're not approaching a plateau but riding an exponential curve—will define the next decade of technological progress.

The question isn't whether AI will become reliable enough for your use case. It's whether you'll be ready when it does.

Primary Sources

Toby Ord: "Is there a Half-Life for the Success Rates of AI Agents?"
Analysis of AI agent failure patterns using survival analysis and constant hazard rate models.

METR: "Measuring AI Ability to Complete Long Tasks"
Comprehensive study of 13 frontier models showing exponential growth in task completion capabilities.