A Definition of AGI
Quantifiable Framework for Measuring Artificial General Intelligence

Dan Hendrycks, Dawn Song, Christian Szegedy, Yoshua Bengio, Gary Marcus, Max Tegmark, Eric Schmidt, and 30+ researchers across 30 institutions
October 2025

Executive Summary

This landmark paper proposes the first rigorous, quantifiable framework for evaluating Artificial General Intelligence by grounding it in the most empirically validated model of human cognition: the Cattell-Horn-Carroll (CHC) theory. The authors define AGI as "matching the cognitive versatility and proficiency of a well-educated adult" and decompose this into ten core cognitive domains, each weighted equally at 10%.

The framework's most striking finding: current frontier models exhibit profoundly "jagged" cognitive profiles. GPT-4 scores 27% on the AGI scale while GPT-5 reaches 57%—but both score 0% on Long-Term Memory Storage, revealing a fundamental architectural bottleneck that prevents continuous learning. This single deficit forces current systems to "re-learn context in every interaction" and rely on compensatory mechanisms like massive context windows.

The paper distinguishes AGI from related concepts including Superintelligence, Pandemic AI, and Recursive AI, providing clarity that has been lacking in the field. The authors conclude that achieving a 100% AGI score remains "unlikely in the next year" due to barriers in abstract reasoning (ARC-AGI Challenge), world models, spatial navigation, hallucination reduction, and continual learning systems.

🎯 ELI5: What is AGI Really?

Imagine you're testing whether a robot can replace a well-educated office worker. You wouldn't just test if it can write emails—you'd check if it can read reports, do math, remember what you told it yesterday, recognize your face, understand your tone of voice, and think through new problems it's never seen before. This paper creates a "report card" with 10 subjects based on how human brains actually work. Current AI gets A's in some subjects (like general knowledge) but completely fails others (like remembering new things permanently). It's like a student who memorizes textbooks but forgets everything the moment the test ends—smart in some ways, but not ready to graduate.

GPT-4 and GPT-5 AGI capability comparison radar chart
Figure 1: Comparison of GPT-4 (27%) and GPT-5 (57%) across ten cognitive abilities. Note the jagged profile with complete deficits in Long-Term Memory Storage.

Part 1: The CHC-Based Definition Framework

The authors ground their AGI definition in the Cattell-Horn-Carroll (CHC) theory, which represents the most empirically validated model of human intelligence, developed over a century of psychometric research. This choice is significant—rather than creating an arbitrary AI-centric definition, the paper anchors AGI to what we know about human cognition.

The Ten Cognitive Domains

Each domain receives equal weighting of 10%, prioritizing breadth over depth in any single capability:

Ten core cognitive components diagram
Figure 2: The ten core cognitive components derived from CHC theory, forming the basis of the AGI assessment framework.

Why Equal Weighting?

The authors explicitly acknowledge that "equal weighting of broad abilities prioritizes breadth but represents one of many possible configurations." This design choice reflects a key insight: true general intelligence requires competence across all domains, not exceptional performance in just a few. An AI that scores 100% in mathematics but 0% in memory is not generally intelligent—it's a specialized tool.

Part 2: Evaluating Current Frontier Models

The paper provides illustrative AGI scores for GPT-4 and GPT-5, revealing the dramatic gaps between current capabilities and true AGI:

Cognitive Ability GPT-4 GPT-5 Gap to AGI
General Knowledge (K) 8% 9% 1-2%
Reading/Writing (RW) 6% 10% 0%
Mathematical (M) 4% 10% 0%
On-the-Spot Reasoning (R) 0% 7% 3%
Working Memory (WM) 2% 4% 6%
Long-Term Memory Storage (MS) 0% 0% 10%
Long-Term Memory Retrieval (MR) 4% 4% 6%
Visual Processing (V) 0% 4% 6%
Auditory Processing (A) 0% 6% 4%
Speed (S) 3% 3% 7%
Total AGI Score 27% 57% 43%

🚨 Critical Bottleneck: Long-Term Memory Storage at 0%

The paper identifies Long-Term Memory Storage as "perhaps the most significant bottleneck" for current AI systems. Both GPT-4 and GPT-5 score 0% in this domain, meaning they cannot permanently learn new information after deployment. This forces models to:

This architectural limitation means current AI cannot accumulate knowledge over time like humans do—a fundamental barrier to true general intelligence.

Part 3: Deep Dive into Each Cognitive Domain

General Knowledge (K)

General Knowledge domain breakdown
Figure 3: General Knowledge assessment structure spanning commonsense, science, social science, history, and culture.

Tests span commonsense reasoning, science (physics, chemistry, biology), social science (psychology, economics, geography, government), history, and culture. Benchmarks include AP exams (requiring a score of 5) and PIQA (requiring >85% accuracy). Current models perform relatively well here, with GPT-5 achieving near-ceiling performance.

Reading and Writing (RW)

Reading and Writing ability breakdown
Figure 4: Reading and Writing assessment across sentence, paragraph, and document levels.

Assessment occurs at three levels: sentence, paragraph, and document. Comprehension benchmarks include Winograd schemas, COQA, ReCoRD, and LAMBADA. Writing ability is tested via GRE Analytical Writing standards. GPT-5 achieves full marks in this domain.

Mathematical Ability (M)

Mathematical Ability breakdown
Figure 5: Mathematical assessment progression from arithmetic through advanced calculus.

Progresses from arithmetic (GSM8K >95% required) through algebra, geometry, probability, and calculus. Includes both rudimentary and proficient performance tiers using MATH dataset and AMC/AoPS problems. GPT-5's jump from 4% to 10% represents significant improvement in mathematical reasoning.

On-the-Spot Reasoning (R)

On-the-Spot Reasoning breakdown
Figure 6: Reasoning assessment covering deduction, induction, theory of mind, planning, and adaptation.

This domain covers deduction (LogiQA 2.0), induction (Raven's Progressive Matrices), theory of mind (FANToM, ToMBench), planning (Natural Plan, PlanBench), and adaptation (Wisconsin Card Sorting Test). GPT-4's 0% score here indicates fundamental reasoning limitations that GPT-5 only partially addresses.

Working Memory (WM)

Working Memory breakdown
Figure 7: Working Memory assessment across textual, auditory, visual, and cross-modal dimensions.

Tests across modalities: textual (recall and transformation), auditory (voice and sound recognition), visual (images, spatial navigation, long video Q&A via VSI-Bench), and cross-modal binding. Even GPT-5 only achieves 4%, suggesting context windows don't translate to robust working memory.

Long-Term Memory Storage (MS)

Long-Term Memory Storage breakdown
Figure 8: Long-Term Memory Storage assessment—the critical 0% bottleneck for all current models.

Assesses associative memory (cross-modal, personalization, procedural), meaningful memory (story and movie recall), and verbatim memory (sequences, sets, designs). The complete failure of current models here represents an architectural limitation: transformers cannot update their weights post-training.

Long-Term Memory Retrieval (MR)

Long-Term Memory Retrieval breakdown
Figure 9: Long-Term Memory Retrieval assessment including fluency measures and hallucination rates.

Measures fluency (ideational, expressional, alternative solutions, word, naming, figural) and precision (hallucination rates via Vectara HHEM, SimpleQA requiring <5% error rate). The 4% score reflects persistent hallucination problems even in frontier models.

Visual Processing (V)

Visual Processing breakdown
Figure 10: Visual Processing assessment covering perception, generation, reasoning, and spatial scanning.

Perception (ImageNet >85%, ImageNet-R >90%), generation (images and videos), reasoning (SPACE, SpatialViz-Bench >80%, IntPhysics 2, CharXiv), and spatial scanning. GPT-5's improvement to 4% reflects better multimodal capabilities but significant gaps remain.

Auditory Processing (A)

Auditory Processing breakdown
Figure 11: Auditory Processing assessment including speech recognition, phonetic coding, and musical ability.

Phonetic coding, speech recognition (LibriSpeech WER thresholds), voice quality recognition, rhythmic ability, and musical judgment. GPT-5's 6% score reflects native audio capabilities in multimodal models.

Speed (S)

Speed measures breakdown
Figure 12: Speed assessment covering ten distinct cognitive efficiency measures.

Ten distinct abilities: perceptual search/compare, reading/writing speed, number facility, simple/choice reaction time, inspection time, comparison speed, and pointer fluency. Both models score 3%, suggesting speed is not a primary focus of current architectures.

Part 4: Capability Contortions — How Models Mask Weaknesses

Compensatory Mechanisms in Current AI

The paper introduces the concept of "capability contortions"—techniques that mask fundamental deficits rather than solving them:

These workarounds achieve impressive results but represent engineering solutions to architectural limitations, not progress toward true AGI.

Part 5: What AGI Is Not — Related Concepts Distinguished

The paper provides crucial definitional clarity by distinguishing AGI from related but distinct concepts:

Concept Definition Relationship to AGI
Superintelligence Exceeding human performance across all cognitive domains Beyond AGI — requires >100% on framework
Pandemic AI Capable of engineering infectious pathogens Narrow capability, not general
Cyberwarfare AI Executing sophisticated cyber campaigns autonomously Narrow capability, not general
Self-Sustaining AI Operating autonomously indefinitely Operational characteristic, not cognitive
Recursive AI Conducting entire AI R&D independently Requires AGI plus specialized capabilities
Replacement AI Performing almost all human tasks more effectively Economic impact metric, not cognitive

Part 6: Barriers to Achieving 100% AGI

Unsolved Challenges

The framework identifies specific barriers that must be overcome for complete AGI:

Intelligence as a processor model
Figure 13: Conceptual model of intelligence as information processing, highlighting the knowledge-in, knowledge-out paradigm.

Part 7: Implications for AI Development

Development Priorities from This Framework

Timeline Implications

The authors conclude that achieving a 100% AGI score remains "unlikely in the next year." The 30 percentage point gap between GPT-4 and GPT-5 suggests rapid progress is possible, but the remaining 43 percentage points include the hardest challenges: long-term memory (requiring architectural breakthroughs), abstract reasoning (no clear path), and hallucination elimination (fundamental to reliability).

Conclusion

This paper represents a crucial step toward rigorous AGI discourse. By grounding the definition in established cognitive science and providing quantifiable metrics, the authors enable meaningful progress measurement and honest capability assessment.

The framework reveals that claims of imminent AGI should be viewed skeptically—even the most advanced models fail fundamental cognitive tests that well-educated adults pass routinely.

Primary Sources

A Definition of AGI
Hendrycks, Song, Szegedy, Bengio, Marcus, Tegmark, Schmidt et al., October 2025

Full HTML Version with Figures
ArXiv HTML rendering with complete methodology details