A Definition of AGI - Quantifiable Framework for Measuring Artificial General Intelligence

Executive Summary

This paper proposes a quantifiable framework for evaluating Artificial General Intelligence, grounded in Cattell-Horn-Carroll (CHC) theory, which the authors describe as the most empirically validated model of human cognition. The authors define AGI as "matching the cognitive versatility and proficiency of a well-educated adult" and decompose this into ten core cognitive domains, each weighted equally at 10%.

The framework's most striking finding: current frontier models exhibit profoundly "jagged" cognitive profiles. GPT-4 scores 27% on the AGI scale while GPT-5 reaches 57%—but both score 0% on Long-Term Memory Storage, revealing a fundamental architectural bottleneck that prevents continuous learning. This single deficit forces current systems to "re-learn context in every interaction" and rely on compensatory mechanisms like massive context windows.

The paper distinguishes AGI from related concepts including Superintelligence, Pandemic AI, and Recursive AI, providing clarity that has been lacking in the field. The authors conclude that achieving a 100% AGI score remains "unlikely in the next year" due to barriers in abstract reasoning (ARC-AGI Challenge), world models, spatial navigation, hallucination reduction, and continual learning systems.

🎯 ELI5: What is AGI Really?

Imagine you're testing whether a robot can replace a well-educated office worker. You wouldn't just test if it can write emails—you'd check if it can read reports, do math, remember what you told it yesterday, recognize your face, understand your tone of voice, and think through new problems it's never seen before. This paper creates a "report card" with 10 subjects based on how human brains actually work. Current AI gets A's in some subjects (like general knowledge) but completely fails others (like remembering new things permanently). It's like a student who memorizes textbooks but forgets everything the moment the test ends—smart in some ways, but not ready to graduate.

Part 1: The CHC-Based Definition Framework

The authors ground their AGI definition in the Cattell-Horn-Carroll (CHC) theory, which represents the most empirically validated model of human intelligence, developed over a century of psychometric research. This choice is significant—rather than creating an arbitrary AI-centric definition, the paper anchors AGI to what we know about human cognition.

The Ten Cognitive Domains

Each domain receives equal weighting of 10%, prioritizing breadth over depth in any single capability:

General Knowledge (K): Breadth of factual understanding across domains
Reading and Writing (RW): Language consumption and production at sentence, paragraph, and document levels
Mathematical Ability (M): From arithmetic through calculus
On-the-Spot Reasoning (R): Novel problem-solving without relying on learned schemas
Working Memory (WM): Information maintenance in active attention
Long-Term Memory Storage (MS): Permanently learning new information
Long-Term Memory Retrieval (MR): Accessing and retrieving stored knowledge accurately
Visual Processing (V): Image and video analysis and generation
Auditory Processing (A): Sound discrimination, speech recognition, musical processing
Speed (S): Cognitive task performance speed and efficiency

Why Equal Weighting?

The authors explicitly acknowledge that "equal weighting of broad abilities prioritizes breadth but represents one of many possible configurations." This design choice reflects a key insight: true general intelligence requires competence across all domains, not exceptional performance in just a few. An AI that scores 100% in mathematics but 0% in memory is not generally intelligent—it's a specialized tool.
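The equal-weighting arithmetic can be sketched in a few lines. This is a minimal illustration, assuming each domain is scored as a fraction between 0 and 1; the names `DOMAINS`, `WEIGHT`, and `agi_score` are invented for this example and do not come from the paper.

```python
# Illustrative sketch: equal-weighted aggregate over the ten CHC-derived domains.
# The names and the 0..1 per-domain scale are assumptions made for this example.
DOMAINS = ["K", "RW", "M", "R", "WM", "MS", "MR", "V", "A", "S"]
WEIGHT = 100 / len(DOMAINS)  # each domain contributes at most 10 points


def agi_score(fractions: dict[str, float]) -> float:
    """Return a 0-100 score from per-domain fractions in [0, 1]."""
    missing = set(DOMAINS) - set(fractions)
    if missing:
        raise ValueError(f"missing domains: {sorted(missing)}")
    for name, value in fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHT * fractions[d] for d in DOMAINS)


# A hypothetical specialist: perfect at mathematics, nothing else.
specialist = {d: 0.0 for d in DOMAINS}
specialist["M"] = 1.0
specialist_score = agi_score(specialist)  # 10.0
```

Under this scheme a system that is perfect in one domain and absent in the rest tops out at 10 of 100 points, which is the sense in which the weighting rewards breadth.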

Part 2: Evaluating Current Frontier Models

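The paper provides illustrative AGI scores for GPT-4 and GPT-5, revealing the dramatic gaps between current capabilities and true AGI. Each domain contributes at most 10 percentage points to the total, so a domain score of 4% means 4 of the 10 points available in that domain: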

Cognitive Ability	GPT-4	GPT-5	GPT-5 Gap to AGI
General Knowledge (K)	8%	9%	1%
Reading/Writing (RW)	6%	10%	0%
Mathematical (M)	4%	10%	0%
On-the-Spot Reasoning (R)	0%	7%	3%
Working Memory (WM)	2%	4%	6%
Long-Term Memory Storage (MS)	0%	0%	10%
Long-Term Memory Retrieval (MR)	4%	4%	6%
Visual Processing (V)	0%	4%	6%
Auditory Processing (A)	0%	6%	4%
Speed (S)	3%	3%	7%
Total AGI Score	27%	57%	43%
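The totals in the table can be checked directly from the per-domain figures. The sketch below simply re-adds the numbers listed above; the variable names are illustrative.

```python
# Per-domain points (out of 10) as listed in the table above.
GPT4 = {"K": 8, "RW": 6, "M": 4, "R": 0, "WM": 2, "MS": 0, "MR": 4, "V": 0, "A": 0, "S": 3}
GPT5 = {"K": 9, "RW": 10, "M": 10, "R": 7, "WM": 4, "MS": 0, "MR": 4, "V": 4, "A": 6, "S": 3}
MAX_PER_DOMAIN = 10

total_gpt4 = sum(GPT4.values())  # 27
total_gpt5 = sum(GPT5.values())  # 57

# Remaining gap per domain for GPT-5, ranked from largest to smallest.
gap_gpt5 = {d: MAX_PER_DOMAIN - pts for d, pts in GPT5.items()}
total_gap = sum(gap_gpt5.values())  # 43
largest_gaps = sorted(gap_gpt5, key=gap_gpt5.get, reverse=True)[:3]
```

Ranking the gaps this way puts Long-Term Memory Storage (10 points missing) and Speed (7 points missing) at the top, with several domains tied at 6 points behind them.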

🚨 Critical Bottleneck: Long-Term Memory Storage at 0%

The paper identifies Long-Term Memory Storage as "perhaps the most significant bottleneck" for current AI systems. Both GPT-4 and GPT-5 score 0% in this domain, meaning they cannot permanently learn new information after deployment. This forces models to:

Re-learn context in every interaction — users must re-explain preferences, past conversations, and domain-specific knowledge
Rely on massive context windows as a compensatory mechanism, stuffing working memory with what should be long-term knowledge
Use retrieval-augmented generation (RAG) to mask the absence of true learning

This architectural limitation means current AI cannot accumulate knowledge over time like humans do—a fundamental barrier to true general intelligence.
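To make the compensatory pattern concrete, here is a toy sketch of an external note store whose contents are prepended to each prompt. Everything in it (`NoteStore`, `build_prompt`, the word-overlap ranking) is hypothetical and far simpler than a production retrieval system; the point is only that the knowledge lives outside the model, which itself never changes.

```python
# Toy illustration: an external note store feeding a frozen model's prompt.
# All names are hypothetical; this is not any vendor's API.
from dataclasses import dataclass, field


@dataclass
class NoteStore:
    """External memory: plain-text notes retrieved by word overlap."""
    notes: list[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(n.lower().split())), n) for n in self.notes]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [note for score, note in ranked[:k] if score > 0]


def build_prompt(store: NoteStore, user_message: str) -> str:
    """Prepend retrieved notes; the model's weights are never touched."""
    context = store.retrieve(user_message)
    header = "\n".join(f"[note] {c}" for c in context)
    return f"{header}\n[user] {user_message}" if header else f"[user] {user_message}"


store = NoteStore()
store.add("The user prefers metric units")
store.add("The user's project is written in Rust")
prompt = build_prompt(store, "Which units should the report use?")
```

If the store is deleted, the "memory" is gone, which is why the paper treats this kind of mechanism as a workaround rather than learning.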

Part 3: Deep Dive into Each Cognitive Domain

General Knowledge (K)

Tests span commonsense reasoning, science (physics, chemistry, biology), social science (psychology, economics, geography, government), history, and culture. Benchmarks include AP exams (requiring a score of 5) and PIQA (requiring >85% accuracy). Current models perform relatively well here, with GPT-5 achieving near-ceiling performance.

Reading and Writing (RW)

Assessment occurs at three levels: sentence, paragraph, and document. Comprehension benchmarks include Winograd schemas, COQA, ReCoRD, and LAMBADA. Writing ability is tested via GRE Analytical Writing standards. GPT-5 achieves full marks in this domain.

Mathematical Ability (M)

Progresses from arithmetic (GSM8K >95% required) through algebra, geometry, probability, and calculus. Includes both rudimentary and proficient performance tiers using the MATH dataset and AMC/AoPS problems. The jump from GPT-4's 4% to GPT-5's 10% represents a significant improvement in mathematical reasoning.

On-the-Spot Reasoning (R)

This domain covers deduction (LogiQA 2.0), induction (Raven's Progressive Matrices), theory of mind (FANToM, ToMBench), planning (Natural Plan, PlanBench), and adaptation (Wisconsin Card Sorting Test). GPT-4's 0% score here indicates fundamental reasoning limitations that GPT-5 only partially addresses.

Working Memory (WM)

Tests across modalities: textual (recall and transformation), auditory (voice and sound recognition), visual (images, spatial navigation, long video Q&A via VSI-Bench), and cross-modal binding. Even GPT-5 only achieves 4%, suggesting context windows don't translate to robust working memory.

Long-Term Memory Storage (MS)

Assesses associative memory (cross-modal, personalization, procedural), meaningful memory (story and movie recall), and verbatim memory (sequences, sets, designs). The complete failure of current models here reflects how they are built and deployed: weights are frozen after training, so nothing learned in one interaction persists to the next without external storage.

Long-Term Memory Retrieval (MR)

Measures fluency (ideational, expressional, alternative solutions, word, naming, figural) and precision (hallucination rates via Vectara HHEM, SimpleQA requiring <5% error rate). The 4% score reflects persistent hallucination problems even in frontier models.
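A precision bar like the SimpleQA one reduces to a simple threshold check. The sketch below assumes that "error rate" means incorrect answers divided by all questions, so that abstaining does not count as a hallucination; that definition, the grade labels, and the sample data are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter

# Assumption for this sketch: error rate = incorrect answers / all questions,
# so abstaining ("not_attempted") does not count as a hallucination.
ERROR_RATE_THRESHOLD = 0.05  # the <5% bar quoted above


def error_rate(grades: list[str]) -> float:
    """Fraction of graded answers labelled 'incorrect'."""
    if not grades:
        raise ValueError("no graded answers")
    counts = Counter(grades)
    return counts["incorrect"] / len(grades)


def meets_precision_bar(grades: list[str]) -> bool:
    """True when the error rate is strictly below the threshold."""
    return error_rate(grades) < ERROR_RATE_THRESHOLD


# Made-up grades for illustration only: 90 correct, 6 abstentions, 4 wrong.
sample = ["correct"] * 90 + ["not_attempted"] * 6 + ["incorrect"] * 4
sample_passes = meets_precision_bar(sample)
```

Under this definition a model can pass by declining to answer when unsure, which is one reason the choice of denominator matters when comparing reported hallucination rates.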

Visual Processing (V)

Perception (ImageNet >85%, ImageNet-R >90%), generation (images and videos), reasoning (SPACE, SpatialViz-Bench >80%, IntPhysics 2, CharXiv), and spatial scanning. GPT-5's improvement to 4% reflects better multimodal capabilities but significant gaps remain.

Auditory Processing (A)

Phonetic coding, speech recognition (LibriSpeech WER thresholds), voice quality recognition, rhythmic ability, and musical judgment. GPT-5's 6% score reflects native audio capabilities in multimodal models.
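Word error rate (WER), the metric behind the LibriSpeech thresholds, is the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. A minimal sketch follows; the lowercase-and-split normalization is an assumption of this example (real evaluations usually apply more careful text normalization), and no pass threshold is applied because this summary does not state one.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # At the start of outer iteration i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.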

Speed (S)

Ten distinct abilities: perceptual search/compare, reading/writing speed, number facility, simple/choice reaction time, inspection time, comparison speed, and pointer fluency. Both models score 3%, suggesting speed is not a primary focus of current architectures.

Part 4: Capability Contortions — How Models Mask Weaknesses

Compensatory Mechanisms in Current AI

The paper introduces the concept of "capability contortions"—techniques that mask fundamental deficits rather than solving them:

Massive Context Windows: Stuffing 128K+ tokens into context substitutes for true long-term memory, but forces repeated re-reading and incurs latency/cost
Retrieval-Augmented Generation (RAG): External memory stores mask hallucination problems and memory limitations but add infrastructure complexity
Chain-of-Thought Prompting: Serializes reasoning to compensate for weak working memory but increases token usage
Tool Use: Offloads computation (calculators, code execution) to mask reasoning deficits

These workarounds achieve impressive results but represent engineering solutions to architectural limitations, not progress toward true AGI.
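The cost of substituting a context window for memory can be estimated with simple arithmetic. The sketch below assumes every turn adds the same number of tokens and that the whole history is re-processed on each turn; it ignores prompt caching and similar optimizations that real systems may use, and the function names and figures are illustrative.

```python
def cumulative_tokens_reread(turns: int, tokens_per_turn: int) -> int:
    """Total tokens processed when the full history is re-sent every turn."""
    # Turn i carries i * tokens_per_turn tokens of context, so the total is
    # tokens_per_turn * (1 + 2 + ... + turns).
    return tokens_per_turn * turns * (turns + 1) // 2


def cumulative_tokens_persistent(turns: int, tokens_per_turn: int) -> int:
    """Hypothetical baseline: each turn's content is processed only once."""
    return tokens_per_turn * turns


reread = cumulative_tokens_reread(turns=50, tokens_per_turn=500)    # 637500
once = cumulative_tokens_persistent(turns=50, tokens_per_turn=500)  # 25000
```

In this example, re-reading costs about 25 times more than processing each turn once, and the ratio keeps growing with conversation length because the re-read total is quadratic in the number of turns.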

Part 5: What AGI Is Not — Related Concepts Distinguished

The paper provides crucial definitional clarity by distinguishing AGI from related but distinct concepts:

Concept	Definition	Relationship to AGI
Superintelligence	Exceeding human performance across all cognitive domains	Beyond AGI; requires >100% on framework
Pandemic AI	Capable of engineering infectious pathogens	Narrow capability, not general
Cyberwarfare AI	Executing sophisticated cyber campaigns autonomously	Narrow capability, not general
Self-Sustaining AI	Operating autonomously indefinitely	Operational characteristic, not cognitive
Recursive AI	Conducting entire AI R&D independently	Requires AGI plus specialized capabilities
Replacement AI	Performing almost all human tasks more effectively	Economic impact metric, not cognitive

Part 6: Barriers to Achieving 100% AGI

Unsolved Challenges

The framework identifies specific barriers that must be overcome for complete AGI:

Abstract Reasoning: The ARC-AGI Challenge remains unsolved, requiring novel visual abstraction abilities
World Models: Intuitive physics understanding and causal reasoning about the physical world
Spatial Navigation Memory: Maintaining and updating mental maps over time
Hallucination Reduction: Current best models still exceed acceptable error rates on SimpleQA
Continual Learning: No current architecture can permanently learn from new experiences post-deployment

Part 7: Implications for AI Development

Development Priorities from This Framework

Architectural Innovation Needed: The 0% long-term memory score suggests transformers may not be sufficient for AGI without fundamental changes
Balanced Development: Equal weighting means progress on reasoning matters as much as language—narrow optimization won't reach AGI
Benchmark Design: Current benchmarks overweight language and knowledge; we need more working memory, reasoning, and learning evaluations
Honest Assessment: The framework exposes that even GPT-5 is less than 60% of the way to human-level general intelligence

Timeline Implications

The authors conclude that achieving a 100% AGI score remains "unlikely in the next year." The 30 percentage point gap between GPT-4 and GPT-5 suggests rapid progress is possible, but the remaining 43 percentage points include the hardest challenges: long-term memory (requiring architectural breakthroughs), abstract reasoning (no clear path), and hallucination elimination (fundamental to reliability).

Conclusion

This paper represents a crucial step toward rigorous AGI discourse. By grounding the definition in established cognitive science and providing quantifiable metrics, the authors enable meaningful progress measurement and honest capability assessment.

Framework: Ten equally-weighted cognitive domains derived from CHC theory, totaling 100%
Current State: GPT-4 at 27%, GPT-5 at 57%—significant progress but far from complete
Critical Gap: Long-Term Memory Storage at 0% represents an architectural limitation, not just a training deficit
Key Insight: Current AI has "jagged" profiles—exceptional at some tasks, completely failing others
Path Forward: Requires solving continual learning, abstract reasoning, and hallucination problems

The framework reveals that claims of imminent AGI should be viewed skeptically—even the most advanced models fail fundamental cognitive tests that well-educated adults pass routinely.

Primary Sources

A Definition of AGI
Hendrycks, Song, Szegedy, Bengio, Marcus, Tegmark, Schmidt et al., October 2025

Full HTML Version with Figures
arXiv HTML rendering with complete methodology details
