This landmark theoretical paper answers a fundamental question: why does chain-of-thought (CoT) prompting work? The authors prove that CoT fundamentally expands what transformers can compute by enabling serial computation in an otherwise parallel architecture.
The core finding: constant-depth transformers with finite precision can only solve problems in AC⁰ (highly parallelizable problems) without CoT. But with T steps of CoT, the same transformer can solve any problem computable by boolean circuits of size T. With polynomial CoT steps, transformers achieve P/poly expressiveness—capable of solving essentially any efficiently computable problem.
Empirical validation on four benchmark tasks—modular addition, permutation composition, iterated squaring, and circuit value problems—confirms that CoT provides dramatic accuracy improvements specifically on tasks requiring sequential reasoning, while providing minimal benefit on parallelizable tasks.
Imagine a transformer as a factory with many workers doing tasks simultaneously—great for parallel work, but terrible for tasks where step 2 depends on step 1's result. A standard transformer is like asking 100 workers to each guess the final answer without talking to each other.
Chain-of-thought is like giving workers a shared notepad. Worker 1 writes their result, Worker 2 reads it and writes theirs, and so on. Each CoT step is one "round" of sequential work. A 10-step CoT means 10 sequential rounds—suddenly the parallel factory can solve problems that inherently require doing things in order, like multi-digit arithmetic where you must carry digits from right to left.
Transformers process information in parallel across all positions simultaneously. Each layer performs one "round" of parallel computation. This is incredibly efficient—but it creates a fundamental limitation for problems requiring sequential dependencies.
Many important problems are inherently serial—later steps fundamentally depend on earlier results:
- Iterated multiplication: 7 × 8 × 9 × 10 requires carrying intermediate products forward, one factor at a time

A transformer with constant depth cannot solve these problems regardless of width, because it lacks enough sequential "rounds" to propagate dependencies.
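The serial dependency can be made concrete with a minimal sketch (function name and format chosen for illustration): each partial product is an input to the next step, so no step can start before the previous one finishes.

```python
def iterated_product(xs):
    """Multiply integers left to right, recording each partial product."""
    partials = []
    acc = 1
    for x in xs:
        acc *= x          # step t depends on the result of step t-1
        partials.append(acc)
    return partials

# 7 * 8 * 9 * 10: each entry depends on the previous one.
# iterated_product([7, 8, 9, 10]) -> [7, 56, 504, 5040]
```

The chain of partial products is exactly the kind of intermediate state a CoT transcript externalizes.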
| Complexity Class | Informal Definition | Example Problems |
|---|---|---|
| AC⁰ | Constant-depth, polynomial-size circuits with unbounded fan-in AND/OR gates | Parity of a constant number of bits, simple pattern matching |
| TC⁰ | AC⁰ plus majority/threshold gates | Integer addition, multiplication |
| NC¹ | O(log n) depth, polynomial size, bounded fan-in | Formula evaluation, regular expressions |
| P | Polynomial time on sequential machines | Sorting, graph connectivity, linear programming |
| P/poly | P with polynomial-size advice string | All problems in P plus some undecidable problems |
The key relationship: AC⁰ ⊂ TC⁰ ⊆ NC¹ ⊆ P ⊆ P/poly
Without CoT, constant-depth transformers are stuck in AC⁰ or TC⁰. With polynomial CoT, they can reach P/poly. This is a massive expansion in computational power—the difference between "can only solve highly parallel problems" and "can solve essentially anything efficiently computable."
The paper establishes both upper bounds (what transformers cannot do) and lower bounds (what they can do with CoT), creating a complete picture of CoT's computational power.
Any decoder-only transformer with constant depth (a fixed number of layers, independent of input length) and constant-bit (finite) precision can only solve problems in AC⁰.
This bound is sharp in a striking way: the parity of n bits seems trivial, yet it lies outside AC⁰. Constant-depth transformers without CoT therefore cannot even reliably compute parity.
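To see why the parity result is striking, here is a minimal sketch of parity as one serial pass of XORs. A sequential machine solves it in a single loop, yet no constant-depth AC⁰ circuit can.

```python
from functools import reduce
from operator import xor

def parity(bits):
    """XOR-fold: one serial pass computes the parity of arbitrarily many bits."""
    return reduce(xor, bits, 0)

# parity([1, 0, 1, 1]) -> 1  (three ones, so odd parity)
```

Each XOR consumes the running result of all previous XORs, which is precisely the sequential dependency that constant depth cannot provide.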
With O(log n)-bit precision (enough to index positions), constant-depth transformers can reach TC⁰, but no further. They can perform threshold operations but still cannot solve problems requiring sequential depth.
A constant-depth transformer equipped with T steps of chain-of-thought generation can compute any problem solvable by a boolean circuit of size T.
Corollary: With polynomial CoT steps (T = poly(n)), constant-depth transformers can solve all problems in P/poly.
The proof constructs an explicit transformer that simulates boolean circuits:

1. The circuit's description (gates and wiring) is encoded in the transformer's weights
2. At each CoT step, attention retrieves the values of one gate's inputs from the previously generated tokens
3. The gate's output is emitted as the next token, so after T steps every gate of a size-T circuit has been evaluated

Since any polynomial-time computation can be represented as a polynomial-size circuit, polynomial CoT steps suffice for P/poly.
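As a toy illustration of this simulation idea (a simplified analogue, not the paper's actual construction; the gate encoding below is an assumption made for this sketch), a circuit can be evaluated one gate per "CoT step", appending each gate's output to a growing transcript the way a transformer appends one token per step:

```python
def evaluate_circuit(inputs, gates):
    """Evaluate a boolean circuit one gate per 'CoT step'.

    gates: list of (op, i, j) where i and j index earlier transcript
    entries (the inputs come first). One gate is evaluated per step.
    """
    transcript = list(inputs)           # the "scratchpad" of generated tokens
    for op, i, j in gates:              # one gate == one CoT step
        a, b = transcript[i], transcript[j]
        if op == "AND":
            transcript.append(a & b)
        elif op == "OR":
            transcript.append(a | b)
        elif op == "XOR":
            transcript.append(a ^ b)
    return transcript[-1]               # the last gate's value is the answer

# (x0 AND x1) XOR x2 on inputs 1, 1, 0:
# evaluate_circuit([1, 1, 0], [("AND", 0, 1), ("XOR", 3, 2)]) -> 1
```

The transcript plays the role of the generated token sequence: every gate can read any earlier value (as attention can), but values only appear one step at a time.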
This theorem explains when CoT helps: it is essential when a problem's serial depth exceeds the model's fixed number of layers, and it provides little benefit on parallelizable problems already within AC⁰/TC⁰.
The authors validate the theory on four tasks chosen specifically for their serial computation requirements:
| Task | Serial Depth | Description | Why It's Hard |
|---|---|---|---|
| Modular Addition | O(n) | Sum k numbers mod p | Carry propagation requires sequential steps |
| Permutation Composition | O(k) | Compose k permutations on n elements | Each composition depends on previous result |
| Iterated Squaring | O(k) | Compute x^(2^k) mod p | Each squaring depends on previous square |
| Circuit Value Problem | O(depth) | Evaluate boolean circuit | Gate outputs propagate through layers |
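Reference solutions for two of these tasks make the serial structure explicit (a minimal sketch; function names and input formats are chosen for illustration, not taken from the paper's setup): in both loops, step t cannot begin until step t-1 has produced its result.

```python
def iterated_squaring(x, k, p):
    """Compute x**(2**k) mod p by k successive squarings."""
    for _ in range(k):
        x = (x * x) % p   # each squaring depends on the previous square
    return x

def compose_permutations(perms):
    """Compose permutations (given as index tuples) left to right."""
    result = tuple(range(len(perms[0])))
    for perm in perms:
        # each composition depends on the previously accumulated permutation
        result = tuple(perm[i] for i in result)
    return result

# iterated_squaring(3, 3, 7) computes 3**8 mod 7 -> 2
# compose_permutations([(1, 0), (1, 0)]) -> (0, 1), a swap applied twice
```

A k-step CoT transcript for either task simply writes out the loop's intermediate values, one per step.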
Across all four tasks, transformers with CoT significantly outperformed direct-answer transformers:
| Configuration | Direct Answer | With CoT | Improvement |
|---|---|---|---|
| 2-layer transformer, 5-permutation composition | ~0% | ~95% | +95% |
| 2-layer transformer, iterated squaring k=8 | Random | High accuracy | Dramatic |
| 4-layer transformer, circuit depth 16 | ~50% | ~90% | +40% |
A crucial empirical finding: shallow transformers + CoT match or exceed deep transformers without CoT. This suggests CoT is not just a prompting trick but a fundamental architectural enhancement.
The experiments confirm that CoT length should match the problem's serial depth: with too few steps, the model cannot complete the required sequential computation.
This explains why "let's think step by step" works: it encourages the model to generate enough intermediate tokens to complete the serial computation.
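For the modular addition task, a CoT transcript that matches the serial depth looks like the following sketch (the transcript format is an assumption for illustration): one partial sum is emitted per step, so summing k numbers needs k steps.

```python
def modular_addition_cot(numbers, p):
    """Produce a CoT-style transcript for summing numbers mod p:
    one intermediate token (a partial sum) per serial step."""
    transcript = []
    acc = 0
    for x in numbers:
        acc = (acc + x) % p
        transcript.append(acc)   # the model would emit this as a token
    return transcript

# Summing [5, 9, 4] mod 7 takes 3 steps: transcript is [5, 0, 4]
```

The transcript length equals the number of summands, which is exactly the O(n) serial depth listed in the task table above.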
| Design Choice | Without CoT | With CoT |
|---|---|---|
| Required depth | Must scale with problem serial depth | Constant depth sufficient |
| Inference cost | Fixed per token | Scales with reasoning steps |
| Problem coverage | Limited to AC⁰/TC⁰ | Up to P/poly |
| Training signal | Final answer only | Intermediate steps provide supervision |
This paper provides theoretical grounding for "test-time compute scaling"—the observation that spending more compute at inference (via CoT) can substitute for larger models. The mechanism is now clear: CoT steps provide the serial depth that model layers cannot.
A shallow model with 100 CoT steps can solve problems that would require ~100 layers to solve directly. This has profound implications for efficient deployment: train a smaller model, use CoT at inference for complex problems.
This paper provides the first rigorous theoretical explanation for why chain-of-thought prompting improves LLM reasoning. The core insight is elegant: CoT converts parallel architectures into serial computers by using intermediate tokens as a "scratchpad" for sequential computation.
Key takeaways for practitioners:

- Use CoT on tasks with sequential dependencies; it adds little on parallelizable tasks
- Scale the number of CoT steps with the problem's serial depth
- A shallow model with CoT can substitute for a much deeper model without it
The era of "just make the model bigger" may give way to "make the model think longer"—and this paper explains why that works.
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Li, Liu, Zhou, Ma — February 2024