Chain of Thought Empowers Transformers
to Solve Inherently Serial Problems

Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma
Stanford University, Toyota Technological Institute at Chicago, Google
February 2024

Executive Summary

This landmark theoretical paper answers a fundamental question: why does chain-of-thought (CoT) prompting work? The authors prove that CoT fundamentally expands what transformers can compute by enabling serial computation in an otherwise parallel architecture.

The core finding: constant-depth transformers with constant precision can only solve problems in AC⁰ (highly parallelizable problems) without CoT. But with T steps of CoT, the same transformer can solve any problem computable by boolean circuits of size T. With polynomial CoT steps, transformers achieve P/poly expressiveness—capable of solving essentially any efficiently computable problem.

Empirical validation on four benchmark tasks—modular addition, permutation composition, iterated squaring, and circuit value problems—confirms that CoT provides dramatic accuracy improvements specifically on tasks requiring sequential reasoning, while providing minimal benefit on parallelizable tasks.

🎯 ELI5: Why Chain-of-Thought Works

Imagine a transformer as a factory with many workers doing tasks simultaneously—great for parallel work, but terrible for tasks where step 2 depends on step 1's result. A standard transformer is like asking 100 workers to each guess the final answer without talking to each other.

Chain-of-thought is like giving workers a shared notepad. Worker 1 writes their result, Worker 2 reads it and writes theirs, and so on. Each CoT step is one "round" of sequential work. A 10-step CoT means 10 sequential rounds—suddenly the parallel factory can solve problems that inherently require doing things in order, like multi-digit arithmetic where you must carry digits from right to left.
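The carry example above can be made concrete. This is a toy sketch (not from the paper): each loop iteration plays the role of one CoT token written to the shared notepad, and each iteration depends on the carry produced by the previous one.

```python
def add_with_carry(a: str, b: str) -> str:
    """Add two decimal strings the way a CoT trace would:
    one sequential step per digit, carrying right to left.
    Assumes equal-length inputs for brevity."""
    notepad = []          # the shared "notepad": one entry per CoT step
    carry = 0
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry   # depends on the PREVIOUS step's carry
        notepad.append(str(total % 10))
        carry = total // 10
    if carry:
        notepad.append(str(carry))
    return "".join(reversed(notepad))

print(add_with_carry("857", "968"))  # 1825
```

No amount of extra workers (width) removes the dependency of each step on the previous carry; only more sequential rounds do.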

Part 1: The Depth Problem in Transformers

Transformers process information in parallel across all positions simultaneously. Each layer performs one "round" of parallel computation. This is incredibly efficient—but it creates a fundamental limitation for problems requiring sequential dependencies.

The Serial Computation Gap

Many important problems are inherently serial—later steps fundamentally depend on earlier results:

- Multi-digit arithmetic, where carries propagate from one position to the next
- Composing a sequence of permutations or functions, where each composition needs the previous result
- Iterated squaring, where each square depends on the previous one
- Evaluating a boolean circuit, where gate outputs feed later gates

A transformer with constant depth cannot solve these problems regardless of width, because it lacks enough sequential "rounds" to propagate dependencies.

Complexity Class Background

| Complexity Class | Informal Definition | Example Problems |
|---|---|---|
| AC⁰ | Constant-depth, polynomial-size circuits with unbounded fan-in AND/OR gates | Parity of a fixed number of bits, simple pattern matching |
| TC⁰ | AC⁰ plus majority/threshold gates | Integer addition, multiplication |
| NC¹ | O(log n) depth, polynomial size, bounded fan-in | Formula evaluation, regular expressions |
| P | Polynomial time on sequential machines | Sorting, graph connectivity, linear programming |
| P/poly | P with polynomial-size advice string | All problems in P plus some undecidable problems |

The Hierarchy That Matters

The key relationship: AC⁰ ⊂ TC⁰ ⊆ NC¹ ⊆ P ⊆ P/poly

Without CoT, constant-depth transformers are stuck in AC⁰ or TC⁰. With polynomial CoT, they can reach P/poly. This is a massive expansion in computational power—the difference between "can only solve highly parallel problems" and "can solve essentially anything efficiently computable."

Part 2: Theoretical Results

The paper establishes both upper bounds (what transformers cannot do) and lower bounds (what they can do with CoT), creating a complete picture of CoT's computational power.

Upper Bounds: Transformers Without CoT

Theorem 3.1: Constant-Precision Upper Bound

Any decoder-only transformer with:

- constant depth (independent of input length n)
- constant-bit precision for internal activations
- polynomial embedding dimension

can only solve problems in AC⁰.

This bound is striking: AC⁰ cannot compute the parity of n bits, a problem that seems trivial. Constant-depth transformers without CoT therefore cannot even reliably compute parity.
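Parity illustrates how CoT sidesteps the limit. This sketch (my illustration, not the paper's construction) computes parity as a chain of trivial serial steps, one "CoT token" per bit; each step needs only the previous token.

```python
def parity_with_cot(bits: list[int]) -> int:
    """Compute parity as a chain of T = len(bits) sequential steps.
    Each 'CoT token' is the running XOR; step t needs only step t-1."""
    trace = [0]                      # the CoT trace of intermediate parities
    for b in bits:
        trace.append(trace[-1] ^ b)  # one serial step per bit
    return trace[-1]

print(parity_with_cot([1, 0, 1, 1]))  # 1
```

A single parallel pass cannot do this in constant depth, but n sequential steps make it trivial.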

Theorem 3.2: Log-Precision Upper Bound

With O(log n)-bit precision (enough to index positions), constant-depth transformers can reach TC⁰, but no further. They can perform threshold operations but still cannot solve problems requiring sequential depth.

Lower Bounds: What CoT Enables

Theorem 3.3: CoT Expressiveness (Main Result)

A constant-depth transformer with:

- constant depth and polynomial embedding dimension
- O(log n)-bit precision
- T steps of chain of thought

can compute any problem solvable by a boolean circuit of size T.

Corollary: With polynomial CoT steps (T = poly(n)), constant-depth transformers can solve all problems in P/poly.

Expressiveness(Transformer + T-step CoT) ⊇ SIZE[T]
where SIZE[T] = problems solvable by circuits of size T

Proof Intuition

The proof constructs an explicit transformer that simulates boolean circuits:

  1. Encoding: Each CoT token encodes the output of one circuit gate
  2. Simulation: The transformer's attention mechanism reads previous gate outputs
  3. Computation: The MLP computes the current gate's function (AND, OR, NOT)
  4. Iteration: Each CoT step evaluates one more "layer" of gates

Since any polynomial-time computation can be represented as a polynomial-size circuit, polynomial CoT steps suffice for P/poly.
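The four-step simulation above can be sketched in plain Python (a toy model of the construction, not the paper's transformer weights): the growing list of wire values stands in for the CoT transcript, reading earlier wires stands in for attention, and applying the gate function stands in for the MLP.

```python
# Hypothetical circuit format: each gate is (op, input_indices);
# indices refer to earlier wires (circuit inputs first, then gate outputs).
OPS = {
    "AND": lambda x, y: x & y,
    "OR":  lambda x, y: x | y,
    "NOT": lambda x: 1 - x,
}

def eval_circuit(inputs, gates):
    """One 'CoT step' per gate: read earlier wire values (attention),
    compute the gate's function (MLP), append the result (new token)."""
    wires = list(inputs)                 # the growing CoT transcript
    for op, args in gates:
        wires.append(OPS[op](*(wires[i] for i in args)))
    return wires[-1]                     # output of the final gate

# (x0 AND x1) OR (NOT x2)
gates = [("AND", (0, 1)), ("NOT", (2,)), ("OR", (3, 4))]
print(eval_circuit([1, 0, 0], gates))  # 1
```

A circuit of size T needs exactly T iterations of this loop, which is the theorem's T CoT steps.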

Why This Matters for LLM Practitioners

This theorem explains when CoT helps:

- CoT provides the largest gains on problems with high serial depth, where each step depends on the last (multi-step arithmetic, function composition, step-by-step simulation).
- CoT provides little benefit on shallow, parallelizable problems that already fit in AC⁰/TC⁰ (lookup, simple pattern matching).
- The number of CoT steps is a budget of serial computation: T steps buy circuits of size T.

Part 3: Empirical Validation

The authors validate the theory on four tasks chosen specifically for their serial computation requirements:

| Task | Serial Depth | Description | Why It's Hard |
|---|---|---|---|
| Modular Addition | O(n) | Sum k numbers mod p | Carry propagation requires sequential steps |
| Permutation Composition | O(k) | Compose k permutations on n elements | Each composition depends on previous result |
| Iterated Squaring | O(k) | Compute x^(2^k) mod p | Each squaring depends on previous square |
| Circuit Value Problem | O(depth) | Evaluate boolean circuit | Gate outputs propagate through layers |
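Two of these tasks are easy to state in code, which makes their serial structure visible. A minimal sketch (my illustration of the task definitions, not the paper's experimental setup; the composition convention is one reasonable choice):

```python
def iterated_squaring(x: int, k: int, p: int) -> int:
    """Compute x^(2^k) mod p: each squaring needs the previous square."""
    for _ in range(k):              # k inherently sequential steps
        x = x * x % p
    return x

def compose_permutations(perms: list[list[int]]) -> list[int]:
    """Apply k permutations in order; step i needs step i-1's result."""
    result = list(range(len(perms[0])))   # start from the identity
    for p in perms:
        result = [p[i] for i in result]   # one sequential composition
    return result

print(iterated_squaring(3, 3, 7))                        # 2, equals pow(3, 8, 7)
print(compose_permutations([[1, 2, 0], [0, 2, 1]]))      # [2, 1, 0]
```

In both loops, iteration i cannot start until iteration i-1 finishes; that dependency chain is exactly what a constant-depth parallel pass lacks.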

Experimental Results

Key Finding: CoT Dramatically Improves Serial Tasks

Across all four tasks, transformers with CoT significantly outperformed direct-answer transformers:

| Configuration | Direct Answer | With CoT | Improvement |
|---|---|---|---|
| 2-layer transformer, 5-permutation composition | ~0% | ~95% | +95% |
| 2-layer transformer, iterated squaring k=8 | Random | High accuracy | Dramatic |
| 4-layer transformer, circuit depth 16 | ~50% | ~90% | +40% |

Depth vs CoT Trade-off

A crucial empirical finding: shallow transformers + CoT match or exceed deep transformers without CoT. This suggests CoT is not just a prompting trick but a fundamental architectural enhancement.

Practical Implication: CoT Length Matters

The experiments confirm that CoT length should match problem serial depth:

- Traces shorter than the problem's serial depth leave the computation unfinished, and accuracy drops sharply.
- Traces with roughly one step per serial dependency let even shallow models reach high accuracy.

This explains why "let's think step by step" works: it encourages the model to generate enough intermediate tokens to complete the serial computation.
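The length-matching point can be made concrete for modular addition. In this sketch (my illustration, not the paper's prompt format), the trace contains one intermediate token per addend, so its length equals the task's serial depth from the table above.

```python
def modular_sum_cot(nums: list[int], p: int) -> list[int]:
    """Generate a CoT trace for summing nums mod p: one intermediate
    token per addend, so trace length equals the serial depth."""
    trace, acc = [], 0
    for x in nums:
        acc = (acc + x) % p
        trace.append(acc)       # each token depends only on the previous one
    return trace

trace = modular_sum_cot([4, 9, 6, 8], 7)
print(trace)        # [4, 6, 5, 6] — running sums mod 7
print(len(trace))   # 4: one step per addend
```

Truncating this trace before the last token leaves the sum incomplete; no shorter trace of this form can carry the answer.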

Part 4: Implications for AI Development

When to Use Chain-of-Thought

CoT High-Value Scenarios

- Multi-step arithmetic and symbolic manipulation with carried state
- Composing many operations where each depends on the previous result
- Simulating a process step by step (circuit or program evaluation)
- Long chains of logical deduction

CoT Low-Value Scenarios

- Retrieval and lookup tasks answerable in a single parallel pass
- Simple pattern matching and classification
- Problems already within AC⁰/TC⁰, such as integer addition

Implications for Model Architecture

Design Choice Without CoT With CoT
Required depth Must scale with problem serial depth Constant depth sufficient
Inference cost Fixed per token Scales with reasoning steps
Problem coverage Limited to AC⁰/TC⁰ Up to P/poly
Training signal Final answer only Intermediate steps provide supervision

The "Test-Time Compute" Connection

This paper provides theoretical grounding for "test-time compute scaling"—the observation that spending more compute at inference (via CoT) can substitute for larger models. The mechanism is now clear: CoT steps provide the serial depth that model layers cannot.

A shallow model with 100 CoT steps can solve problems that would require ~100 layers to solve directly. This has profound implications for efficient deployment: train a smaller model, use CoT at inference for complex problems.

Part 5: Limitations and Open Questions

What This Paper Does Not Claim

- That trained models actually learn the circuit-simulating constructions used in the proofs: the results concern expressiveness, not learnability.
- That longer CoT always improves real-world accuracy: the theory bounds what is possible, not what gradient descent finds.
- That the separations are unconditional: showing CoT strictly helps relies on standard complexity-theoretic conjectures (e.g., that TC⁰ is strictly weaker than NC¹).

Open Questions

- Can gradient-based training reliably discover the serial algorithms that the expressiveness results guarantee exist?
- What is the minimal CoT length for a given problem, and can models learn to choose it adaptively?
- Do these results transfer to length generalization, i.e., inputs longer than those seen in training?

Conclusion

This paper provides the first rigorous theoretical explanation for why chain-of-thought prompting improves LLM reasoning. The core insight is elegant: CoT converts parallel architectures into serial computers by using intermediate tokens as a "scratchpad" for sequential computation.

Key takeaways for practitioners:

- Use CoT when a task requires sequential, dependent steps; skip it for parallelizable lookups and pattern matching.
- Budget CoT length to match the problem's serial depth; a trace that is too short cannot finish the computation.
- Shallow models with long CoT can match deeper models without it, enabling cheaper deployment.

The era of "just make the model bigger" may give way to "make the model think longer"—and this paper explains why that works.

Primary Sources

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Li, Liu, Zhou, Ma — February 2024