This landmark theoretical paper answers a fundamental question: why does chain-of-thought (CoT) prompting work? The authors prove that CoT fundamentally expands what transformers can compute by enabling serial computation in an otherwise parallel architecture.
The core finding: constant-depth transformers with finite precision can only solve problems in AC⁰ (highly parallelizable problems) without CoT. But with T steps of CoT, the same transformer can solve any problem computable by boolean circuits of size T. With polynomial CoT steps, transformers achieve P/poly expressiveness—capable of solving essentially any efficiently computable problem.
Empirical validation on four benchmark tasks—modular addition, permutation composition, iterated squaring, and circuit value problems—confirms that CoT provides dramatic accuracy improvements specifically on tasks requiring sequential reasoning, while providing minimal benefit on parallelizable tasks.
Imagine a transformer as a factory with many workers doing tasks simultaneously—great for parallel work, but terrible for tasks where step 2 depends on step 1's result. A standard transformer is like asking 100 workers to each guess the final answer without talking to each other.
Chain-of-thought is like giving workers a shared notepad. Worker 1 writes their result, Worker 2 reads it and writes theirs, and so on. Each CoT step is one "round" of sequential work. A 10-step CoT means 10 sequential rounds—suddenly the parallel factory can solve problems that inherently require doing things in order, like multi-digit arithmetic where you must carry digits from right to left.
Transformers process information in parallel across all positions simultaneously. Each layer performs one "round" of parallel computation. This is incredibly efficient—but it creates a fundamental limitation for problems requiring sequential dependencies.
Many important problems are inherently serial—later steps fundamentally depend on earlier results:
- Iterated multiplication: 7 × 8 × 9 × 10 requires carrying intermediate products forward, one factor at a time

A transformer with constant depth cannot solve these problems regardless of width, because it lacks enough sequential "rounds" to propagate dependencies.
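The serial dependency can be made concrete with a minimal sketch (function name and format chosen for illustration): each partial product is an input to the next step, so no step can start before the previous one finishes.

```python
def iterated_product(xs):
    """Multiply integers left to right, recording each partial product."""
    partials = []
    acc = 1
    for x in xs:
        acc *= x          # step t depends on the result of step t-1
        partials.append(acc)
    return partials

# 7 * 8 * 9 * 10: each entry depends on the previous one.
# iterated_product([7, 8, 9, 10]) -> [7, 56, 504, 5040]
```

The chain of partial products is exactly the kind of intermediate state a CoT transcript externalizes.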
| Complexity Class | Informal Definition | Example Problems |
|---|---|---|
| AC⁰ | Constant-depth, polynomial-size circuits with unbounded fan-in AND/OR gates | Parity of a constant number of bits, simple pattern matching |
| TC⁰ | AC⁰ plus majority/threshold gates | Integer addition, multiplication |
| NC¹ | O(log n) depth, polynomial size, bounded fan-in | Formula evaluation, regular expressions |
| P | Polynomial time on sequential machines | Sorting, graph connectivity, linear programming |
| P/poly | P with polynomial-size advice string | All problems in P plus some undecidable problems |
The key relationship: AC⁰ ⊂ TC⁰ ⊆ NC¹ ⊆ P ⊆ P/poly
Without CoT, constant-depth transformers are stuck in AC⁰ or TC⁰. With polynomial CoT, they can reach P/poly. This is a massive expansion in computational power—the difference between "can only solve highly parallel problems" and "can solve essentially anything efficiently computable."
The paper establishes both upper bounds (what transformers cannot do) and lower bounds (what they can do with CoT), creating a complete picture of CoT's computational power.
Any decoder-only transformer with constant depth (a fixed number of layers, independent of input length) and constant-bit (finite) precision can only solve problems in AC⁰.
This bound is sharp in a striking way: the parity of n bits seems trivial, yet it lies outside AC⁰. Constant-depth transformers without CoT therefore cannot even reliably compute parity.
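To see why the parity result is striking, here is a minimal sketch of parity as one serial pass of XORs. A sequential machine solves it in a single loop, yet no constant-depth AC⁰ circuit can.

```python
from functools import reduce
from operator import xor

def parity(bits):
    """XOR-fold: one serial pass computes the parity of arbitrarily many bits."""
    return reduce(xor, bits, 0)

# parity([1, 0, 1, 1]) -> 1  (three ones, so odd parity)
```

Each XOR consumes the running result of all previous XORs, which is precisely the sequential dependency that constant depth cannot provide.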
With O(log n)-bit precision (enough to index positions), constant-depth transformers can reach TC⁰, but no further. They can perform threshold operations but still cannot solve problems requiring sequential depth.
A constant-depth transformer equipped with T steps of chain-of-thought generation can compute any problem solvable by a boolean circuit of size T.
Corollary: With polynomial CoT steps (T = poly(n)), constant-depth transformers can solve all problems in P/poly.
The proof constructs an explicit transformer that simulates boolean circuits:

1. The circuit's description (gates and wiring) is encoded in the transformer's weights
2. At each CoT step, attention retrieves the values of one gate's inputs from the previously generated tokens
3. The gate's output is emitted as the next token, so after T steps every gate of a size-T circuit has been evaluated

Since any polynomial-time computation can be represented as a polynomial-size circuit, polynomial CoT steps suffice for P/poly.
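As a toy illustration of this simulation idea (a simplified analogue, not the paper's actual construction; the gate encoding below is an assumption made for this sketch), a circuit can be evaluated one gate per "CoT step", appending each gate's output to a growing transcript the way a transformer appends one token per step:

```python
def evaluate_circuit(inputs, gates):
    """Evaluate a boolean circuit one gate per 'CoT step'.

    gates: list of (op, i, j) where i and j index earlier transcript
    entries (the inputs come first). One gate is evaluated per step.
    """
    transcript = list(inputs)           # the "scratchpad" of generated tokens
    for op, i, j in gates:              # one gate == one CoT step
        a, b = transcript[i], transcript[j]
        if op == "AND":
            transcript.append(a & b)
        elif op == "OR":
            transcript.append(a | b)
        elif op == "XOR":
            transcript.append(a ^ b)
    return transcript[-1]               # the last gate's value is the answer

# (x0 AND x1) XOR x2 on inputs 1, 1, 0:
# evaluate_circuit([1, 1, 0], [("AND", 0, 1), ("XOR", 3, 2)]) -> 1
```

The transcript plays the role of the generated token sequence: every gate can read any earlier value (as attention can), but values only appear one step at a time.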
This theorem explains when CoT helps: it is essential when a problem's serial depth exceeds the model's fixed number of layers, and it provides little benefit on parallelizable problems already within AC⁰/TC⁰.
The authors validate the theory on four tasks chosen specifically for their serial computation requirements:
| Task | Serial Depth | Description | Why It's Hard |
|---|---|---|---|
| Modular Addition | O(n) | Sum k numbers mod p | Carry propagation requires sequential steps |
| Permutation Composition | O(k) | Compose k permutations on n elements | Each composition depends on previous result |
| Iterated Squaring | O(k) | Compute x^(2^k) mod p | Each squaring depends on previous square |
| Circuit Value Problem | O(depth) | Evaluate boolean circuit | Gate outputs propagate through layers |
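Reference solutions for two of these tasks make the serial structure explicit (a minimal sketch; function names and input formats are chosen for illustration, not taken from the paper's setup): in both loops, step t cannot begin until step t-1 has produced its result.

```python
def iterated_squaring(x, k, p):
    """Compute x**(2**k) mod p by k successive squarings."""
    for _ in range(k):
        x = (x * x) % p   # each squaring depends on the previous square
    return x

def compose_permutations(perms):
    """Compose permutations (given as index tuples) left to right."""
    result = tuple(range(len(perms[0])))
    for perm in perms:
        # each composition depends on the previously accumulated permutation
        result = tuple(perm[i] for i in result)
    return result

# iterated_squaring(3, 3, 7) computes 3**8 mod 7 -> 2
# compose_permutations([(1, 0), (1, 0)]) -> (0, 1), a swap applied twice
```

A k-step CoT transcript for either task simply writes out the loop's intermediate values, one per step.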
Across all four tasks, transformers with CoT significantly outperformed direct-answer transformers:
| Configuration | Direct Answer | With CoT | Improvement |
|---|---|---|---|
| 2-layer transformer, 5-permutation composition | ~0% | ~95% | +95% |
| 2-layer transformer, iterated squaring k=8 | Random | High accuracy | Dramatic |
| 4-layer transformer, circuit depth 16 | ~50% | ~90% | +40% |
A crucial empirical finding: shallow transformers + CoT match or exceed deep transformers without CoT. This suggests CoT is not just a prompting trick but a fundamental architectural enhancement.
The experiments confirm that CoT length should match the problem's serial depth: with too few steps, the model cannot complete the required sequential computation.
This explains why "let's think step by step" works: it encourages the model to generate enough intermediate tokens to complete the serial computation.
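For the modular addition task, a CoT transcript that matches the serial depth looks like the following sketch (the transcript format is an assumption for illustration): one partial sum is emitted per step, so summing k numbers needs k steps.

```python
def modular_addition_cot(numbers, p):
    """Produce a CoT-style transcript for summing numbers mod p:
    one intermediate token (a partial sum) per serial step."""
    transcript = []
    acc = 0
    for x in numbers:
        acc = (acc + x) % p
        transcript.append(acc)   # the model would emit this as a token
    return transcript

# Summing [5, 9, 4] mod 7 takes 3 steps: transcript is [5, 0, 4]
```

The transcript length equals the number of summands, which is exactly the O(n) serial depth listed in the task table above.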
| Design Choice | Without CoT | With CoT |
|---|---|---|
| Required depth | Must scale with problem serial depth | Constant depth sufficient |
| Inference cost | Fixed per token | Scales with reasoning steps |
| Problem coverage | Limited to AC⁰/TC⁰ | Up to P/poly |
| Training signal | Final answer only | Intermediate steps provide supervision |
This paper provides theoretical grounding for "test-time compute scaling"—the observation that spending more compute at inference (via CoT) can substitute for larger models. The mechanism is now clear: CoT steps provide the serial depth that model layers cannot.
A shallow model with 100 CoT steps can solve problems that would require ~100 layers to solve directly. This has profound implications for efficient deployment: train a smaller model, use CoT at inference for complex problems.
This paper provides the first rigorous theoretical explanation for why chain-of-thought prompting improves LLM reasoning. The core insight is elegant: CoT converts parallel architectures into serial computers by using intermediate tokens as a "scratchpad" for sequential computation.
Key takeaways for practitioners:

- Use CoT on tasks with sequential dependencies; it adds little on parallelizable tasks
- Scale the number of CoT steps with the problem's serial depth
- A shallow model with CoT can substitute for a much deeper model without it
The era of "just make the model bigger" may give way to "make the model think longer"—and this paper explains why that works.
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Li, Liu, Zhou, Ma — February 2024