The Evolution of LLM Reasoning

A metro map of how language models learned to think step by step

From Chain-of-Thought to Large Reasoning Models • 2022–2025

Chain-of-Thought
Action & Agents
Tree & Search
Program-Based
Surveys & Meta
[Metro map: papers plotted by year (2022–2025) along each line — CoT (17.7k cites), Self-Consistency (4.2k), Least-to-Most (1.4k), CoT w/o Prompting (168), System 1→System 2 Survey (2025), ReAct (5.9k), Reflexion (2.9k), LRM Survey (2025), Tree of Thoughts (4.9k), LLM-Guided ToT (314), Efficient Reasoning Survey (2025), Program of Thoughts (1.1k)]

🚀 The Frontier: What's Next

The reasoning landscape is evolving rapidly. Here are the key trends shaping 2025 and beyond.

Test-Time Compute Scaling

The new scaling law: more thinking tokens = better results. o1/o3 and R1 prove you can trade inference cost for capability. The frontier is now efficiency—getting the same gains with fewer tokens.

o1 vs Gemini on same task: 138K vs 96 tokens

🔄 Verify-then-Generate

The winning pattern: generate candidates, verify with a separate model/tool, iterate. External verifiers (code execution, formal proofs, unit tests) unlock gains independent of base model quality.

Reflexion on HumanEval: +24% with self-verification

🧬 Reasoning Distillation

Large reasoning models (o1, R1) generate training data for smaller models. DeepSeek-R1-Distill-Qwen-7B matches GPT-4o on math. The moat isn't the model—it's the reasoning traces.

7B distilled vs 70B base: Often comparable

🎯 Adaptive Compute

Not all problems need the same thinking budget. The next frontier: models that know when to think longer. Easy questions get fast answers; hard ones get extended reasoning.

Goal: Right-size thinking per query

🔗 Tool-Augmented Reasoning

Pure language reasoning hits limits. The winners combine LLM reasoning with code execution (PoT), search (ReAct), and formal verification. Hybrid systems dominate benchmarks.

PoT vs CoT on GSM8K: +22% with code execution

📉 The Efficiency Crisis

"Slow thinking" generates massive token counts. Research explores: compressed CoT (say less), distillation (smaller models), speculative decoding (faster generation). Cost per reasoning step must drop 10-100x.

Current challenge: $0.10+ per complex query
The Bottom Line: We've proven LLMs can reason. The 2025 frontier is making that reasoning efficient, reliable, and verifiable. The techniques on this page are the building blocks.

🎯 The Decision Framework: Which Technique Should You Use?

Don't just learn these techniques—know when to use each one. This decision tree distills 10 papers into actionable guidance.

What type of problem are you solving?
Math/Computation
Does it require calculation?
Yes
→ Program of Thoughts

Let Python do the math. +22% on GSM8K.

No, just reasoning
→ Chain-of-Thought

Step-by-step natural language reasoning.

Needs External Info
Can model hallucinate facts?
No, need real data
→ ReAct

Interleave reasoning with tool calls.

Can self-correct
→ Reflexion

Try, fail, reflect, retry. +24% on HumanEval.

Puzzles/Planning
Might first approach be wrong?
Yes, need backtracking
→ Tree of Thoughts

Explore multiple paths. 4% → 74% on Game of 24.

No, path is clear
→ Chain-of-Thought

Single path reasoning is enough.

Complex/Compositional
→ Least-to-Most

Decompose, solve easy→hard. 99.7% on SCAN.
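
The same branching can be written as a small routing helper. A minimal sketch, assuming you already classify queries into the four problem types above; the category labels, flags, and technique names are illustrative, not from any library.

def choose_technique(problem_type, needs_calculation=False,
                     can_self_correct=False, may_need_backtracking=False):
    # Map the decision tree above onto a technique name
    if problem_type == "math":
        return "program-of-thoughts" if needs_calculation else "chain-of-thought"
    if problem_type == "external-info":
        # Reflexion if the model can evaluate and retry its own attempts,
        # ReAct if it must ground answers in real data via tools
        return "reflexion" if can_self_correct else "react"
    if problem_type == "puzzle-planning":
        return "tree-of-thoughts" if may_need_backtracking else "chain-of-thought"
    if problem_type == "compositional":
        return "least-to-most"
    return "chain-of-thought"  # sensible default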

⚡ The Quick Reference Matrix

Technique | Best For | Cost | Key Gain | Don't Use When
CoT | Multi-step reasoning | 1x | +40% GSM8K | Simple lookups, pattern matching
Self-Consistency | High-stakes accuracy | 10-40x | +16% over CoT | Budget-constrained, latency-critical
Least-to-Most | Compositional tasks | 2-5x | +84% SCAN | Non-decomposable problems
ReAct | Tool use, grounding | 3-10x | +10% FEVER | Closed-book reasoning
Reflexion | Iterative improvement | 2-5x | +24% HumanEval | No feedback signal available
Tree of Thoughts | Search/exploration | 10-100x | +70% Game24 | Simple problems, cost matters
Program of Thoughts | Math-heavy tasks | 1x + exec | +22% GSM8K | Non-computational reasoning

🚨 The Mistakes Everyone Makes

❌ Using CoT for everything

The mistake: Applying "think step by step" to simple factual questions.

Why it's wrong: CoT adds latency and can confabulate reasoning for simple lookups. "What's the capital of France?" doesn't need step-by-step.

Fix: Classify queries first. Use CoT only for multi-step reasoning.

❌ ToT on simple problems

The mistake: Using Tree of Thoughts for problems where CoT suffices.

Why it's wrong: ToT is 10-100x more expensive. If CoT works, you're burning money.

Fix: Try CoT first. Only use ToT when CoT consistently fails.

❌ Reflexion without feedback

The mistake: Trying to use Reflexion without a way to evaluate attempts.

Why it's wrong: Reflexion needs signal. No feedback = no learning = just burning tokens.

Fix: Ensure you have tests, verifiers, or ground truth before using Reflexion.

❌ Self-Consistency with temp=0

The mistake: Running Self-Consistency with greedy decoding.

Why it's wrong: Temperature=0 produces identical outputs. You're just generating the same answer N times.

Fix: Use temperature 0.5-0.7 for diversity.

❌ PoT for non-computational tasks

The mistake: Generating code for commonsense reasoning questions.

Why it's wrong: "Is a penguin a bird?" doesn't benefit from Python. The model will generate silly code.

Fix: Only use PoT when there's actual computation to offload.

❌ No max_steps in ReAct

The mistake: Running ReAct agents without step limits.

Why it's wrong: Agents can loop forever, searching repeatedly without finishing.

Fix: Always set max_steps (5-10 for most tasks).

🏭 Production Playbook

1
Start simple, scale complexity

Begin with basic prompting → add CoT if accuracy is low → add Self-Consistency for high-stakes → consider ToT only for genuinely hard problems. Most production systems don't need ToT.

2
Build a query classifier

Not all queries need the same treatment. A lightweight classifier (or even a regex) can route simple queries to fast paths and complex queries to expensive techniques. A regex-based router is sketched after this list.

3
Combine techniques strategically

PoT + Self-Consistency: Generate multiple code solutions, vote on outputs. ReAct + Reflexion: Retry failed tool calls with reflection. The techniques compose.

4
Measure what matters

Track: accuracy, latency (p50/p95/p99), cost per query, failure modes. A technique that's +5% accuracy but +1000% cost may not be worth it.
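
The router promised in step 2 can be as simple as a keyword check. A minimal sketch; the keyword list and length threshold are illustrative assumptions, and anything routed to "cot" can still be escalated later to the more expensive techniques.

import re

MULTI_STEP_CUES = re.compile(
    r"\b(calculate|how many|how much|total|average|per|compare|if|then|steps?)\b", re.I)

def route(query):
    # Short queries with no multi-step cues go to the fast path
    if len(query.split()) < 12 and not MULTI_STEP_CUES.search(query):
        return "direct"   # e.g. "What's the capital of France?"
    return "cot"          # multi-step reasoning gets the more expensive path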

Chain-of-Thought Line

CoT 📚 17,700+ citations

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei et al. (inc. Denny Zhou) • Google Brain • NeurIPS 2022

The foundational paper that started it all. Adding few-shot examples with explicit step-by-step reasoning dramatically boosts performance on arithmetic, symbolic, and commonsense reasoning.

The Core Idea: Why This Paper Changed Everything

Before CoT, the standard approach was direct prompting: "Q: What is 23 × 17? A: 391". The model had to make a single forward pass and hope it had "memorized" enough similar patterns. This paper showed that by asking the model to show its work, you unlock reasoning capabilities that simply don't exist in direct prompting.

💡 Key Learning

LLMs don't actually "reason" in a single forward pass—they pattern match. CoT works because it breaks complex problems into simpler pattern-matching steps that the model can do in one pass. Each intermediate step is easier than the whole problem.

How It Works: Step by Step

1
Provide Few-Shot Examples with Reasoning
Instead of just showing input→output pairs, show input→reasoning→output. The model learns the pattern of thinking through problems.
2
Model Generates Intermediate Steps
When given a new problem, the model follows the demonstrated pattern and generates its own reasoning chain before the answer.
3
Each Step Becomes Context for the Next
As the model generates each reasoning step, that text becomes part of the context for generating the next step—effectively "working memory."
4
Final Answer Follows from Reasoning
The answer is now a simple extraction from the reasoning chain, not a complex leap from the original problem.

Example Prompt: Standard vs Chain-of-Thought

❌ Standard Prompting

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: 11
Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:

✓ Chain-of-Thought Prompting

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:
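
In code, few-shot CoT is just string assembly plus a final-answer parse. A minimal sketch, assuming the same hypothetical llm.generate helper used in later snippets on this page; the demonstration and the "The answer is" parsing convention follow the prompt above.

import re

COT_EXAMPLE = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def cot_answer(question, llm):
    response = llm.generate(COT_EXAMPLE + f"Q: {question}\nA:")
    match = re.search(r"The answer is\s*(-?\d[\d,]*(?:\.\d+)?)", response)  # parse the final number
    return match.group(1) if match else response.strip()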

Benchmark Results

Benchmark | Standard | CoT | Gain
GSM8K (math) | 17.9% | 58.0% | +40.1%
SVAMP (math) | 58.8% | 78.2% | +19.4%
ASDiv (math) | 71.3% | 78.6% | +7.3%
AQuA (algebra) | 25.2% | 35.8% | +10.6%
StrategyQA | 65.4% | 73.2% | +7.8%

Why It Works: The Theoretical Insight

Transformers are inherently parallel architectures—they process all tokens simultaneously through a fixed number of layers. Without intermediate tokens, a single forward pass can only compute functions in the complexity class TC⁰ (constant-depth threshold circuits).

Many problems (like multi-digit arithmetic) are inherently serial—you must compute intermediate values before the final answer. CoT effectively gives the transformer "extra layers" through the sequential generation of tokens.

The Deep Insight

CoT length should match the problem's "serial depth." A 10-step calculation needs ~10 reasoning steps. This explains why CoT helps arithmetic but barely affects pattern matching.

Key Discoveries

  • Emergent ability: CoT only helps models ≥100B parameters. Smaller models generate plausible-looking but wrong reasoning.
  • 8 examples suffice: More examples don't help much after 8; the model gets the pattern.
  • Quality over quantity: Better reasoning examples matter more than more examples.
  • Zero-shot works: Just adding "Let's think step by step" works (but worse than few-shot).
  • Errors cascade: One wrong step usually means wrong final answer.

Limitations & When NOT to Use

  • Simple lookups: "What is the capital of France?" doesn't need reasoning.
  • Pattern matching: Tasks solvable by retrieval don't benefit.
  • Small models: Below 60B parameters, CoT often hurts performance.
  • Time-sensitive apps: Generating reasoning adds latency and cost.
  • Factual questions: CoT can confabulate convincing-but-wrong reasoning.

Practical Implementation Tips

🎯 Tip 1: Match Step Granularity

Your example reasoning should be at the right granularity. Too coarse = model skips important steps. Too fine = unnecessary verbosity and cost.

🎯 Tip 2: Include Diverse Examples

Cover different reasoning patterns in your examples. If all examples use the same approach, the model won't generalize well.

🎯 Tip 3: Verify the Reasoning

A correct final answer doesn't mean correct reasoning. Check intermediate steps—bad reasoning can get lucky.

Self-Consistency 📚 4,200+ citations

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang et al. (inc. Denny Zhou) • Google Research • ICLR 2023

Instead of greedy decoding, sample multiple diverse reasoning paths and take a majority vote. Turns CoT into an ensemble method with significant accuracy gains.

The Core Insight: Wisdom of Crowds for Reasoning

Self-Consistency is beautifully simple: correct reasoning paths are more likely to agree on the answer than incorrect ones. If you ask a model to solve a problem multiple times with some randomness, the correct answer should appear more often than any particular wrong answer.

💡 Key Learning

This technique turns CoT into an ensemble method. Different reasoning paths are like different "voters"—errors are random and cancel out, while the truth converges. It's the same principle behind random forests and ensemble learning.

How It Works: The Algorithm

1
Sample Multiple Reasoning Paths
Instead of greedy decoding (temperature=0), use temperature 0.5-0.7 to generate diverse reasoning chains. Sample N times (typically 5-40).
2
Extract Final Answers
Parse the final answer from each reasoning chain. The reasoning paths may differ wildly, but we only care about the final answer.
3
Majority Vote
Return the most frequent answer. Ties can be broken randomly or by taking the first occurrence.
Problem Path 1: ... → 42
Problem Path 2: ... → 42
Problem Path 3: ... → 37
Problem Path 4: ... → 42
────────────────────────────────
Majority Vote: 42 (3/4)

Benchmark Results

Benchmark | CoT | Self-Cons. | Gain
GSM8K | 58.0% | 74.4% | +16.4%
SVAMP | 78.2% | 86.6% | +8.4%
AQuA | 35.8% | 48.0% | +12.2%
StrategyQA | 73.2% | 79.1% | +5.9%
ARC-c | 85.2% | 88.7% | +3.5%

Why It Works: The Math

Assume the model has probability p of generating a correct reasoning path. With majority voting over n samples, the probability of a correct final answer approaches 1 as n increases (if p > 0.5).

The Statistics

If each sample has 60% accuracy independently, 5 samples give ~68% accuracy, 10 samples give ~75%, and 40 samples give ~83%. This is the Condorcet jury theorem applied to reasoning.

Crucially, different reasoning paths make different mistakes. Path A might misread a number and Path B might apply the wrong formula, but the correct answer emerges from the noise.

The Scaling Curve

Performance scales logarithmically with samples:

  • 1 sample: Baseline CoT accuracy
  • 5 samples: ~70% of potential gain
  • 10 samples: ~85% of potential gain
  • 40 samples: ~95% of potential gain
  • 100+ samples: Diminishing returns
💡 Practical Insight

For production, 5-10 samples is the sweet spot—most of the accuracy gain with manageable cost. Use 40+ only when accuracy is critical and cost isn't.

Implementation Details

from collections import Counter

def self_consistency(prompt, n_samples=10, temp=0.7):
    answers = []
    for _ in range(n_samples):
        response = llm.generate(prompt, temperature=temp)
        answer = extract_answer(response)
        answers.append(answer)
    # Majority vote
    vote = Counter(answers).most_common(1)[0][0]
    return vote

Critical Parameters

  • Temperature 0.5-0.7: Too low = identical paths, too high = nonsense
  • Top-p 0.95: Helps diversity without degeneracy
  • Parallel sampling: All samples can run concurrently
  • Answer extraction: Must reliably parse final answer (regex, delimiter)

Limitations & Gotchas

  • Cost: 10x samples = 10x tokens = 10x cost
  • Systematic errors: If the model is systematically wrong, voting won't help
  • Answer space: Works best when answers are discrete (numbers, multiple choice)
  • Open-ended tasks: Hard to vote on free-form text generation
  • Parse failures: Must handle cases where answer can't be extracted

Advanced Variations

📊 Weighted Voting

Weight votes by model confidence (log probability of the reasoning chain). Paths the model is more "sure" about get more weight.

🔄 Verifier Models

Train a separate model to score reasoning quality. Use scores instead of raw votes. OpenAI's approach for math reasoning.

🎯 Best-of-N

Instead of voting, use a reward model to pick the best single response. More flexible for open-ended tasks.
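
As a concrete illustration of the first variation, here is a weighted-voting sketch. The return_logprobs flag, the result.text and result.logprobs fields, and the extract_answer helper are assumptions about your LLM client, not a specific API.

import math
from collections import defaultdict

def weighted_self_consistency(prompt, llm, n_samples=10, temp=0.7):
    scores = defaultdict(float)
    for _ in range(n_samples):
        result = llm.generate(prompt, temperature=temp, return_logprobs=True)  # assumed API
        answer = extract_answer(result.text)
        confidence = math.exp(sum(result.logprobs) / len(result.logprobs))     # mean token probability
        scores[answer] += confidence                                           # weighted vote
    return max(scores, key=scores.get)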

Least-to-Most 📚 1,400+ citations

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Zhou et al. • Google Research • ICLR 2023

Decompose complex problems into simpler subproblems, solve from easiest to hardest. Each solution feeds into the next, enabling better generalization to harder problems.

The Key Insight: Structured Decomposition Beats Linear Reasoning

CoT asks the model to think through a problem in one shot—it figures out the structure while solving. Least-to-Most separates decomposition from solving: first figure out what subproblems exist, then solve them in order from simplest to most complex.

💡 Key Learning

The magic is in the ordering: solving easier subproblems first provides scaffolding for harder ones. Each solution becomes context for the next, building up to the final answer. This mimics how humans tackle complex problems—break them down, start simple.

The Two-Stage Process

1
Decomposition Stage
Ask: "To solve [original problem], what subproblems do we need to solve first?" The model outputs an ordered list from simplest to most complex.
2
Sequential Solving Stage
Solve subproblem 1 → add solution to context → solve subproblem 2 with that context → repeat until the final (hardest) subproblem is solved.
Original Problem
↓ Decompose
Sub-1 (easy) Sub-2 Sub-3 (hard)
↓ Solve sequentially
Solve Sub-1 +context → Solve Sub-2 +context → Solve Sub-3

Final Answer

Example: Symbol Mapping (SCAN)

Task "jump around left twice" → actions Stage 1: Decomposition Q: "jump around left twice" can be solved by first solving what subproblems? A: We need to solve: 1. "left" (simplest) 2. "jump left" 3. "jump around left" 4. "jump around left twice" (original) Stage 2: Solving Q: "left" → ? A: TURN LEFT Q: "left" means TURN LEFT. "jump left" → ? A: TURN LEFT JUMP Q: "left" means TURN LEFT. "jump left" means TURN LEFT JUMP. "jump around left" → ? A: TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP Q: [all previous context] "jump around left twice" → ? A: [full solution]

Benchmark Results

Benchmark | CoT | Least-to-Most | Gain
SCAN (length gen.) | 16.0% | 99.7% | +83.7%
DROP (reading comp.) | 78.7% | 82.3% | +3.6%
GSM8K (math) | 58.0% | 62.4% | +4.4%
CFQ (compositional) | 33.1% | 94.3% | +61.2%
The SCAN Result is Remarkable

CoT completely fails at length generalization (16%)—if the test sequence is longer than training examples, it can't cope. L2M achieves 99.7% by learning to decompose rather than memorize.

Why Length Generalization Works

Standard CoT fails on problems longer than training examples because it learns to match patterns, not procedures. L2M forces the model to learn compositional rules:

  • Base cases: Simple primitives like "left" → TURN LEFT
  • Composition rules: How to combine primitives
  • Recursive application: Apply rules at any depth
💡 Key Insight

L2M essentially teaches the model a recursive algorithm through examples rather than a fixed pattern. This is why it generalizes to longer inputs—the algorithm works at any length.

When to Use Least-to-Most

  • Compositional tasks: Where the answer is built from smaller pieces
  • Length generalization: Test inputs longer than training
  • Hierarchical problems: Natural parent-child structure
  • Complex word problems: Multiple interdependent steps
  • Symbolic reasoning: Rule-based transformations

Comparison: CoT vs Least-to-Most

Aspect | Chain-of-Thought | Least-to-Most
Passes | Single pass | Two-pass (decompose + solve)
Structure | Implicit | Explicit decomposition
Context growth | Linear | Grows with each subproblem
Length gen. | Poor | Excellent
Cost | Lower | Higher (multiple calls)
Best for | Simple reasoning | Compositional tasks

Implementation Considerations

  • Decomposition prompt: "To solve X, what subproblems need to be solved first?"
  • Ordering is critical: Must go from easy → hard, not arbitrary
  • Context accumulation: Each solution is added to context for next subproblem
  • Multiple API calls: One for decomposition, one per subproblem
  • Context length: Can grow large—may need truncation strategy
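
A minimal sketch of the two-stage loop described above, assuming the same hypothetical llm.generate helper used elsewhere on this page; the decomposition prompt and line-based parsing are illustrative.

def least_to_most(problem, llm):
    # Stage 1: decomposition
    decomposition = llm.generate(
        f'To solve "{problem}", what subproblems need to be solved first? '
        "List them from simplest to hardest, one per line.")
    subproblems = [line.strip("-• 0123456789.").strip()
                   for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve sequentially, feeding each solution back into the context
    context, answer = "", ""
    for sub in subproblems + [problem]:
        answer = llm.generate(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"
    return answer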

Limitations

When NOT to Use

  • Simple problems: Overhead isn't worth it.
  • Non-compositional tasks: If the problem doesn't decompose naturally, L2M won't help.
  • Latency-sensitive apps: Multiple API calls add delay.
  • Ambiguous decomposition: If there's no clear subproblem structure, the decomposition stage may fail.

CoT w/o Prompting 📚 168 citations

Chain-of-Thought Reasoning Without Prompting

Wang & Zhou et al. • Google DeepMind • 2024

Discovers that CoT reasoning can emerge naturally without demonstrations by decoding with alternative top-k tokens. The model already "knows" how to reason—you just need to find it.

The Surprising Discovery: Models Already Know How to Reason

The standard assumption was that CoT prompting teaches the model to reason. This paper overturns that: the reasoning capability already exists in pretrained models—it's just not activated by default greedy decoding.

💡 The Key Insight

When you decode greedily (always pick the highest probability token), the model often jumps straight to an answer. But if you look at the 2nd, 3rd, or 10th most likely next token, you often find the start of a reasoning chain. The model "knows" it should reason—it just doesn't do it by default.

How It Works: CoT-Decoding

1
Ask a Question (No CoT Prompt)
Just ask "Q: What is 15 × 7?" without any "think step by step" instruction.
2
Explore Top-k First Tokens
Instead of greedy decode, look at the top k (e.g., 10) most likely first tokens. Some will be direct answers ("105"), others will be reasoning starters ("Let", "First", "15").
3
Continue Each Path
Decode each path to completion. The paths that start with reasoning tokens naturally generate CoT-style reasoning.
4
Select Based on Confidence
Answers generated via CoT paths tend to have higher confidence. Select the path where the model is most confident in the final answer.

What They Found in the Decoding Space

Q: "What is 15 × 7?"

Top-1: "105" (direct answer)
Top-2: "The" → "The answer is 105"
Top-3: "15" → "15 × 7 = 105"
Top-5: "Let" → "Let me calculate: 15 × 7 = 105" ✓
Top-8: "First" → "First, 15 × 7... = 105" ✓

The reasoning paths exist—they're just not the highest probability by default. Prompting with "think step by step" shifts probability mass toward these paths.
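
A rough sketch of CoT-decoding. The llm.top_first_tokens, llm.continue_from, and extract_answer_with_probs helpers are hypothetical stand-ins for an API that exposes top-k candidates and per-token probabilities; they are not the paper's code.

def cot_decode(question, llm, k=10):
    prompt = f"Q: {question}\nA:"
    best_answer, best_confidence = None, -1.0
    for token in llm.top_first_tokens(prompt, k=k):        # k alternative first tokens
        completion = llm.continue_from(prompt + token)      # greedily decode the rest of this path
        answer, token_probs = extract_answer_with_probs(completion)
        confidence = sum(token_probs) / len(token_probs)    # average probability of the answer tokens
        if confidence > best_confidence:
            best_answer, best_confidence = answer, confidence
    return best_answer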

Key Experimental Results

  • CoT paths exist: ~40% of questions had a CoT path in top-10 alternatives
  • Correlation with correctness: When CoT path exists, the answer is more often correct
  • Higher confidence: Model assigns higher probability to final answer when CoT is present
  • No prompting needed: The same reasoning emerges without any "think step by step" instruction

Why This Matters: Deep Implications

💡 Implication 1: Reasoning is Emergent

Models learn to reason from pretraining data (math textbooks, Stack Overflow, etc.). CoT prompting doesn't teach reasoning—it activates existing capability.

💡 Implication 2: Decoding Strategy Matters

Greedy decoding may be leaving performance on the table. Alternative decoding strategies (beam search variants, top-k exploration) could unlock hidden capabilities.

💡 Implication 3: Explains Emergence

This may explain why CoT only works at scale: smaller models may not have learned reasoning from pretraining data, so there's no latent capability to activate.

Practical Applications

  • Confidence calibration: Presence of CoT in alternatives indicates the model is confident
  • Automatic CoT detection: Check if reasoning paths exist in top-k to decide if a question needs more careful handling
  • Better decoding: Can bias decoding toward reasoning tokens for improved accuracy
  • Research direction: Opens new research into decoding-time interventions

Limitations

  • Compute intensive: Exploring top-k requires k× more decoding
  • Not always present: CoT paths only exist ~40% of the time
  • Smaller models: Effect is weaker in smaller models (they may not have learned reasoning)
  • Complex questions: For very complex problems, even the CoT paths may be wrong

Action & Agents Line

ReAct 📚 5,886 citations

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al. • Princeton + Google • ICLR 2023

The foundation of modern AI agents. Interleaves reasoning traces with actions (search, lookup, etc.), allowing models to plan, act, observe, and revise. Thought → Action → Observation loops.

The Foundational Insight: Reasoning and Acting Must Be Interleaved

Before ReAct, there were two separate paradigms: Chain-of-Thought (pure reasoning, no actions) and Action-Only (just execute actions, no explicit reasoning). Both had limitations—CoT couldn't gather new information, and action-only was uninterpretable and error-prone.

💡 The Key Learning

Humans don't think-then-act or act-then-think—we interleave them continuously. "I need to find X" → search → "Found Y, not X" → revise plan → search again. ReAct captures this interleaved process, making agents both more capable and more interpretable.

The ReAct Loop: Thought → Action → Observation

T
Thought
The model reasons about what to do next: "I need to find when Obama was born." This is explicit internal reasoning.
A
Action
The model chooses an action from the available toolkit: Search[Barack Obama], Lookup[born], Calculate[2024-1961], etc.
O
Observation
The environment returns a result: "Barack Obama was born August 4, 1961..." This feeds back into the next Thought.
Example ReAct Trace

Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?

Thought 1: I need to search Colorado orogeny to find the area that the eastern sector extends into.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building... The eastern sector extends into the High Plains.

Thought 2: The eastern sector extends into the High Plains. I need to search High Plains and find its elevation range.
Action 2: Search[High Plains]
Observation 2: High Plains refers to one of two distinct land regions: the High Plains in the United States...elevation range of 1,800 to 7,000 ft.

Thought 3: High Plains rise in elevation from around 1,800 to 7,000 ft, so the answer is 1,800 to 7,000 ft.
Action 3: Finish[1,800 to 7,000 ft]

Benchmark Results

Benchmark | CoT Only | Act Only | ReAct
HotpotQA (QA) | 28% | 25% | 34%
FEVER (fact-check) | 56% | 58% | 64%
ALFWorld (games) | N/A | 45% | 71%
WebShop (shopping) | N/A | 30% | 40%

Why Interleaving Works

  • Reasoning grounds actions: Thoughts prevent random or hallucinated actions
  • Actions ground reasoning: Real observations prevent hallucinated facts
  • Error recovery: When observations don't match expectations, thoughts can revise the plan
  • Interpretability: The thought trace shows exactly why each action was taken
💡 The Synergy

CoT alone hallucinates facts it doesn't know. Act-only makes random mistakes with no way to recover. ReAct combines the planning of CoT with the grounding of actions—each compensates for the other's weakness.

Standard Action Space

# ReAct's original action space for QA tasks:
Search[query]      # Search Wikipedia for a topic
Lookup[keyword]    # Find keyword in current page
Finish[answer]     # Complete task with final answer

# Extended action spaces in practice:
Calculate[expr]        # Evaluate mathematical expression
Code[python]           # Execute Python code
Browse[url]            # Navigate to a URL
API[endpoint, params]  # Call external API

The Modern Impact: Every Agent Framework Uses This

  • LangChain Agents: Direct implementation of ReAct loop
  • AutoGPT/BabyAGI: Extended ReAct with memory and task decomposition
  • OpenAI Function Calling: Structured version of Action step
  • Claude Tools: Same pattern—interleave reasoning with tool use
  • Microsoft Copilot: ReAct-style reasoning for code actions

Implementation Pattern

def react_loop(question, tools, max_steps=10):
    context = f"Question: {question}\n"
    for step in range(max_steps):
        # Generate Thought + Action
        response = llm.generate(context + "Thought:")
        thought, action = parse_thought_action(response)
        context += f"Thought {step}: {thought}\nAction {step}: {action}\n"
        # Execute action and get observation
        if action.startswith("Finish"):
            return extract_answer(action)
        observation = tools.execute(action)
        context += f"Observation {step}: {observation}\n"
    return "Failed to find answer"

Common Failure Modes

  • Infinite loops: Agent keeps searching without finishing (need max steps)
  • Hallucinated actions: Agent calls tools that don't exist (need validation)
  • Ignored observations: Agent ignores what it finds and proceeds with prior beliefs
  • Premature finish: Agent finishes before gathering enough info
  • Overthinking: Too many thoughts without action slows progress
Production Tip

Always set max_steps and validate action format. Most failures come from agents getting stuck in loops or calling invalid actions. Structured output (JSON) for actions helps reliability.
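
One way to implement that tip, sketched with a whitelist and a regex; the allowed names match the QA action space above, and anything that fails to parse is treated as a failed step.

import re

ALLOWED_ACTIONS = {"Search", "Lookup", "Finish"}

def parse_and_validate_action(text):
    # Accept only whitelisted actions of the form Name[argument]
    match = re.match(r"(\w+)\[(.*)\]$", text.strip())
    if not match or match.group(1) not in ALLOWED_ACTIONS:
        return None   # caller should re-prompt; the step still counts toward max_steps
    return match.group(1), match.group(2)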

ReAct vs Alternatives

Aspect | CoT Only | Act Only | ReAct
Can use tools? | ❌ No | ✓ Yes | ✓ Yes
Interpretable? | ✓ Yes | ❌ No | ✓ Yes
Error recovery? | ❌ No | ❌ No | ✓ Yes
Hallucination risk | High | Low | Low
Token cost | Low | Low | Medium
Latency | Low | Medium | High
Reflexion 📚 2,855 citations

Reflexion: Language Agents with Verbal Reinforcement Learning

Shinn et al. • Northeastern + MIT • NeurIPS 2023

Agents learn from failure by reflecting in natural language. After a failed attempt, the agent generates self-critique and stores it as memory for future tries. Verbal RL without weight updates.

The Key Insight: Learning from Failure in Natural Language

Traditional RL uses scalar rewards (success/failure, +1/-1) to update model weights. But LLMs can do something remarkable: they can reflect on why they failed in natural language and use that as context for the next attempt—no weight updates needed.

💡 The Breakthrough

Instead of training: "Trial 1 failed (reward=-1)", Reflexion does: "Trial 1 failed because I forgot to check the edge case where n=0. Next time, I should add explicit handling for empty inputs." This rich linguistic signal is far more informative than a scalar reward.

The Reflexion Loop

1
Act: Make an Attempt
The agent tries to complete the task (write code, navigate environment, answer question).
2
Evaluate: Get Feedback
Run tests, check answer, or get environment feedback. Binary (pass/fail) or detailed error messages.
3
Reflect: Analyze Failure
Generate a natural language reflection: "What went wrong? Why? What should I do differently?"
4
Store: Add to Memory
Store the reflection in episodic memory. This persists across attempts and tasks.
5
Retry: Attempt with Context
Try again with reflections in context. The agent now "remembers" what not to do.

Example: Coding Task with Reflexion

Trial 1
Task: Write a function to find the longest palindromic substring.

def longest_palindrome(s):
    # My attempt
    for length in range(len(s), 0, -1):
        for i in range(len(s) - length + 1):
            substr = s[i:i+length]
            if substr == substr[::-1]:
                return substr

Test Result: FAILED - Test case "" expected "", got None

Reflection
I failed because my function doesn't handle the empty string case. When s is empty, the for loop doesn't execute and None is returned implicitly. I should add a check for empty input at the start.

Trial 2 (with reflection in context)

def longest_palindrome(s):
    if not s:  # Handle empty string
        return ""
    for length in range(len(s), 0, -1):
        ...

Test Result: PASSED ✓

Benchmark Results

Benchmark | Without Reflexion | With Reflexion | Gain
HumanEval (code) | 67.0% | 91.0% | +24.0%
MBPP (code) | 70.0% | 77.1% | +7.1%
ALFWorld (games) | 75% | 97% | +22%
HotpotQA (QA) | 35% | 49% | +14%
WebShop (shopping) | 40% | 59% | +19%
The HumanEval Result

91% pass@1 on HumanEval with GPT-4 was state-of-the-art at publication—achieved purely through reflection, no fine-tuning. This shows how much performance is left on the table without iterative refinement.

Memory Architecture

  • Short-term memory: Current task trajectory (actions taken, observations received)
  • Episodic memory: Accumulated reflections from past failures
  • Semantic memory: General knowledge (comes from LLM pretraining)
💡 Design Insight

Reflections are stored as natural language "lessons learned." Unlike embeddings or weights, they're human-readable and can be curated/edited. You can even manually add reflections to bootstrap learning.

Why It's "Verbal RL"

Reflexion mirrors key RL concepts but in language:

RL Concept | Reflexion Equivalent
Reward signal | Success/failure feedback
Policy gradient | Natural language reflection
Experience replay | Episodic memory of reflections
Exploration | Different approaches in retries
Weight updates | Context updates (no training)

Implementation Tips

  • Reflection prompt: "Given the error/feedback above, what went wrong and how should I approach this differently?"
  • Memory management: Keep most recent N reflections, summarize older ones
  • Max retries: 2-3 iterations usually sufficient; more rarely helps
  • Task-specific feedback: Code tasks get test output; QA gets ground truth comparison
  • Reflection quality: More specific reflections work better than generic ones
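
Putting those tips together, a minimal sketch of the act → evaluate → reflect → retry loop. The llm.generate helper and the run_tests callback (your task-specific feedback signal) are assumptions, not the paper's code.

def reflexion_loop(task, llm, run_tests, max_retries=3):
    reflections = []                          # episodic memory of lessons learned
    solution = ""
    for attempt in range(max_retries):
        memory = "\n".join(f"Lesson {i + 1}: {r}" for i, r in enumerate(reflections))
        solution = llm.generate(f"{memory}\nTask: {task}\nSolution:")
        passed, feedback = run_tests(solution)            # evaluate: tests, oracle, etc.
        if passed:
            return solution
        reflections.append(llm.generate(                  # reflect on the failure in natural language
            f"Task: {task}\nAttempt:\n{solution}\nFeedback: {feedback}\n"
            "What went wrong, and what should be done differently next time?"))
    return solution                           # best effort after max_retries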

Limitations

  • Need for feedback: Requires some evaluation signal (tests, oracle, etc.)
  • Systematic errors: If model doesn't know the right approach, reflection won't help
  • Compute cost: Multiple attempts per task increases cost
  • Latency: Sequential retries add time
  • Quality ceiling: Eventually converges—more retries don't always help

Reflexion vs Other Self-Improvement Methods

Method | Requires Training? | Multi-attempt? | Memory? | Best For
Self-Consistency | No | Parallel | No | Answer selection
Self-Refine | No | Sequential | No | Output polish
Reflexion | No | Sequential | Yes (reflections) | Learning from failure
RLHF | Yes | N/A | Weights | General improvement

Tree & Search Line

ToT 📚 4,856 citations

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Yao et al. • Princeton + Google DeepMind • NeurIPS 2023

Generalizes CoT to tree search. Generate multiple "thoughts," evaluate them, and explore/backtrack. Enables deliberate planning and lookahead that linear CoT cannot achieve.

The Key Insight: Reasoning as Search, Not a Path

Chain-of-Thought generates a single linear path through the reasoning space. But many problems require exploration—you might go down a wrong path and need to backtrack. Tree of Thoughts models reasoning as search over a tree, enabling the model to explore multiple branches and back up from dead ends.

💡 The Breakthrough

Some problems have no clear solution path until you've tried a few approaches. ToT lets the model say "let me try approach A... that didn't work... let me try approach B" instead of committing to one path. This is how humans solve puzzles.

The Four Components of ToT

1
Thought Decomposition
Define what a "thought" is for this problem—a line of code, a sentence, a partial solution, an equation. The granularity matters: too small = too many branches, too large = loses the benefit.
2
Thought Generator
Generate k candidate next-thoughts from the current state. Either sample (generate k different outputs) or propose (ask LLM to suggest k options in one call).
3
State Evaluator
Use the LLM to assess how promising each state is: "Can this partial solution lead to success?" Returns a score or categorical rating (sure/maybe/impossible).
4
Search Algorithm
Use BFS (breadth-first: explore level by level) or DFS (depth-first: go deep then backtrack) to navigate the tree. Which to use depends on the problem structure.

Example: Game of 24

Problem: Use numbers 4, 5, 6, 10 with +, -, *, / to make 24. Each number used exactly once.

Tree Search

Root: {4, 5, 6, 10}

Level 1 - Generate operations:
├── 10 - 6 = 4  → {4, 4, 5}    [Evaluator: "maybe - two 4s could work"]
├── 10 - 4 = 6  → {5, 6, 6}    [Evaluator: "maybe"]
├── 10 + 6 = 16 → {4, 5, 16}   [Evaluator: "sure - 16 + 4 + 5 - 1 close"]
├── 5 * 4 = 20  → {6, 10, 20}  [Evaluator: "impossible - too big"] ✗ PRUNE
└── ...

Level 2 - Continue from promising nodes:
From {4, 4, 5}:
├── 4 + 4 = 8  → {5, 8}
│   └── 5 * 8 = 40 ✗
├── 4 * 4 = 16 → {5, 16}
│   └── ...
└── 5 - 4 = 1  → {1, 4}
    └── ... nope, backtrack

Eventually find a valid path, e.g. (10 - 6) * 5 + 4 = 24 ✓

Benchmark Results

Task | CoT | ToT (BFS) | Gain
Game of 24 | 4.0% | 74.0% | +70%
Creative Writing | 6.2 | 7.6 | +1.4
Mini Crosswords | 15.6% | 35.4% | +20%
The Game of 24 Result

CoT achieves only 4% on Game of 24 because it commits to operations that might be wrong. ToT achieves 74% by exploring multiple operation sequences and backtracking. This is exactly the type of problem where search beats linear reasoning.

BFS vs DFS: When to Use Which

Strategy | How It Works | Best For
BFS | Explore all level-n nodes before level-n+1 | When solutions are likely at similar depths
DFS | Go deep on one path, backtrack if stuck | When you need to see full solutions to evaluate

Game of 24: Uses BFS because partial equations can be evaluated.
Creative Writing: Uses DFS because you need full paragraphs to judge coherence.

The Evaluator Prompt

# Example evaluator prompt for Game of 24:
Evaluate if given numbers can reach 24 (sure/maybe/impossible).

10 14
10 + 14 = 24
sure

11 12
11 + 12 = 23, 11 * 12 = 132, 12 - 11 = 1, 11 / 12 = 0.91
impossible

4 4 10
4 + 4 + 10 = 18, 4 * 4 - 10 = 6, (10 - 4) * 4 = 24
sure

{input_numbers}
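
The four components combine into a short breadth-first loop. A compact sketch, where propose_thoughts and evaluate_state are hypothetical helpers wrapping prompts like the evaluator above (e.g. mapping sure/maybe/impossible to a numeric score):

def tot_bfs(problem, llm, depth=3, breadth=5, k=3):
    frontier = [problem]                                   # each state = problem + partial reasoning
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(llm, state, k=k):       # thought generator
                candidates.append(state + "\n" + thought)
        scored = [(evaluate_state(llm, s), s) for s in candidates]  # state evaluator
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [s for _, s in scored[:breadth]]                 # keep only the most promising states
    return frontier[0] if frontier else None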

When to Use ToT

  • Combinatorial puzzles: Game of 24, Sudoku, crosswords
  • Creative generation: Where multiple paths could work
  • Planning problems: Route finding, scheduling
  • Hard reasoning: When first attempt likely fails
  • NOT for: Simple QA, factual questions, or problems with clear solution paths

Cost-Benefit Analysis

  • Compute: 10-100x more LLM calls than single CoT
  • Latency: Sequential exploration adds significant time
  • When worth it: Hard problems where CoT fails, high-stakes accuracy
  • When not worth it: Simple problems, latency-critical, budget-constrained
💡 Practical Guidance

Try CoT first. If it fails consistently on a problem type, consider ToT. For production, use ToT selectively on the hardest queries only—use a classifier to decide.

LLM-Guided ToT 📚 314 citations

Large Language Model Guided Tree-of-Thought

Long et al. • 2023

Uses an LLM to guide the tree exploration itself, rather than fixed BFS/DFS. The LLM learns which branches to explore, making search more efficient and targeted.

Key Improvements

  • LLM decides which node to expand next
  • More efficient than blind BFS/DFS
  • Learns from evaluation history
  • Adaptive search depth based on problem

Core Technique

Instead of fixed search strategy, prompt the LLM with the current tree state and ask: "Which thought should we explore next?" The LLM acts as both the thought generator and the search controller.

Program-Based Line

PoT 📚 1,083 citations

Program of Thoughts Prompting: Disentangling Computation from Reasoning

Chen et al. • CMU + Salesforce • 2022

Instead of reasoning in natural language, generate Python code that performs the computation. Execute the code to get the answer. Separates "what to compute" from "how to compute."

The Key Insight: Let the Computer Do the Computing

CoT asks the LLM to both reason about what to compute AND perform the computation. But LLMs are bad at arithmetic—they make mistakes on "what is 37 × 89?". PoT separates these: the LLM writes code describing the computation, then Python executes it exactly.

💡 The Breakthrough

LLMs are reasoning engines, not calculators. When you ask an LLM "what is 127 × 38 + 94?", it's doing pattern matching—and sometimes gets it wrong. But when you ask "write code to compute this", it generates code and Python gets it exactly right every time.

CoT vs PoT: Side by Side

Chain-of-Thought

Q: Janet pays $40/hour. She works 8 hours on Mon, Tue, Wed, and 6 hours on Thu, Fri. Weekly pay?
A: Mon-Wed: 40 × 8 = 320 per day
   Three days: 320 × 3 = 960
   Thu-Fri: 40 × 6 = 240 per day
   Two days: 240 × 2 = 480
   Total: 960 + 480 = 1440
❌ Manual math can have errors

Program of Thoughts

hourly = 40
mon_wed = hourly * 8 * 3
thu_fri = hourly * 6 * 2
total = mon_wed + thu_fri
print(total)  # 1440
✓ Python is always exact
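
A minimal sketch of the PoT loop, assuming the hypothetical llm.generate helper used elsewhere on this page; in production the exec call must run inside a sandbox.

def program_of_thoughts(question, llm):
    code = llm.generate(
        "Write Python code that computes the answer and stores it in a variable named ans.\n"
        f"Question: {question}\nCode:")
    scope = {}
    exec(code, scope)            # never exec untrusted model output outside a sandbox
    return scope.get("ans")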

Benchmark Results

Benchmark | CoT | PoT | Gain
GSM8K (math) | 58.0% | 80.0% | +22%
AQuA (algebra) | 35.8% | 58.0% | +22%
SVAMP (math) | 78.2% | 88.0% | +10%
TabMWP (tables) | 62.0% | 76.8% | +15%

When PoT Excels

  • Multi-step arithmetic: Compound interest, tax calculations
  • Iterative problems: "After N days...", simulation
  • Table/data operations: Filtering, aggregating
  • NOT for: Commonsense reasoning, strategy problems
The Pattern

Use PoT whenever the answer requires computation that Python can execute. If the answer is a number from calculation, PoT beats CoT.

Limitations

  • Execution needed: Must run Python safely (sandbox)
  • Code errors: Syntax/runtime errors still happen
  • Logic errors: Correct math on wrong formula = wrong answer
  • Non-computational tasks: Can't help with reasoning-only problems

2025 Survey Papers: The State of the Art

System 1→2 📚 2025 Survey

From System 1 to System 2: A Survey of Reasoning Large Language Models

Li et al. • Feb 2025

Comprehensive survey covering the transition from "fast thinking" (System 1) to "slow, deliberate reasoning" (System 2) in LLMs. Covers construction methods, benchmarks, and open problems.

The Core Framework: System 1 vs System 2

This survey borrows Daniel Kahneman's dual-process theory to explain LLM evolution:

System 1 (Fast Thinking)
  • Pattern matching on training data
  • Single forward pass per token
  • "What's 2+2?" → "4" (instant)
  • Prone to hallucination on novel tasks
  • GPT-4, Claude 3.5 in standard mode

System 2 (Slow Thinking)
  • Deliberate step-by-step reasoning
  • Many tokens of "thinking" before answer
  • "What's 127×38?" → [long reasoning] → "4826"
  • More accurate on complex reasoning
  • o1, o3, R1, Claude in thinking mode

The Progression: How We Got to Reasoning Models

1
Prompt Engineering Era (2022)
CoT, Self-Consistency, Least-to-Most. External prompts unlock reasoning without changing the model.
2
Supervised Fine-Tuning (2023)
Train on reasoning traces. Models learn to generate CoT by default without prompting.
3
Reinforcement Learning (2024)
RL from outcome feedback. o1 and R1 learn to reason through trial-and-error, not just imitation.
4
Test-Time Scaling (2024-25)
More thinking tokens at inference = better answers. Trading compute for capability.

Key Insight for Practitioners

💡 The Trade-off

System 2 reasoning costs 10-100x more tokens but solves problems System 1 cannot. The skill is knowing when to use each—don't pay System 2 costs for System 1 problems.

Survey Coverage

  • 150+ papers systematically reviewed
  • Construction methods: prompting → SFT → RL → hybrid
  • Benchmarks: GSM8K, MATH, ARC, coding tasks
  • Open problems: efficiency, verification, generalization
LRM Survey 📚 2025 Survey

Towards Large Reasoning Models: A Survey of Reinforced Test-Time Scaling

Xu et al. • Jan 2025

Focuses on the transition from LLMs to "Large Reasoning Models" through test-time scaling. Covers o1-style long CoT, automated data construction, and RL-based learning-to-reason.

The New Paradigm: LLM → LRM

This survey defines the transition from Large Language Models to Large Reasoning Models—models specifically trained to "think" before answering:

Aspect | LLM (GPT-4, Claude) | LRM (o1, R1)
Output style | Direct answer | Thinking trace → answer
Tokens per response | 100s-1000s | 10,000s-100,000s
Training | Next-token prediction | RL on reasoning outcomes
Scaling axis | Parameters (train-time) | Thinking tokens (test-time)
Best for | Creative, conversational | Math, logic, coding

The Test-Time Scaling Revolution

Traditional scaling: "Bigger model = better results" (train-time compute). The LRM insight:

💡 Test-Time Scaling

More thinking tokens at inference = better reasoning, even with the same model. o1 can "think longer" on hard problems. This shifts the cost from training to inference—you pay per query, not per model.

How LRMs Are Built

1
Data: Long CoT traces
Generate or curate reasoning traces with verification steps
2
Training: RL from outcomes
Reward correct final answers, let model discover good reasoning
3
Inference: Variable compute
Harder problems get more thinking tokens automatically

Practical Implications

  • Cost model changes: Pay per reasoning step, not just per token
  • Latency increases: Expect 10-60 seconds for complex reasoning
  • When to use: Math, coding, logic puzzles—NOT chat or creativity
  • DIY option: Fine-tune on reasoning traces + RL (see DeepSeek R1)
Efficiency 📚 2025 Survey

Efficient Reasoning Models: A Survey

Feng et al. • Apr 2025

Addresses the efficiency crisis in reasoning models. Organizes solutions into three axes: Shorter (compressed CoT), Smaller (distilled models), Faster (accelerated decoding).

The Efficiency Crisis

Reasoning models achieve breakthrough accuracy but at extreme cost:

Metric | Standard LLM | Reasoning Model (o1) | Impact
Tokens per query | ~500 | ~50,000+ | 100x cost
Latency | 1-3 seconds | 30-120 seconds | UX impact
Cost per query | $0.001-0.01 | $0.10-1.00 | $$ scaling
GPU memory | Moderate | Long context needed | Infra cost

The Three Efficiency Axes

📏 SHORTER: Compress reasoning length
  • Concise CoT training
  • Reasoning distillation
  • Skip unnecessary steps
  • "Say less, reason better"
Goal: Same accuracy, fewer tokens

📦 SMALLER: Smaller model, same capability
  • Knowledge distillation
  • Reasoning transfer learning
  • Model compression + RL
  • 7B matching 70B
Goal: Commodity GPU deployment

⚡ FASTER: Accelerate inference
  • Speculative decoding
  • Parallel reasoning paths
  • Early exit strategies
  • KV cache optimization
Goal: Real-time reasoning

Key Techniques

  • Distillation: Train small model on big model's reasoning traces (DeepSeek-R1-Distill)
  • Adaptive compute: Easy problems → short thinking; Hard → long thinking
  • Compressed CoT: Train to reason in fewer tokens without accuracy loss
  • Speculative decoding: Draft tokens with small model, verify with large

Practical Takeaway

💡 The 10x Rule

For production deployment, reasoning efficiency needs to improve roughly 10x over the current state of the art. Watch distillation research closely—7B distilled models are already matching 70B on many benchmarks.

Logical 📚 2025 Survey

Logical Reasoning in Large Language Models: A Survey

Liu et al. • Feb 2025

Deep dive into formal logical reasoning: deductive, inductive, abductive, and analogical. Covers neuro-symbolic methods, RL approaches, and why LLMs still struggle with rigorous logic.

The Four Types of Logical Reasoning

Type | Direction | Example | LLM Performance
Deductive | General → Specific | "All humans are mortal. Socrates is human. ∴ Socrates is mortal." | Good with simple chains, fails on complex
Inductive | Specific → General | "Swan 1 is white, Swan 2 is white... ∴ All swans are white" | Prone to overgeneralization
Abductive | Effect → Best Cause | "The grass is wet. Best explanation: It rained." | Reasonable but not rigorous
Analogical | Domain A → Domain B | "Atom is like solar system: nucleus = sun, electrons = planets" | Creative but often superficial

Why LLMs Struggle with Formal Logic

💡 The Core Problem

LLMs do soft pattern matching, not hard symbolic manipulation. They've seen "Socrates is mortal" in training data, so they pattern-match to the right answer—but they can't reliably apply modus ponens to novel entities. The survey shows LLMs fail on negation ("not"), quantifiers ("all", "some"), and multi-hop reasoning.

Enhancement Approaches

  • Data-centric: Fine-tune on formal logic datasets (ProofWriter, FOLIO)
  • Neuro-symbolic: LLM generates logical forms → external solver executes
  • RL for logic: Reward valid inference chains
  • Constrained decoding: Force outputs to follow logical grammar

Practical Implications

  • Don't trust LLMs for formal proofs without verification
  • Use hybrid systems for logic-critical applications
  • Test edge cases: Negation, quantifier scope, contradiction
  • Best results: LLM proposes, symbolic system verifies
Trustworthy 📚 2025 Survey

A Comprehensive Survey on Trustworthiness in Reasoning Models and Chain-of-Thought

Wang et al. • 2025

Examines reasoning through the lens of trustworthiness: truthfulness, safety, robustness, fairness, and privacy. Critical for deploying reasoning systems in production.

The Five Pillars of Trustworthy Reasoning

Pillar | Definition | Failure Example | Mitigation
Truthfulness | Reasoning reflects reality | Confident but wrong math steps | Self-consistency, verification
Safety | No harmful outputs | Reasoning toward dangerous instructions | Output filtering, RLHF
Robustness | Works under attack | Jailbreaks via reasoning injection | Adversarial training
Fairness | Unbiased reasoning | Different conclusions for similar cases | Debiasing, auditing
Privacy | No data leakage | Reasoning reveals training examples | Differential privacy

The Amplification Problem

💡 Why Long Reasoning is Riskier

Extended reasoning (o1-style) amplifies both capabilities and risks. Each reasoning step is an opportunity for: (1) accumulating errors, (2) introducing biases, (3) generating unsafe content, (4) leaking private information. A 100-step reasoning chain has 100x the attack surface of a direct answer.

Key Findings

  • Faithfulness gap: CoT often doesn't reflect actual model computation
  • Sycophancy risk: Reasoning can be steered by user preferences
  • Jailbreak vectors: Reasoning steps can be exploited for attacks
  • Consistency issues: Same problem, different reasoning, different answers

Production Checklist

  • ✓ Verify reasoning chains independently of final answer
  • ✓ Monitor for reasoning drift over conversation
  • ✓ Test with adversarial inputs before deployment
  • ✓ Audit for demographic bias in reasoning paths
  • ✓ Implement early termination for suspicious reasoning
Implicit 📚 2025 Survey

Implicit Reasoning in Large Language Models

Li et al. • 2025

Can models reason without generating explicit CoT? This survey explores "implicit reasoning" via latent structures, trading off interpretability for efficiency.

The Core Trade-off: Explicit vs Implicit Reasoning

Explicit Reasoning (CoT)
Reasoning visible in output tokens
Q: What is 23 × 17?
A: Let me think step by step... 23 × 17 = 23 × (10 + 7) = 230 + 161 = 391
✓ Interpretable, debuggable
✗ Slow (100s-1000s tokens)
✗ Expensive ($$$)

Implicit Reasoning
Reasoning in hidden states
Q: What is 23 × 17?
A: 391
(Reasoning happened in the neural network activations, not visible to users)
✓ Fast (minimal tokens)
✓ Cheap
✗ Opaque, unverifiable
✗ Hard to debug failures

The Research Frontier

💡 The Best of Both Worlds?

Can we train models that reason implicitly (fast, cheap) but achieve explicit reasoning accuracy? Early research shows promise: "pause tokens" let models think without visible output, latent reasoning heads show computation happening in hidden states. The goal: o1-level reasoning at GPT-4 speed.

Approaches Surveyed

  • Pause tokens: Special tokens that give model "thinking time" without output
  • Latent reasoning: Train hidden states to perform computation
  • Compressed CoT: Distill long reasoning into internal representations
  • Recurrent depth: Loop through layers multiple times for harder problems

Why This Matters

  • Cost: Implicit reasoning could reduce inference cost 100x
  • Latency: Real-time reasoning applications become possible
  • Verification challenge: How do you trust reasoning you can't see?
  • Human parallel: We don't verbalize every thought—maybe AI shouldn't either