The Evolution of LLM Reasoning

A metro map of how language models learned to think step by step

From Chain-of-Thought to Large Reasoning Models • 2022–2025

Chain-of-Thought
Action & Agents
Tree & Search
Program-Based
Surveys & Meta
[Metro map: papers plotted by year (2022–2025) along each line — CoT (17.7k cites), Self-Consistency (4.2k), Least-to-Most (1.4k), CoT w/o Prompting (168), System 1→System 2 Survey (2025), ReAct (5.9k), Reflexion (2.9k), LRM Survey (2025), Tree of Thoughts (4.9k), LLM-Guided ToT (314), Efficient Reasoning Survey (2025), Program of Thoughts (1.1k)]

🚀 The Frontier: What's Next

The reasoning landscape is evolving rapidly. Here are the key trends shaping 2025 and beyond.

Test-Time Compute Scaling

The new scaling law: more thinking tokens = better results. o1/o3 and R1 prove you can trade inference cost for capability. The frontier is now efficiency—getting the same gains with fewer tokens.

o1 vs Gemini on same task: 138K vs 96 tokens

🔄 Verify-then-Generate

The winning pattern: generate candidates, verify with a separate model/tool, iterate. External verifiers (code execution, formal proofs, unit tests) unlock gains independent of base model quality.

Reflexion on HumanEval: +24% with self-verification

🧬 Reasoning Distillation

Large reasoning models (o1, R1) generate training data for smaller models. DeepSeek-R1-Distill-Qwen-7B matches GPT-4o on math. The moat isn't the model—it's the reasoning traces.

7B distilled vs 70B base: Often comparable

🎯 Adaptive Compute

Not all problems need the same thinking budget. The next frontier: models that know when to think longer. Easy questions get fast answers; hard ones get extended reasoning.

Goal: Right-size thinking per query

🔗 Tool-Augmented Reasoning

Pure language reasoning hits limits. The winners combine LLM reasoning with code execution (PoT), search (ReAct), and formal verification. Hybrid systems dominate benchmarks.

PoT vs CoT on GSM8K: +22% with code execution

📉 The Efficiency Crisis

"Slow thinking" generates massive token counts. Research explores: compressed CoT (say less), distillation (smaller models), speculative decoding (faster generation). Cost per reasoning step must drop 10-100x.

Current challenge: $0.10+ per complex query
The Bottom Line: We've proven LLMs can reason. The 2025 frontier is making that reasoning efficient, reliable, and verifiable. The techniques on this page are the building blocks.

🎯 The Decision Framework: Which Technique Should You Use?

Don't just learn these techniques—know when to use each one. This decision tree distills 10 papers into actionable guidance.

What type of problem are you solving?
Math/Computation
Does it require calculation?
Yes
→ Program of Thoughts

Let Python do the math. +22% on GSM8K.

No, just reasoning
→ Chain-of-Thought

Step-by-step natural language reasoning.

Needs External Info
Can model hallucinate facts?
No, need real data
→ ReAct

Interleave reasoning with tool calls.

Can self-correct
→ Reflexion

Try, fail, reflect, retry. +24% on HumanEval.

Puzzles/Planning
Might first approach be wrong?
Yes, need backtracking
→ Tree of Thoughts

Explore multiple paths. 4% → 74% on Game of 24.

No, path is clear
→ Chain-of-Thought

Single path reasoning is enough.

Complex/Compositional
→ Least-to-Most

Decompose, solve easy→hard. 99.7% on SCAN.
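
The same branching can be written as a small routing helper. A minimal sketch, assuming you already classify queries into the four problem types above; the category labels, flags, and technique names are illustrative, not from any library.

def choose_technique(problem_type, needs_calculation=False,
                     can_self_correct=False, may_need_backtracking=False):
    # Map the decision tree above onto a technique name
    if problem_type == "math":
        return "program-of-thoughts" if needs_calculation else "chain-of-thought"
    if problem_type == "external-info":
        # Reflexion if the model can evaluate and retry its own attempts,
        # ReAct if it must ground answers in real data via tools
        return "reflexion" if can_self_correct else "react"
    if problem_type == "puzzle-planning":
        return "tree-of-thoughts" if may_need_backtracking else "chain-of-thought"
    if problem_type == "compositional":
        return "least-to-most"
    return "chain-of-thought"  # sensible default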

⚡ The Quick Reference Matrix

Technique | Best For | Cost | Key Gain | Don't Use When
CoT | Multi-step reasoning | 1x | +40% GSM8K | Simple lookups, pattern matching
Self-Consistency | High-stakes accuracy | 10-40x | +16% over CoT | Budget-constrained, latency-critical
Least-to-Most | Compositional tasks | 2-5x | +84% SCAN | Non-decomposable problems
ReAct | Tool use, grounding | 3-10x | +10% FEVER | Closed-book reasoning
Reflexion | Iterative improvement | 2-5x | +24% HumanEval | No feedback signal available
Tree of Thoughts | Search/exploration | 10-100x | +70% Game24 | Simple problems, cost matters
Program of Thoughts | Math-heavy tasks | 1x + exec | +22% GSM8K | Non-computational reasoning

🚨 The Mistakes Everyone Makes

❌ Using CoT for everything

The mistake: Applying "think step by step" to simple factual questions.

Why it's wrong: CoT adds latency and can confabulate reasoning for simple lookups. "What's the capital of France?" doesn't need step-by-step.

Fix: Classify queries first. Use CoT only for multi-step reasoning.

❌ ToT on simple problems

The mistake: Using Tree of Thoughts for problems where CoT suffices.

Why it's wrong: ToT is 10-100x more expensive. If CoT works, you're burning money.

Fix: Try CoT first. Only use ToT when CoT consistently fails.

❌ Reflexion without feedback

The mistake: Trying to use Reflexion without a way to evaluate attempts.

Why it's wrong: Reflexion needs signal. No feedback = no learning = just burning tokens.

Fix: Ensure you have tests, verifiers, or ground truth before using Reflexion.

❌ Self-Consistency with temp=0

The mistake: Running Self-Consistency with greedy decoding.

Why it's wrong: Temperature=0 produces identical outputs. You're just generating the same answer N times.

Fix: Use temperature 0.5-0.7 for diversity.

❌ PoT for non-computational tasks

The mistake: Generating code for commonsense reasoning questions.

Why it's wrong: "Is a penguin a bird?" doesn't benefit from Python. The model will generate silly code.

Fix: Only use PoT when there's actual computation to offload.

❌ No max_steps in ReAct

The mistake: Running ReAct agents without step limits.

Why it's wrong: Agents can loop forever, searching repeatedly without finishing.

Fix: Always set max_steps (5-10 for most tasks).

🏭 Production Playbook

1
Start simple, scale complexity

Begin with basic prompting → add CoT if accuracy is low → add Self-Consistency for high-stakes → consider ToT only for genuinely hard problems. Most production systems don't need ToT.

2
Build a query classifier

Not all queries need the same treatment. A lightweight classifier (or even a regex) can route simple queries to fast paths and complex queries to expensive techniques. A regex-based router is sketched after this list.

3
Combine techniques strategically

PoT + Self-Consistency: Generate multiple code solutions, vote on outputs. ReAct + Reflexion: Retry failed tool calls with reflection. The techniques compose.

4
Measure what matters

Track: accuracy, latency (p50/p95/p99), cost per query, failure modes. A technique that's +5% accuracy but +1000% cost may not be worth it.
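
The router promised in step 2 can be as simple as a keyword check. A minimal sketch; the keyword list and length threshold are illustrative assumptions, and anything routed to "cot" can still be escalated later to the more expensive techniques.

import re

MULTI_STEP_CUES = re.compile(
    r"\b(calculate|how many|how much|total|average|per|compare|if|then|steps?)\b", re.I)

def route(query):
    # Short queries with no multi-step cues go to the fast path
    if len(query.split()) < 12 and not MULTI_STEP_CUES.search(query):
        return "direct"   # e.g. "What's the capital of France?"
    return "cot"          # multi-step reasoning gets the more expensive path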

Chain-of-Thought Line

CoT 📚 17,700+ citations

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei et al. (inc. Denny Zhou) • Google Brain • NeurIPS 2022

The foundational paper that started it all. Adding few-shot examples with explicit step-by-step reasoning dramatically boosts performance on arithmetic, symbolic, and commonsense reasoning.

The Core Idea: Why This Paper Changed Everything

Before CoT, the standard approach was direct prompting: "Q: What is 23 × 17? A: 391". The model had to make a single forward pass and hope it had "memorized" enough similar patterns. This paper showed that by asking the model to show its work, you unlock reasoning capabilities that simply don't exist in direct prompting.

💡 Key Learning

LLMs don't actually "reason" in a single forward pass—they pattern match. CoT works because it breaks complex problems into simpler pattern-matching steps that the model can do in one pass. Each intermediate step is easier than the whole problem.

How It Works: Step by Step

1
Provide Few-Shot Examples with Reasoning
Instead of just showing input→output pairs, show input→reasoning→output. The model learns the pattern of thinking through problems.
2
Model Generates Intermediate Steps
When given a new problem, the model follows the demonstrated pattern and generates its own reasoning chain before the answer.
3
Each Step Becomes Context for the Next
As the model generates each reasoning step, that text becomes part of the context for generating the next step—effectively "working memory."
4
Final Answer Follows from Reasoning
The answer is now a simple extraction from the reasoning chain, not a complex leap from the original problem.

Example Prompt: Standard vs Chain-of-Thought

❌ Standard Prompting

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: 11
Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:

✓ Chain-of-Thought Prompting

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:
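
In code, few-shot CoT is just string assembly plus a final-answer parse. A minimal sketch, assuming the same hypothetical llm.generate helper used in later snippets on this page; the demonstration and the "The answer is" parsing convention follow the prompt above.

import re

COT_EXAMPLE = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def cot_answer(question, llm):
    response = llm.generate(COT_EXAMPLE + f"Q: {question}\nA:")
    match = re.search(r"The answer is\s*(-?\d[\d,]*(?:\.\d+)?)", response)  # parse the final number
    return match.group(1) if match else response.strip()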

Benchmark Results

Benchmark | Standard | CoT | Gain
GSM8K (math) | 17.9% | 58.0% | +40.1%
SVAMP (math) | 58.8% | 78.2% | +19.4%
ASDiv (math) | 71.3% | 78.6% | +7.3%
AQuA (algebra) | 25.2% | 35.8% | +10.6%
StrategyQA | 65.4% | 73.2% | +7.8%

Why It Works: The Theoretical Insight

Transformers are inherently parallel architectures—they process all tokens simultaneously through a fixed number of layers. Without intermediate tokens, a single forward pass can only compute functions in the complexity class TC⁰ (constant-depth threshold circuits).

Many problems (like multi-digit arithmetic) are inherently serial—you must compute intermediate values before the final answer. CoT effectively gives the transformer "extra layers" through the sequential generation of tokens.

The Deep Insight

CoT length should match the problem's "serial depth." A 10-step calculation needs ~10 reasoning steps. This explains why CoT helps arithmetic but barely affects pattern matching.

Key Discoveries

  • Emergent ability: CoT only helps models ≥100B parameters. Smaller models generate plausible-looking but wrong reasoning.
  • 8 examples suffice: More examples don't help much after 8; the model gets the pattern.
  • Quality over quantity: Better reasoning examples matter more than more examples.
  • Zero-shot works: Just adding "Let's think step by step" works (but worse than few-shot).
  • Errors cascade: One wrong step usually means wrong final answer.

Limitations & When NOT to Use

  • Simple lookups: "What is the capital of France?" doesn't need reasoning.
  • Pattern matching: Tasks solvable by retrieval don't benefit.
  • Small models: Below 60B parameters, CoT often hurts performance.
  • Time-sensitive apps: Generating reasoning adds latency and cost.
  • Factual questions: CoT can confabulate convincing-but-wrong reasoning.

Practical Implementation Tips

🎯 Tip 1: Match Step Granularity

Your example reasoning should be at the right granularity. Too coarse = model skips important steps. Too fine = unnecessary verbosity and cost.

🎯 Tip 2: Include Diverse Examples

Cover different reasoning patterns in your examples. If all examples use the same approach, the model won't generalize well.

🎯 Tip 3: Verify the Reasoning

A correct final answer doesn't mean correct reasoning. Check intermediate steps—bad reasoning can get lucky.

Self-Consistency 📚 4,200+ citations

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang et al. (inc. Denny Zhou) • Google Research • ICLR 2023

Instead of greedy decoding, sample multiple diverse reasoning paths and take a majority vote. Turns CoT into an ensemble method with significant accuracy gains.

The Core Insight: Wisdom of Crowds for Reasoning

Self-Consistency is beautifully simple: correct reasoning paths are more likely to agree on the answer than incorrect ones. If you ask a model to solve a problem multiple times with some randomness, the correct answer should appear more often than any particular wrong answer.

💡 Key Learning

This technique turns CoT into an ensemble method. Different reasoning paths are like different "voters"—errors are random and cancel out, while the truth converges. It's the same principle behind random forests and ensemble learning.

How It Works: The Algorithm

1
Sample Multiple Reasoning Paths
Instead of greedy decoding (temperature=0), use temperature 0.5-0.7 to generate diverse reasoning chains. Sample N times (typically 5-40).
2
Extract Final Answers
Parse the final answer from each reasoning chain. The reasoning paths may differ wildly, but we only care about the final answer.
3
Majority Vote
Return the most frequent answer. Ties can be broken randomly or by taking the first occurrence.
Problem Path 1: ... → 42
Problem Path 2: ... → 42
Problem Path 3: ... → 37
Problem Path 4: ... → 42
────────────────────────────────
Majority Vote: 42 (3/4)

Benchmark Results

Benchmark | CoT | Self-Cons. | Gain
GSM8K | 58.0% | 74.4% | +16.4%
SVAMP | 78.2% | 86.6% | +8.4%
AQuA | 35.8% | 48.0% | +12.2%
StrategyQA | 73.2% | 79.1% | +5.9%
ARC-c | 85.2% | 88.7% | +3.5%

Why It Works: The Math

Assume the model has probability p of generating a correct reasoning path. With majority voting over n samples, the probability of a correct final answer approaches 1 as n increases (if p > 0.5).

The Statistics

If each sample has 60% accuracy independently, 5 samples give ~68% accuracy, 10 samples give ~75%, and 40 samples give ~83%. This is the Condorcet jury theorem applied to reasoning.

Crucially, different reasoning paths make different mistakes. Path A might misread a number and Path B might apply the wrong formula, but the correct answer emerges from the noise.

The Scaling Curve

Performance scales logarithmically with samples:

  • 1 sample: Baseline CoT accuracy
  • 5 samples: ~70% of potential gain
  • 10 samples: ~85% of potential gain
  • 40 samples: ~95% of potential gain
  • 100+ samples: Diminishing returns
💡 Practical Insight

For production, 5-10 samples is the sweet spot—most of the accuracy gain with manageable cost. Use 40+ only when accuracy is critical and cost isn't.

Implementation Details

from collections import Counter

def self_consistency(prompt, n_samples=10, temp=0.7):
    answers = []
    for _ in range(n_samples):
        response = llm.generate(prompt, temperature=temp)
        answer = extract_answer(response)
        answers.append(answer)
    # Majority vote
    vote = Counter(answers).most_common(1)[0][0]
    return vote

Critical Parameters

  • Temperature 0.5-0.7: Too low = identical paths, too high = nonsense
  • Top-p 0.95: Helps diversity without degeneracy
  • Parallel sampling: All samples can run concurrently
  • Answer extraction: Must reliably parse final answer (regex, delimiter)

Limitations & Gotchas

  • Cost: 10x samples = 10x tokens = 10x cost
  • Systematic errors: If the model is systematically wrong, voting won't help
  • Answer space: Works best when answers are discrete (numbers, multiple choice)
  • Open-ended tasks: Hard to vote on free-form text generation
  • Parse failures: Must handle cases where answer can't be extracted

Advanced Variations

📊 Weighted Voting

Weight votes by model confidence (log probability of the reasoning chain). Paths the model is more "sure" about get more weight.

🔄 Verifier Models

Train a separate model to score reasoning quality. Use scores instead of raw votes. OpenAI's approach for math reasoning.

🎯 Best-of-N

Instead of voting, use a reward model to pick the best single response. More flexible for open-ended tasks.
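
As a concrete illustration of the first variation, here is a weighted-voting sketch. The return_logprobs flag, the result.text and result.logprobs fields, and the extract_answer helper are assumptions about your LLM client, not a specific API.

import math
from collections import defaultdict

def weighted_self_consistency(prompt, llm, n_samples=10, temp=0.7):
    scores = defaultdict(float)
    for _ in range(n_samples):
        result = llm.generate(prompt, temperature=temp, return_logprobs=True)  # assumed API
        answer = extract_answer(result.text)
        confidence = math.exp(sum(result.logprobs) / len(result.logprobs))     # mean token probability
        scores[answer] += confidence                                           # weighted vote
    return max(scores, key=scores.get)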

Least-to-Most 📚 1,400+ citations

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Zhou et al. • Google Research • ICLR 2023

Decompose complex problems into simpler subproblems, solve from easiest to hardest. Each solution feeds into the next, enabling better generalization to harder problems.

The Key Insight: Structured Decomposition Beats Linear Reasoning

CoT asks the model to think through a problem in one shot—it figures out the structure while solving. Least-to-Most separates decomposition from solving: first figure out what subproblems exist, then solve them in order from simplest to most complex.

💡 Key Learning

The magic is in the ordering: solving easier subproblems first provides scaffolding for harder ones. Each solution becomes context for the next, building up to the final answer. This mimics how humans tackle complex problems—break them down, start simple.

The Two-Stage Process

1
Decomposition Stage
Ask: "To solve [original problem], what subproblems do we need to solve first?" The model outputs an ordered list from simplest to most complex.
2
Sequential Solving Stage
Solve subproblem 1 → add solution to context → solve subproblem 2 with that context → repeat until the final (hardest) subproblem is solved.
Original Problem
↓ Decompose
Sub-1 (easy) Sub-2 Sub-3 (hard)
↓ Solve sequentially
Solve Sub-1 +context → Solve Sub-2 +context → Solve Sub-3

Final Answer

Example: Symbol Mapping (SCAN)

Task "jump around left twice" → actions Stage 1: Decomposition Q: "jump around left twice" can be solved by first solving what subproblems? A: We need to solve: 1. "left" (simplest) 2. "jump left" 3. "jump around left" 4. "jump around left twice" (original) Stage 2: Solving Q: "left" → ? A: TURN LEFT Q: "left" means TURN LEFT. "jump left" → ? A: TURN LEFT JUMP Q: "left" means TURN LEFT. "jump left" means TURN LEFT JUMP. "jump around left" → ? A: TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP TURN LEFT JUMP Q: [all previous context] "jump around left twice" → ? A: [full solution]

Benchmark Results

Benchmark | CoT | Least-to-Most | Gain
SCAN (length gen.) | 16.0% | 99.7% | +83.7%
DROP (reading comp.) | 78.7% | 82.3% | +3.6%
GSM8K (math) | 58.0% | 62.4% | +4.4%
CFQ (compositional) | 33.1% | 94.3% | +61.2%
The SCAN Result is Remarkable

CoT completely fails at length generalization (16%)—if the test sequence is longer than training examples, it can't cope. L2M achieves 99.7% by learning to decompose rather than memorize.

Why Length Generalization Works

Standard CoT fails on problems longer than training examples because it learns to match patterns, not procedures. L2M forces the model to learn compositional rules:

  • Base cases: Simple primitives like "left" → TURN LEFT
  • Composition rules: How to combine primitives
  • Recursive application: Apply rules at any depth
💡 Key Insight

L2M essentially teaches the model a recursive algorithm through examples rather than a fixed pattern. This is why it generalizes to longer inputs—the algorithm works at any length.

When to Use Least-to-Most

  • Compositional tasks: Where the answer is built from smaller pieces
  • Length generalization: Test inputs longer than training
  • Hierarchical problems: Natural parent-child structure
  • Complex word problems: Multiple interdependent steps
  • Symbolic reasoning: Rule-based transformations

Comparison: CoT vs Least-to-Most

Aspect | Chain-of-Thought | Least-to-Most
Passes | Single pass | Two-pass (decompose + solve)
Structure | Implicit | Explicit decomposition
Context growth | Linear | Grows with each subproblem
Length gen. | Poor | Excellent
Cost | Lower | Higher (multiple calls)
Best for | Simple reasoning | Compositional tasks

Implementation Considerations

  • Decomposition prompt: "To solve X, what subproblems need to be solved first?"
  • Ordering is critical: Must go from easy → hard, not arbitrary
  • Context accumulation: Each solution is added to context for next subproblem
  • Multiple API calls: One for decomposition, one per subproblem
  • Context length: Can grow large—may need truncation strategy
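
A minimal sketch of the two-stage loop described above, assuming the same hypothetical llm.generate helper used elsewhere on this page; the decomposition prompt and line-based parsing are illustrative.

def least_to_most(problem, llm):
    # Stage 1: decomposition
    decomposition = llm.generate(
        f'To solve "{problem}", what subproblems need to be solved first? '
        "List them from simplest to hardest, one per line.")
    subproblems = [line.strip("-• 0123456789.").strip()
                   for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve sequentially, feeding each solution back into the context
    context, answer = "", ""
    for sub in subproblems + [problem]:
        answer = llm.generate(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"
    return answer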

Limitations

When NOT to Use

  • Simple problems: Overhead isn't worth it.
  • Non-compositional tasks: If the problem doesn't decompose naturally, L2M won't help.
  • Latency-sensitive apps: Multiple API calls add delay.
  • Ambiguous decomposition: If there's no clear subproblem structure, the decomposition stage may fail.

CoT w/o Prompting 📚 168 citations

Chain-of-Thought Reasoning Without Prompting

Wang & Zhou et al. • Google DeepMind • 2024

Discovers that CoT reasoning can emerge naturally without demonstrations by decoding with alternative top-k tokens. The model already "knows" how to reason—you just need to find it.

The Surprising Discovery: Models Already Know How to Reason

The standard assumption was that CoT prompting teaches the model to reason. This paper overturns that: the reasoning capability already exists in pretrained models—it's just not activated by default greedy decoding.

💡 The Key Insight

When you decode greedily (always pick the highest probability token), the model often jumps straight to an answer. But if you look at the 2nd, 3rd, or 10th most likely next token, you often find the start of a reasoning chain. The model "knows" it should reason—it just doesn't do it by default.

How It Works: CoT-Decoding

1
Ask a Question (No CoT Prompt)
Just ask "Q: What is 15 × 7?" without any "think step by step" instruction.
2
Explore Top-k First Tokens
Instead of greedy decode, look at the top k (e.g., 10) most likely first tokens. Some will be direct answers ("105"), others will be reasoning starters ("Let", "First", "15").
3
Continue Each Path
Decode each path to completion. The paths that start with reasoning tokens naturally generate CoT-style reasoning.
4
Select Based on Confidence
Answers generated via CoT paths tend to have higher confidence. Select the path where the model is most confident in the final answer.

What They Found in the Decoding Space

Q: "What is 15 × 7?"

Top-1: "105" (direct answer)
Top-2: "The" → "The answer is 105"
Top-3: "15" → "15 × 7 = 105"
Top-5: "Let" → "Let me calculate: 15 × 7 = 105" ✓
Top-8: "First" → "First, 15 × 7... = 105" ✓

The reasoning paths exist—they're just not the highest probability by default. Prompting with "think step by step" shifts probability mass toward these paths.
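
A rough sketch of CoT-decoding. The llm.top_first_tokens, llm.continue_from, and extract_answer_with_probs helpers are hypothetical stand-ins for an API that exposes top-k candidates and per-token probabilities; they are not the paper's code.

def cot_decode(question, llm, k=10):
    prompt = f"Q: {question}\nA:"
    best_answer, best_confidence = None, -1.0
    for token in llm.top_first_tokens(prompt, k=k):        # k alternative first tokens
        completion = llm.continue_from(prompt + token)      # greedily decode the rest of this path
        answer, token_probs = extract_answer_with_probs(completion)
        confidence = sum(token_probs) / len(token_probs)    # average probability of the answer tokens
        if confidence > best_confidence:
            best_answer, best_confidence = answer, confidence
    return best_answer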

Key Experimental Results

  • CoT paths exist: ~40% of questions had a CoT path in top-10 alternatives
  • Correlation with correctness: When CoT path exists, the answer is more often correct
  • Higher confidence: Model assigns higher probability to final answer when CoT is present
  • No prompting needed: The same reasoning emerges without any "think step by step" instruction

Why This Matters: Deep Implications

💡 Implication 1: Reasoning is Emergent

Models learn to reason from pretraining data (math textbooks, Stack Overflow, etc.). CoT prompting doesn't teach reasoning—it activates existing capability.

💡 Implication 2: Decoding Strategy Matters

Greedy decoding may be leaving performance on the table. Alternative decoding strategies (beam search variants, top-k exploration) could unlock hidden capabilities.

💡 Implication 3: Explains Emergence

This may explain why CoT only works at scale: smaller models may not have learned reasoning from pretraining data, so there's no latent capability to activate.

Practical Applications

  • Confidence calibration: Presence of CoT in alternatives indicates the model is confident
  • Automatic CoT detection: Check if reasoning paths exist in top-k to decide if a question needs more careful handling
  • Better decoding: Can bias decoding toward reasoning tokens for improved accuracy
  • Research direction: Opens new research into decoding-time interventions

Limitations

  • Compute intensive: Exploring top-k requires k× more decoding
  • Not always present: CoT paths only exist ~40% of the time
  • Smaller models: Effect is weaker in smaller models (they may not have learned reasoning)
  • Complex questions: For very complex problems, even the CoT paths may be wrong

Action & Agents Line

ReAct 📚 5,886 citations

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al. • Princeton + Google • ICLR 2023

The foundation of modern AI agents. Interleaves reasoning traces with actions (search, lookup, etc.), allowing models to plan, act, observe, and revise. Thought → Action → Observation loops.

The Foundational Insight: Reasoning and Acting Must Be Interleaved

Before ReAct, there were two separate paradigms: Chain-of-Thought (pure reasoning, no actions) and Action-Only (just execute actions, no explicit reasoning). Both had limitations—CoT couldn't gather new information, and action-only was uninterpretable and error-prone.

💡 The Key Learning

Humans don't think-then-act or act-then-think—we interleave them continuously. "I need to find X" → search → "Found Y, not X" → revise plan → search again. ReAct captures this interleaved process, making agents both more capable and more interpretable.

The ReAct Loop: Thought → Action → Observation

T
Thought
The model reasons about what to do next: "I need to find when Obama was born." This is explicit internal reasoning.
A
Action
The model chooses an action from the available toolkit: Search[Barack Obama], Lookup[born], Calculate[2024-1961], etc.
O
Observation
The environment returns a result: "Barack Obama was born August 4, 1961..." This feeds back into the next Thought.
Example ReAct Trace

Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?

Thought 1: I need to search Colorado orogeny to find the area that the eastern sector extends into.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building... The eastern sector extends into the High Plains.

Thought 2: The eastern sector extends into the High Plains. I need to search High Plains and find its elevation range.
Action 2: Search[High Plains]
Observation 2: High Plains refers to one of two distinct land regions: the High Plains in the United States...elevation range of 1,800 to 7,000 ft.

Thought 3: High Plains rise in elevation from around 1,800 to 7,000 ft, so the answer is 1,800 to 7,000 ft.
Action 3: Finish[1,800 to 7,000 ft]

Benchmark Results

Benchmark | CoT Only | Act Only | ReAct
HotpotQA (QA) | 28% | 25% | 34%
FEVER (fact-check) | 56% | 58% | 64%
ALFWorld (games) | N/A | 45% | 71%
WebShop (shopping) | N/A | 30% | 40%

Why Interleaving Works

  • Reasoning grounds actions: Thoughts prevent random or hallucinated actions
  • Actions ground reasoning: Real observations prevent hallucinated facts
  • Error recovery: When observations don't match expectations, thoughts can revise the plan
  • Interpretability: The thought trace shows exactly why each action was taken
💡 The Synergy

CoT alone hallucinates facts it doesn't know. Act-only makes random mistakes with no way to recover. ReAct combines the planning of CoT with the grounding of actions—each compensates for the other's weakness.

Standard Action Space

# ReAct's original action space for QA tasks:
Search[query]      # Search Wikipedia for a topic
Lookup[keyword]    # Find keyword in current page
Finish[answer]     # Complete task with final answer

# Extended action spaces in practice:
Calculate[expr]        # Evaluate mathematical expression
Code[python]           # Execute Python code
Browse[url]            # Navigate to a URL
API[endpoint, params]  # Call external API

The Modern Impact: Every Agent Framework Uses This

  • LangChain Agents: Direct implementation of ReAct loop
  • AutoGPT/BabyAGI: Extended ReAct with memory and task decomposition
  • OpenAI Function Calling: Structured version of Action step
  • Claude Tools: Same pattern—interleave reasoning with tool use
  • Microsoft Copilot: ReAct-style reasoning for code actions

Implementation Pattern

def react_loop(question, tools, max_steps=10):
    context = f"Question: {question}\n"
    for step in range(max_steps):
        # Generate Thought + Action
        response = llm.generate(context + "Thought:")
        thought, action = parse_thought_action(response)
        context += f"Thought {step}: {thought}\nAction {step}: {action}\n"
        # Execute action and get observation
        if action.startswith("Finish"):
            return extract_answer(action)
        observation = tools.execute(action)
        context += f"Observation {step}: {observation}\n"
    return "Failed to find answer"

Common Failure Modes

  • Infinite loops: Agent keeps searching without finishing (need max steps)
  • Hallucinated actions: Agent calls tools that don't exist (need validation)
  • Ignored observations: Agent ignores what it finds and proceeds with prior beliefs
  • Premature finish: Agent finishes before gathering enough info
  • Overthinking: Too many thoughts without action slows progress
Production Tip

Always set max_steps and validate action format. Most failures come from agents getting stuck in loops or calling invalid actions. Structured output (JSON) for actions helps reliability.
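
One way to implement that tip, sketched with a whitelist and a regex; the allowed names match the QA action space above, and anything that fails to parse is treated as a failed step.

import re

ALLOWED_ACTIONS = {"Search", "Lookup", "Finish"}

def parse_and_validate_action(text):
    # Accept only whitelisted actions of the form Name[argument]
    match = re.match(r"(\w+)\[(.*)\]$", text.strip())
    if not match or match.group(1) not in ALLOWED_ACTIONS:
        return None   # caller should re-prompt; the step still counts toward max_steps
    return match.group(1), match.group(2)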

ReAct vs Alternatives

Aspect | CoT Only | Act Only | ReAct
Can use tools? | ❌ No | ✓ Yes | ✓ Yes
Interpretable? | ✓ Yes | ❌ No | ✓ Yes
Error recovery? | ❌ No | ❌ No | ✓ Yes
Hallucination risk | High | Low | Low
Token cost | Low | Low | Medium
Latency | Low | Medium | High
Reflexion 📚 2,855 citations

Reflexion: Language Agents with Verbal Reinforcement Learning

Shinn et al. • Northeastern + MIT • NeurIPS 2023

Agents learn from failure by reflecting in natural language. After a failed attempt, the agent generates self-critique and stores it as memory for future tries. Verbal RL without weight updates.

The Key Insight: Learning from Failure in Natural Language

Traditional RL uses scalar rewards (success/failure, +1/-1) to update model weights. But LLMs can do something remarkable: they can reflect on why they failed in natural language and use that as context for the next attempt—no weight updates needed.

💡 The Breakthrough

Instead of training: "Trial 1 failed (reward=-1)", Reflexion does: "Trial 1 failed because I forgot to check the edge case where n=0. Next time, I should add explicit handling for empty inputs." This rich linguistic signal is far more informative than a scalar reward.

The Reflexion Loop

1
Act: Make an Attempt
The agent tries to complete the task (write code, navigate environment, answer question).
2
Evaluate: Get Feedback
Run tests, check answer, or get environment feedback. Binary (pass/fail) or detailed error messages.
3
Reflect: Analyze Failure
Generate a natural language reflection: "What went wrong? Why? What should I do differently?"
4
Store: Add to Memory
Store the reflection in episodic memory. This persists across attempts and tasks.
5
Retry: Attempt with Context
Try again with reflections in context. The agent now "remembers" what not to do.

Example: Coding Task with Reflexion

Trial 1
Task: Write a function to find the longest palindromic substring.

def longest_palindrome(s):
    # My attempt
    for length in range(len(s), 0, -1):
        for i in range(len(s) - length + 1):
            substr = s[i:i+length]
            if substr == substr[::-1]:
                return substr

Test Result: FAILED - Test case "" expected "", got None

Reflection
I failed because my function doesn't handle the empty string case. When s is empty, the for loop doesn't execute and None is returned implicitly. I should add a check for empty input at the start.

Trial 2 (with reflection in context)

def longest_palindrome(s):
    if not s:  # Handle empty string
        return ""
    for length in range(len(s), 0, -1):
        ...

Test Result: PASSED ✓

Benchmark Results

Benchmark | Without Reflexion | With Reflexion | Gain
HumanEval (code) | 67.0% | 91.0% | +24.0%
MBPP (code) | 70.0% | 77.1% | +7.1%
ALFWorld (games) | 75% | 97% | +22%
HotpotQA (QA) | 35% | 49% | +14%
WebShop (shopping) | 40% | 59% | +19%
The HumanEval Result

91% pass@1 on HumanEval with GPT-4 was state-of-the-art at publication—achieved purely through reflection, no fine-tuning. This shows how much performance is left on the table without iterative refinement.

Memory Architecture

  • Short-term memory: Current task trajectory (actions taken, observations received)
  • Episodic memory: Accumulated reflections from past failures
  • Semantic memory: General knowledge (comes from LLM pretraining)
💡 Design Insight

Reflections are stored as natural language "lessons learned." Unlike embeddings or weights, they're human-readable and can be curated/edited. You can even manually add reflections to bootstrap learning.

Why It's "Verbal RL"

Reflexion mirrors key RL concepts but in language:

RL Concept | Reflexion Equivalent
Reward signal | Success/failure feedback
Policy gradient | Natural language reflection
Experience replay | Episodic memory of reflections
Exploration | Different approaches in retries
Weight updates | Context updates (no training)

Implementation Tips

  • Reflection prompt: "Given the error/feedback above, what went wrong and how should I approach this differently?"
  • Memory management: Keep most recent N reflections, summarize older ones
  • Max retries: 2-3 iterations usually sufficient; more rarely helps
  • Task-specific feedback: Code tasks get test output; QA gets ground truth comparison
  • Reflection quality: More specific reflections work better than generic ones
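
Putting those tips together, a minimal sketch of the act → evaluate → reflect → retry loop. The llm.generate helper and the run_tests callback (your task-specific feedback signal) are assumptions, not the paper's code.

def reflexion_loop(task, llm, run_tests, max_retries=3):
    reflections = []                          # episodic memory of lessons learned
    solution = ""
    for attempt in range(max_retries):
        memory = "\n".join(f"Lesson {i + 1}: {r}" for i, r in enumerate(reflections))
        solution = llm.generate(f"{memory}\nTask: {task}\nSolution:")
        passed, feedback = run_tests(solution)            # evaluate: tests, oracle, etc.
        if passed:
            return solution
        reflections.append(llm.generate(                  # reflect on the failure in natural language
            f"Task: {task}\nAttempt:\n{solution}\nFeedback: {feedback}\n"
            "What went wrong, and what should be done differently next time?"))
    return solution                           # best effort after max_retries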

Limitations

  • Need for feedback: Requires some evaluation signal (tests, oracle, etc.)
  • Systematic errors: If model doesn't know the right approach, reflection won't help
  • Compute cost: Multiple attempts per task increases cost
  • Latency: Sequential retries add time
  • Quality ceiling: Eventually converges—more retries don't always help

Reflexion vs Other Self-Improvement Methods

Method | Requires Training? | Multi-attempt? | Memory? | Best For
Self-Consistency | No | Parallel | No | Answer selection
Self-Refine | No | Sequential | No | Output polish
Reflexion | No | Sequential | Yes (reflections) | Learning from failure
RLHF | Yes | N/A | Weights | General improvement

Tree & Search Line

ToT 📚 4,856 citations

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Yao et al. • Princeton + Google DeepMind • NeurIPS 2023

Generalizes CoT to tree search. Generate multiple "thoughts," evaluate them, and explore/backtrack. Enables deliberate planning and lookahead that linear CoT cannot achieve.

The Key Insight: Reasoning as Search, Not a Path

Chain-of-Thought generates a single linear path through the reasoning space. But many problems require exploration—you might go down a wrong path and need to backtrack. Tree of Thoughts models reasoning as search over a tree, enabling the model to explore multiple branches and back up from dead ends.

💡 The Breakthrough

Some problems have no clear solution path until you've tried a few approaches. ToT lets the model say "let me try approach A... that didn't work... let me try approach B" instead of committing to one path. This is how humans solve puzzles.

The Four Components of ToT

1
Thought Decomposition
Define what a "thought" is for this problem—a line of code, a sentence, a partial solution, an equation. The granularity matters: too small = too many branches, too large = loses the benefit.
2
Thought Generator
Generate k candidate next-thoughts from the current state. Either sample (generate k different outputs) or propose (ask LLM to suggest k options in one call).
3
State Evaluator
Use the LLM to assess how promising each state is: "Can this partial solution lead to success?" Returns a score or categorical rating (sure/maybe/impossible).
4
Search Algorithm
Use BFS (breadth-first: explore level by level) or DFS (depth-first: go deep then backtrack) to navigate the tree. Which to use depends on the problem structure.

Example: Game of 24

Problem: Use numbers 4, 5, 6, 10 with +, -, *, / to make 24. Each number used exactly once.

Tree Search

Root: {4, 5, 6, 10}

Level 1 - Generate operations:
├── 10 - 6 = 4  → {4, 4, 5}    [Evaluator: "maybe - two 4s could work"]
├── 10 - 4 = 6  → {5, 6, 6}    [Evaluator: "maybe"]
├── 10 + 6 = 16 → {4, 5, 16}   [Evaluator: "sure - 16 + 4 + 5 - 1 close"]
├── 5 * 4 = 20  → {6, 10, 20}  [Evaluator: "impossible - too big"] ✗ PRUNE
└── ...

Level 2 - Continue from promising nodes:
From {4, 4, 5}:
├── 4 + 4 = 8  → {5, 8}
│   └── 5 * 8 = 40 ✗
├── 4 * 4 = 16 → {5, 16}
│   └── ...
└── 5 - 4 = 1  → {1, 4}
    └── ... nope, backtrack

Eventually find a valid path, e.g. (10 - 6) * 5 + 4 = 24 ✓

Benchmark Results

Task | CoT | ToT (BFS) | Gain
Game of 24 | 4.0% | 74.0% | +70%
Creative Writing | 6.2 | 7.6 | +1.4
Mini Crosswords | 15.6% | 35.4% | +20%
The Game of 24 Result

CoT achieves only 4% on Game of 24 because it commits to operations that might be wrong. ToT achieves 74% by exploring multiple operation sequences and backtracking. This is exactly the type of problem where search beats linear reasoning.

BFS vs DFS: When to Use Which

Strategy | How It Works | Best For
BFS | Explore all level-n nodes before level-n+1 | When solutions are likely at similar depths
DFS | Go deep on one path, backtrack if stuck | When you need to see full solutions to evaluate

Game of 24: Uses BFS because partial equations can be evaluated.
Creative Writing: Uses DFS because you need full paragraphs to judge coherence.

The Evaluator Prompt

# Example evaluator prompt for Game of 24:
Evaluate if given numbers can reach 24 (sure/maybe/impossible).

10 14
10 + 14 = 24
sure

11 12
11 + 12 = 23, 11 * 12 = 132, 12 - 11 = 1, 11 / 12 = 0.91
impossible

4 4 10
4 + 4 + 10 = 18, 4 * 4 - 10 = 6, (10 - 4) * 4 = 24
sure

{input_numbers}
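
The four components combine into a short breadth-first loop. A compact sketch, where propose_thoughts and evaluate_state are hypothetical helpers wrapping prompts like the evaluator above (e.g. mapping sure/maybe/impossible to a numeric score):

def tot_bfs(problem, llm, depth=3, breadth=5, k=3):
    frontier = [problem]                                   # each state = problem + partial reasoning
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(llm, state, k=k):       # thought generator
                candidates.append(state + "\n" + thought)
        scored = [(evaluate_state(llm, s), s) for s in candidates]  # state evaluator
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [s for _, s in scored[:breadth]]                 # keep only the most promising states
    return frontier[0] if frontier else None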

When to Use ToT

  • Combinatorial puzzles: Game of 24, Sudoku, crosswords
  • Creative generation: Where multiple paths could work
  • Planning problems: Route finding, scheduling
  • Hard reasoning: When first attempt likely fails
  • NOT for: Simple QA, factual questions, or problems with clear solution paths

Cost-Benefit Analysis

  • Compute: 10-100x more LLM calls than single CoT
  • Latency: Sequential exploration adds significant time
  • When worth it: Hard problems where CoT fails, high-stakes accuracy
  • When not worth it: Simple problems, latency-critical, budget-constrained
💡 Practical Guidance

Try CoT first. If it fails consistently on a problem type, consider ToT. For production, use ToT selectively on the hardest queries only—use a classifier to decide.

LLM-Guided ToT 📚 314 citations

Large Language Model Guided Tree-of-Thought

Long et al. • 2023

Uses an LLM to guide the tree exploration itself, rather than fixed BFS/DFS. The LLM learns which branches to explore, making search more efficient and targeted.

Key Improvements

  • LLM decides which node to expand next
  • More efficient than blind BFS/DFS
  • Learns from evaluation history
  • Adaptive search depth based on problem

Core Technique

Instead of fixed search strategy, prompt the LLM with the current tree state and ask: "Which thought should we explore next?" The LLM acts as both the thought generator and the search controller.

Program-Based Line

PoT 📚 1,083 citations

Program of Thoughts Prompting: Disentangling Computation from Reasoning

Chen et al. • CMU + Salesforce • 2022

Instead of reasoning in natural language, generate Python code that performs the computation. Execute the code to get the answer. Separates "what to compute" from "how to compute."

The Key Insight: Let the Computer Do the Computing

CoT asks the LLM to both reason about what to compute AND perform the computation. But LLMs are bad at arithmetic—they make mistakes on "what is 37 × 89?". PoT separates these: the LLM writes code describing the computation, then Python executes it exactly.

💡 The Breakthrough

LLMs are reasoning engines, not calculators. When you ask an LLM "what is 127 × 38 + 94?", it's doing pattern matching—and sometimes gets it wrong. But when you ask "write code to compute this", it generates code and Python gets it exactly right every time.

CoT vs PoT: Side by Side

Chain-of-Thought

Q: Janet pays $40/hour. She works 8 hours on Mon, Tue, Wed, and 6 hours on Thu, Fri. Weekly pay?
A: Mon-Wed: 40 × 8 = 320 per day
   Three days: 320 × 3 = 960
   Thu-Fri: 40 × 6 = 240 per day
   Two days: 240 × 2 = 480
   Total: 960 + 480 = 1440
❌ Manual math can have errors

Program of Thoughts

hourly = 40
mon_wed = hourly * 8 * 3
thu_fri = hourly * 6 * 2
total = mon_wed + thu_fri
print(total)  # 1440
✓ Python is always exact
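
A minimal sketch of the PoT loop, assuming the hypothetical llm.generate helper used elsewhere on this page; in production the exec call must run inside a sandbox.

def program_of_thoughts(question, llm):
    code = llm.generate(
        "Write Python code that computes the answer and stores it in a variable named ans.\n"
        f"Question: {question}\nCode:")
    scope = {}
    exec(code, scope)            # never exec untrusted model output outside a sandbox
    return scope.get("ans")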

Benchmark Results

Benchmark | CoT | PoT | Gain
GSM8K (math) | 58.0% | 80.0% | +22%
AQuA (algebra) | 35.8% | 58.0% | +22%
SVAMP (math) | 78.2% | 88.0% | +10%
TabMWP (tables) | 62.0% | 76.8% | +15%

When PoT Excels

  • Multi-step arithmetic: Compound interest, tax calculations
  • Iterative problems: "After N days...", simulation
  • Table/data operations: Filtering, aggregating
  • NOT for: Commonsense reasoning, strategy problems
The Pattern

Use PoT whenever the answer requires computation that Python can execute. If the answer is a number from calculation, PoT beats CoT.

Limitations

  • Execution needed: Must run Python safely (sandbox)
  • Code errors: Syntax/runtime errors still happen
  • Logic errors: Correct math on wrong formula = wrong answer
  • Non-computational tasks: Can't help with reasoning-only problems

2025 Survey Papers: The State of the Art

System 1→2 📚 2025 Survey

From System 1 to System 2: A Survey of Reasoning Large Language Models

Li et al. • Feb 2025

Comprehensive survey covering the transition from "fast thinking" (System 1) to "slow, deliberate reasoning" (System 2) in LLMs. Covers construction methods, benchmarks, and open problems.

The Core Framework: System 1 vs System 2

This survey borrows Daniel Kahneman's dual-process theory to explain LLM evolution:

System 1 (Fast Thinking)
  • Pattern matching on training data
  • Single forward pass per token
  • "What's 2+2?" → "4" (instant)
  • Prone to hallucination on novel tasks
  • GPT-4, Claude 3.5 in standard mode

System 2 (Slow Thinking)
  • Deliberate step-by-step reasoning
  • Many tokens of "thinking" before answer
  • "What's 127×38?" → [long reasoning] → "4826"
  • More accurate on complex reasoning
  • o1, o3, R1, Claude in thinking mode

The Progression: How We Got to Reasoning Models

1
Prompt Engineering Era (2022)
CoT, Self-Consistency, Least-to-Most. External prompts unlock reasoning without changing the model.
2
Supervised Fine-Tuning (2023)
Train on reasoning traces. Models learn to generate CoT by default without prompting.
3
Reinforcement Learning (2024)
RL from outcome feedback. o1 and R1 learn to reason through trial-and-error, not just imitation.
4
Test-Time Scaling (2024-25)
More thinking tokens at inference = better answers. Trading compute for capability.

Key Insight for Practitioners

💡 The Trade-off

System 2 reasoning costs 10-100x more tokens but solves problems System 1 cannot. The skill is knowing when to use each—don't pay System 2 costs for System 1 problems.

Survey Coverage

  • 150+ papers systematically reviewed
  • Construction methods: prompting → SFT → RL → hybrid
  • Benchmarks: GSM8K, MATH, ARC, coding tasks
  • Open problems: efficiency, verification, generalization
LRM Survey 📚 2025 Survey

Towards Large Reasoning Models: A Survey of Reinforced Test-Time Scaling

Xu et al. • Jan 2025

Focuses on the transition from LLMs to "Large Reasoning Models" through test-time scaling. Covers o1-style long CoT, automated data construction, and RL-based learning-to-reason.

The New Paradigm: LLM → LRM

This survey defines the transition from Large Language Models to Large Reasoning Models—models specifically trained to "think" before answering:

Aspect | LLM (GPT-4, Claude) | LRM (o1, R1)
Output style | Direct answer | Thinking trace → answer
Tokens per response | 100s-1000s | 10,000s-100,000s
Training | Next-token prediction | RL on reasoning outcomes
Scaling axis | Parameters (train-time) | Thinking tokens (test-time)
Best for | Creative, conversational | Math, logic, coding

The Test-Time Scaling Revolution

Traditional scaling: "Bigger model = better results" (train-time compute). The LRM insight:

💡 Test-Time Scaling

More thinking tokens at inference = better reasoning, even with the same model. o1 can "think longer" on hard problems. This shifts the cost from training to inference—you pay per query, not per model.

How LRMs Are Built

1
Data: Long CoT traces
Generate or curate reasoning traces with verification steps
2
Training: RL from outcomes
Reward correct final answers, let model discover good reasoning
3
Inference: Variable compute
Harder problems get more thinking tokens automatically

Practical Implications

  • Cost model changes: Pay per reasoning step, not just per token
  • Latency increases: Expect 10-60 seconds for complex reasoning
  • When to use: Math, coding, logic puzzles—NOT chat or creativity
  • DIY option: Fine-tune on reasoning traces + RL (see DeepSeek R1)
Efficiency 📚 2025 Survey

Efficient Reasoning Models: A Survey

Feng et al. • Apr 2025

Addresses the efficiency crisis in reasoning models. Organizes solutions into three axes: Shorter (compressed CoT), Smaller (distilled models), Faster (accelerated decoding).

The Efficiency Crisis

Reasoning models achieve breakthrough accuracy but at extreme cost:

Metric | Standard LLM | Reasoning Model (o1) | Impact
Tokens per query | ~500 | ~50,000+ | 100x cost
Latency | 1-3 seconds | 30-120 seconds | UX impact
Cost per query | $0.001-0.01 | $0.10-1.00 | $$ scaling
GPU memory | Moderate | Long context needed | Infra cost

The Three Efficiency Axes

📏 SHORTER: Compress reasoning length
  • Concise CoT training
  • Reasoning distillation
  • Skip unnecessary steps
  • "Say less, reason better"
Goal: Same accuracy, fewer tokens

📦 SMALLER: Smaller model, same capability
  • Knowledge distillation
  • Reasoning transfer learning
  • Model compression + RL
  • 7B matching 70B
Goal: Commodity GPU deployment

⚡ FASTER: Accelerate inference
  • Speculative decoding
  • Parallel reasoning paths
  • Early exit strategies
  • KV cache optimization
Goal: Real-time reasoning

Key Techniques

  • Distillation: Train small model on big model's reasoning traces (DeepSeek-R1-Distill)
  • Adaptive compute: Easy problems → short thinking; Hard → long thinking
  • Compressed CoT: Train to reason in fewer tokens without accuracy loss
  • Speculative decoding: Draft tokens with small model, verify with large

Practical Takeaway

💡 The 10x Rule

For production deployment, reasoning efficiency needs to improve roughly 10x over the current state of the art. Watch distillation research closely—7B distilled models are already matching 70B on many benchmarks.

Logical 📚 2025 Survey

Logical Reasoning in Large Language Models: A Survey

Liu et al. • Feb 2025

Deep dive into formal logical reasoning: deductive, inductive, abductive, and analogical. Covers neuro-symbolic methods, RL approaches, and why LLMs still struggle with rigorous logic.

The Four Types of Logical Reasoning

Type | Direction | Example | LLM Performance
Deductive | General → Specific | "All humans are mortal. Socrates is human. ∴ Socrates is mortal." | Good with simple chains, fails on complex
Inductive | Specific → General | "Swan 1 is white, Swan 2 is white... ∴ All swans are white" | Prone to overgeneralization
Abductive | Effect → Best Cause | "The grass is wet. Best explanation: It rained." | Reasonable but not rigorous
Analogical | Domain A → Domain B | "Atom is like solar system: nucleus = sun, electrons = planets" | Creative but often superficial

Why LLMs Struggle with Formal Logic

💡 The Core Problem

LLMs do soft pattern matching, not hard symbolic manipulation. They've seen "Socrates is mortal" in training data, so they pattern-match to the right answer—but they can't reliably apply modus ponens to novel entities. The survey shows LLMs fail on negation ("not"), quantifiers ("all", "some"), and multi-hop reasoning.

Enhancement Approaches

  • Data-centric: Fine-tune on formal logic datasets (ProofWriter, FOLIO)
  • Neuro-symbolic: LLM generates logical forms → external solver executes
  • RL for logic: Reward valid inference chains
  • Constrained decoding: Force outputs to follow logical grammar

Practical Implications

  • Don't trust LLMs for formal proofs without verification
  • Use hybrid systems for logic-critical applications
  • Test edge cases: Negation, quantifier scope, contradiction
  • Best results: LLM proposes, symbolic system verifies
Trustworthy 📚 2025 Survey

A Comprehensive Survey on Trustworthiness in Reasoning Models and Chain-of-Thought

Wang et al. • 2025

Examines reasoning through the lens of trustworthiness: truthfulness, safety, robustness, fairness, and privacy. Critical for deploying reasoning systems in production.

The Five Pillars of Trustworthy Reasoning

Pillar | Definition | Failure Example | Mitigation
Truthfulness | Reasoning reflects reality | Confident but wrong math steps | Self-consistency, verification
Safety | No harmful outputs | Reasoning toward dangerous instructions | Output filtering, RLHF
Robustness | Works under attack | Jailbreaks via reasoning injection | Adversarial training
Fairness | Unbiased reasoning | Different conclusions for similar cases | Debiasing, auditing
Privacy | No data leakage | Reasoning reveals training examples | Differential privacy

The Amplification Problem

💡 Why Long Reasoning is Riskier

Extended reasoning (o1-style) amplifies both capabilities and risks. Each reasoning step is an opportunity for: (1) accumulating errors, (2) introducing biases, (3) generating unsafe content, (4) leaking private information. A 100-step reasoning chain has 100x the attack surface of a direct answer.

Key Findings

  • Faithfulness gap: CoT often doesn't reflect actual model computation
  • Sycophancy risk: Reasoning can be steered by user preferences
  • Jailbreak vectors: Reasoning steps can be exploited for attacks
  • Consistency issues: Same problem, different reasoning, different answers

Production Checklist

  • ✓ Verify reasoning chains independently of final answer
  • ✓ Monitor for reasoning drift over conversation
  • ✓ Test with adversarial inputs before deployment
  • ✓ Audit for demographic bias in reasoning paths
  • ✓ Implement early termination for suspicious reasoning
Implicit 📚 2025 Survey

Implicit Reasoning in Large Language Models

Li et al. • 2025

Can models reason without generating explicit CoT? This survey explores "implicit reasoning" via latent structures, trading off interpretability for efficiency.

The Core Trade-off: Explicit vs Implicit Reasoning

Explicit Reasoning (CoT)
Reasoning visible in output tokens
Q: What is 23 × 17?
A: Let me think step by step... 23 × 17 = 23 × (10 + 7) = 230 + 161 = 391
✓ Interpretable, debuggable
✗ Slow (100s-1000s tokens)
✗ Expensive ($$$)

Implicit Reasoning
Reasoning in hidden states
Q: What is 23 × 17?
A: 391
(Reasoning happened in the neural network activations, not visible to users)
✓ Fast (minimal tokens)
✓ Cheap
✗ Opaque, unverifiable
✗ Hard to debug failures

The Research Frontier

💡 The Best of Both Worlds?

Can we train models that reason implicitly (fast, cheap) but achieve explicit reasoning accuracy? Early research shows promise: "pause tokens" let models think without visible output, latent reasoning heads show computation happening in hidden states. The goal: o1-level reasoning at GPT-4 speed.

Approaches Surveyed

  • Pause tokens: Special tokens that give model "thinking time" without output
  • Latent reasoning: Train hidden states to perform computation
  • Compressed CoT: Distill long reasoning into internal representations
  • Recurrent depth: Loop through layers multiple times for harder problems

Why This Matters

  • Cost: Implicit reasoning could reduce inference cost 100x
  • Latency: Real-time reasoning applications become possible
  • Verification challenge: How do you trust reasoning you can't see?
  • Human parallel: We don't verbalize every thought—maybe AI shouldn't either