A metro map of how language models learned to think step by step
The reasoning landscape is evolving rapidly. Here are the key trends shaping 2025 and beyond.
The new scaling law: more thinking tokens = better results. o1/o3 and R1 prove you can trade inference cost for capability. The frontier is now efficiency—getting the same gains with fewer tokens.
The winning pattern: generate candidates, verify with a separate model/tool, iterate. External verifiers (code execution, formal proofs, unit tests) unlock gains independent of base model quality.
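A minimal sketch of that loop, with `generate` standing in for any LLM call and `verify` for any external check (test suite, proof checker, code execution); both names are placeholders, not a specific API:

```python
from typing import Callable

def solve_with_verifier(
    task: str,
    generate: Callable[[str, list[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    max_rounds: int = 5,
) -> str | None:
    feedback: list[str] = []
    for _ in range(max_rounds):
        candidate = generate(task, feedback)   # propose a candidate solution
        ok, report = verify(candidate)         # external check: tests, proofs, execution
        if ok:
            return candidate                   # verified; stop spending tokens
        feedback.append(report)                # feed the failure into the next attempt
    return None                                # budget exhausted without a verified answer
```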
Large reasoning models (o1, R1) generate training data for smaller models. DeepSeek-R1-Distill-Qwen-7B matches GPT-4o on math. The moat isn't the model—it's the reasoning traces.
Not all problems need the same thinking budget. The next frontier: models that know when to think longer. Easy questions get fast answers; hard ones get extended reasoning.
Pure language reasoning hits limits. The winners combine LLM reasoning with code execution (PoT), search (ReAct), and formal verification. Hybrid systems dominate benchmarks.
"Slow thinking" generates massive token counts. Research explores: compressed CoT (say less), distillation (smaller models), speculative decoding (faster generation). Cost per reasoning step must drop 10-100x.
Don't just learn these techniques—know when to use each one. This decision tree distills 10 papers into actionable guidance.
- Program of Thoughts: Let Python do the math. +22% on GSM8K.
- Chain-of-Thought: Step-by-step natural language reasoning; use it when a single reasoning path is enough.
- ReAct: Interleave reasoning with tool calls.
- Reflexion: Try, fail, reflect, retry. +24% on HumanEval.
- Tree of Thoughts: Explore multiple paths. 4% → 74% on Game of 24.
- Least-to-Most: Decompose, solve easy→hard. 99.7% on SCAN.
| Technique | Best For | Cost (vs. one CoT pass) | Key Gain | Don't Use When |
|---|---|---|---|---|
| CoT | Multi-step reasoning | 1x | +40% GSM8K | Simple lookups, pattern matching |
| Self-Consistency | High-stakes accuracy | 10-40x | +16% over CoT | Budget-constrained, latency-critical |
| Least-to-Most | Compositional tasks | 2-5x | +84% SCAN | Non-decomposable problems |
| ReAct | Tool use, grounding | 3-10x | +10% FEVER | Closed-book reasoning |
| Reflexion | Iterative improvement | 2-5x | +24% HumanEval | No feedback signal available |
| Tree of Thoughts | Search/exploration | 10-100x | +70% Game24 | Simple problems, cost matters |
| Program of Thoughts | Math-heavy tasks | 1x + exec | +22% GSM8K | Non-computational reasoning |
The mistake: Applying "think step by step" to simple factual questions.
Why it's wrong: CoT adds latency and can confabulate reasoning for simple lookups. "What's the capital of France?" doesn't need step-by-step.
Fix: Classify queries first. Use CoT only for multi-step reasoning.
The mistake: Using Tree of Thoughts for problems where CoT suffices.
Why it's wrong: ToT is 10-100x more expensive. If CoT works, you're burning money.
Fix: Try CoT first. Only use ToT when CoT consistently fails.
The mistake: Trying to use Reflexion without a way to evaluate attempts.
Why it's wrong: Reflexion needs signal. No feedback = no learning = just burning tokens.
Fix: Ensure you have tests, verifiers, or ground truth before using Reflexion.
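A minimal Reflexion sketch that uses unit tests as the feedback signal; `llm(prompt)` is a hypothetical completion callable, and the subprocess test runner is just one way to get a pass/fail signal:

```python
import subprocess
import sys
import tempfile

def run_tests(code: str, tests: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute the candidate code plus its tests in a subprocess; return (passed, log)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=timeout)
    return proc.returncode == 0, proc.stderr

def reflexion_code(task: str, tests: str, llm, max_attempts: int = 3) -> str | None:
    reflections: list[str] = []                  # verbal memory of past failures
    for _ in range(max_attempts):
        prompt = task
        if reflections:
            prompt += "\n\nLessons from earlier failed attempts:\n" + "\n".join(reflections)
        code = llm(prompt)
        passed, log = run_tests(code, tests)     # the feedback signal Reflexion needs
        if passed:
            return code
        # Self-critique stored as memory for the next try (no weight updates).
        reflections.append(llm(
            f"This code failed with:\n{log}\n\n{code}\n"
            "In two sentences, what should the next attempt do differently?"
        ))
    return None
```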
The mistake: Running Self-Consistency with greedy decoding.
Why it's wrong: Temperature=0 produces identical outputs. You're just generating the same answer N times.
Fix: Use temperature 0.5-0.7 for diversity.
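A minimal self-consistency sketch; `llm(prompt, temperature=...)` and `extract_answer` are hypothetical stand-ins for your model call and answer parser:

```python
from collections import Counter

def self_consistency(prompt: str, llm, extract_answer, n: int = 20, temperature: float = 0.7) -> str:
    # temperature=0 would return the same reasoning path n times and defeat the vote.
    answers = [extract_answer(llm(prompt, temperature=temperature)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]   # majority-voted final answer
```

`n` and `temperature` are the dials: the 10-40x cost in the table above is exactly this sampling budget.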
The mistake: Generating code for commonsense reasoning questions.
Why it's wrong: "Is a penguin a bird?" doesn't benefit from Python. The model will generate silly code.
Fix: Only use PoT when there's actual computation to offload.
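A minimal PoT sketch, assuming a hypothetical `llm` callable; the bare `exec` is for illustration and should be sandboxed in any real system:

```python
POT_TEMPLATE = (
    "Write Python code that computes the answer to the question below. "
    "Store the final result in a variable named `answer`.\n\n"
    "Question: {question}\n"
)

def program_of_thoughts(question: str, llm):
    code = llm(POT_TEMPLATE.format(question=question))
    namespace: dict = {}
    exec(code, namespace)            # offload the arithmetic to the interpreter
    return namespace.get("answer")
```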
The mistake: Running ReAct agents without step limits.
Why it's wrong: Without a cap, an agent can loop indefinitely, repeating the same searches without ever committing to a final answer.
Fix: Always set max_steps (5-10 for most tasks).
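A sketch of a capped ReAct loop; the `llm` callable, the `tools` dict, and the `Action: tool[arg]` / `Final Answer:` line formats are illustrative assumptions, not a fixed protocol:

```python
import re

def react(question: str, llm, tools: dict, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):                        # hard cap: never loop forever
        step = llm(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"
        if (final := re.search(r"Final Answer:\s*(.+)", step)):
            return final.group(1).strip()
        if (action := re.search(r"Action:\s*(\w+)\[(.*)\]", step)):
            name, arg = action.groups()
            observation = tools.get(name, lambda _: f"unknown tool: {name}")(arg)
            transcript += f"Observation: {observation}\n"
    return "Stopped: max_steps reached without a final answer."
```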
Begin with basic prompting → add CoT if accuracy is low → add Self-Consistency for high-stakes → consider ToT only for genuinely hard problems. Most production systems don't need ToT.
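One way to express that ladder, with `ladder` and `is_good_enough` as placeholders for your own techniques and acceptance check:

```python
def answer_with_escalation(question: str, ladder, is_good_enough):
    """`ladder` is an ordered list of (name, solve_fn) pairs, cheapest technique first;
    `is_good_enough` is whatever check you trust: a verifier, a confidence score, a judge."""
    answer = None
    for name, solve in ladder:
        answer = solve(question)
        if is_good_enough(question, answer):
            return name, answer        # stop escalating as soon as the cheap path suffices
    return "exhausted", answer         # best effort after the whole ladder
```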
Not all queries need the same treatment. A lightweight classifier (or even a regex) can route simple queries to fast paths and complex queries to expensive techniques.
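A regex-only router sketch; the patterns and route names are illustrative, and a small classifier model would slot into the same function:

```python
import re

MATH = re.compile(r"\d+\s*[-+*/%]\s*\d+|\bhow many\b|\bcalculate\b", re.I)
MULTI_STEP = re.compile(r"\bwhy\b|\bexplain\b|\bcompare\b|\bplan\b|\bprove\b", re.I)

def route(query: str) -> str:
    if MATH.search(query):
        return "program_of_thoughts"   # real computation: offload to code
    if MULTI_STEP.search(query):
        return "chain_of_thought"      # multi-step reasoning: pay for CoT
    return "direct"                    # simple lookups skip the reasoning tax
```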
PoT + Self-Consistency: Generate multiple code solutions, vote on outputs. ReAct + Reflexion: Retry failed tool calls with reflection. The techniques compose.
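A sketch of the first composition, sampling several candidate programs and voting on their executed outputs; `llm` and the prompt template are hypothetical:

```python
from collections import Counter

POT_SC_PROMPT = (
    "Write Python code that computes the answer to the question below "
    "and stores it in a variable named `answer`.\n\nQuestion: {question}\n"
)

def pot_self_consistency(question: str, llm, n: int = 10):
    results = []
    for _ in range(n):
        code = llm(POT_SC_PROMPT.format(question=question), temperature=0.7)  # diverse programs
        namespace: dict = {}
        try:
            exec(code, namespace)              # sandbox this in production
            results.append(repr(namespace.get("answer")))
        except Exception:
            continue                           # crashing programs don't get a vote
    return Counter(results).most_common(1)[0][0] if results else None
```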
Track: accuracy, latency (p50/p95/p99), cost per query, failure modes. A technique that's +5% accuracy but +1000% cost may not be worth it.
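A minimal instrumentation sketch; the field names, the flat per-token price, and the `(answer, tokens_used)` return convention are assumptions, not any particular provider's API:

```python
import statistics
import time
from dataclasses import dataclass, field

@dataclass
class ReasoningMetrics:
    latencies: list[float] = field(default_factory=list)
    costs: list[float] = field(default_factory=list)
    correct: list[bool] = field(default_factory=list)

    def record(self, fn, query: str, expected=None, usd_per_1k_tokens: float = 0.002):
        start = time.perf_counter()
        answer, tokens_used = fn(query)                    # fn returns (answer, token count)
        self.latencies.append(time.perf_counter() - start)
        self.costs.append(tokens_used / 1000 * usd_per_1k_tokens)
        if expected is not None:
            self.correct.append(answer == expected)
        return answer

    def summary(self) -> dict:
        q = statistics.quantiles(self.latencies, n=100)    # 99 cut points: p1..p99
        return {
            "accuracy": sum(self.correct) / len(self.correct) if self.correct else None,
            "latency_p50": q[49], "latency_p95": q[94], "latency_p99": q[98],
            "cost_per_query": statistics.fmean(self.costs),
        }
```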
The foundational paper that started it all. Adding few-shot examples with explicit step-by-step reasoning dramatically boosts performance on arithmetic, symbolic, and commonsense reasoning.
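For reference, a one-exemplar CoT prompt in the spirit of the paper; the exemplar wording here is illustrative:

```python
# The demonstration shows the intermediate steps, not just the final answer.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""
```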
Instead of greedy decoding, sample multiple diverse reasoning paths and take a majority vote. Turns CoT into an ensemble method with significant accuracy gains.
Decompose complex problems into simpler subproblems, solve from easiest to hardest. Each solution feeds into the next, enabling better generalization to harder problems.
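A two-stage sketch of the idea, assuming a hypothetical `llm` callable that returns plain text:

```python
def least_to_most(question: str, llm) -> str:
    # Stage 1: decomposition into subquestions ordered easiest -> hardest.
    subqs = llm(
        "Break this problem into simpler subquestions, ordered from easiest to hardest, "
        f"one per line:\n{question}"
    ).splitlines()
    # Stage 2: sequential solving; each answer becomes context for the next subquestion.
    context, answer = "", ""
    for sq in (s.strip() for s in subqs):
        if not sq:
            continue
        answer = llm(f"{context}Q: {sq}\nA:")
        context += f"Q: {sq}\nA: {answer}\n"
    return answer    # the answer to the final (hardest) subquestion
```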
Discovers that CoT reasoning can emerge naturally without demonstrations by decoding with alternative top-k tokens. The model already "knows" how to reason—you just need to find it.
The foundation of modern AI agents. Interleaves reasoning traces with actions (search, lookup, etc.), allowing models to plan, act, observe, and revise. Thought → Action → Observation loops.
Agents learn from failure by reflecting in natural language. After a failed attempt, the agent generates self-critique and stores it as memory for future tries. Verbal RL without weight updates.
Generalizes CoT to tree search. Generate multiple "thoughts," evaluate them, and explore/backtrack. Enables deliberate planning and lookahead that linear CoT cannot achieve.
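A breadth-first sketch of the search; `propose` and `score` wrap hypothetical LLM calls, and the branching numbers are illustrative:

```python
def tree_of_thoughts(problem: str, propose, score, depth: int = 3,
                     breadth: int = 5, keep: int = 2) -> str:
    frontier = [""]                                            # partial reasoning paths
    for _ in range(depth):
        candidates = [
            path + "\n" + thought
            for path in frontier
            for thought in propose(problem, path, n=breadth)   # expand each node
        ]
        # Evaluate and prune: keep only the most promising partial paths.
        frontier = sorted(candidates, key=lambda p: score(problem, p), reverse=True)[:keep]
    return frontier[0]                                         # best path found
```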
Uses an LLM to guide the tree exploration itself, rather than fixed BFS/DFS. The LLM learns which branches to explore, making search more efficient and targeted.
Instead of reasoning in natural language, generate Python code that performs the computation. Execute the code to get the answer. Separates "what to compute" from "how to compute."
Comprehensive survey covering the transition from "fast thinking" (System 1) to "slow, deliberate reasoning" (System 2) in LLMs. Covers construction methods, benchmarks, and open problems.
Focuses on the transition from LLMs to "Large Reasoning Models" through test-time scaling. Covers o1-style long CoT, automated data construction, and RL-based learning-to-reason.
Addresses the efficiency crisis in reasoning models. Organizes solutions into three axes: Shorter (compressed CoT), Smaller (distilled models), Faster (accelerated decoding).
Deep dive into formal logical reasoning: deductive, inductive, abductive, and analogical. Covers neuro-symbolic methods, RL approaches, and why LLMs still struggle with rigorous logic.
Examines reasoning through the lens of trustworthiness: truthfulness, safety, robustness, fairness, and privacy. Critical for deploying reasoning systems in production.
Can models reason without generating explicit CoT? This survey explores "implicit reasoning" via latent structures, trading off interpretability for efficiency.