Reasoning Models
A reasoning model is a language model trained (and often inference-tuned) to spend extended compute on internal thinking before answering. The breakthrough came in late 2024 with OpenAI's o1, followed by o3, DeepSeek R1, Claude with extended thinking, and Gemini 2 Thinking, and it became a major axis of capability gains in 2025–2026.
What changed
Pre-reasoning models (GPT-4, Claude 3, etc.) generated answers token-by-token, fast, with chain-of-thought (CoT) as an optional prompting trick.
Reasoning models:
- are trained to produce long internal CoT, sometimes thousands of tokens, before the final answer;
- often hide that internal CoT from users, showing only the final answer;
- expose test-time compute (how long the model "thinks") as a tunable knob;
- jumped categorically in quality on hard reasoning tasks (math, code, science), often crossing pass@1 thresholds that years of additional pretraining had not.
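The thinking knob is typically exposed as a request parameter. A minimal sketch following the shape of Anthropic's extended-thinking API (the model name is a placeholder; other providers use different parameters, e.g. an effort level rather than a token budget):

```json
{
  "model": "claude-sonnet-4-5",
  "max_tokens": 16000,
  "thinking": { "type": "enabled", "budget_tokens": 10000 },
  "messages": [
    { "role": "user", "content": "Prove that every finite integral domain is a field." }
  ]
}
```

Here `budget_tokens` caps the internal reasoning and counts toward `max_tokens`, so the overall limit must exceed the thinking budget.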
How they’re trained
Open methods (the most public picture is from DeepSeek-R1):
- Cold-start data: handcrafted CoT examples to bootstrap the format.
- Reinforcement learning with a reward model that scores final answers (e.g. correctness on math problems).
- Self-improvement: the model generates many candidate solutions; correct ones are kept and trained on.
- Distillation: the reasoning ability is distilled into smaller models via supervised training on reasoning traces.
The key trick: rewarding correct final answers, not the intermediate trace, lets the model discover its own CoT style. R1 famously developed “aha” moments and self-correction during training.
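The self-improvement loop can be sketched in a few lines. This is a toy illustration, not the R1 recipe: `sample_candidates` stands in for the model, the problems are trivial arithmetic, and the reward is pure outcome correctness.

```python
import random

def sample_candidates(problem, n):
    """Stand-in for sampling n candidate solutions from the model.
    Each 'solution' is a (trace, answer) pair; answers are mostly right."""
    a, b = problem
    candidates = []
    for _ in range(n):
        answer = a + b + random.choice([0, 0, 0, 1, -1])  # noisy "model"
        candidates.append((f"think: {a} + {b} = {answer}", answer))
    return candidates

def outcome_reward(problem, answer):
    """Reward only the final answer, not the intermediate trace."""
    return 1 if answer == sum(problem) else 0

def build_training_set(problems, n=8):
    """Rejection sampling: keep only traces whose final answer is correct,
    then (in a real pipeline) fine-tune on the kept traces."""
    kept = []
    for problem in problems:
        for trace, answer in sample_candidates(problem, n):
            if outcome_reward(problem, answer):
                kept.append((problem, trace))
    return kept

random.seed(0)
dataset = build_training_set([(2, 3), (10, 7)])
```

Because only the final answer is scored, the trace format is unconstrained, which is exactly what lets a model discover its own CoT style.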
Test-time compute scaling
A second axis of scaling beyond pretraining. A rough rule of thumb from early test-time scaling results: on a hard task, doubling test-time compute can match the gain from roughly a 10× increase in pretraining compute.
Mechanisms:
- Longer CoT: think for more tokens.
- Multiple samples + voting: generate n solutions, take the majority answer.
- Best-of-N with verifier: generate n solutions, score each, pick the highest.
- Tree search / MCTS: explore reasoning branches, prune the weak ones.
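The sampling-based mechanisms are easy to sketch. A toy illustration with stand-in samplers and verifiers (no real model involved):

```python
import random
from collections import Counter

def sample_answers(n, correct=42, p_correct=0.6):
    """Stand-in for n independent model samples; right with prob p_correct."""
    return [correct if random.random() < p_correct else random.randint(0, 100)
            for _ in range(n)]

def majority_vote(answers):
    """Self-consistency: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, verifier_score):
    """Best-of-N: score each candidate, return the highest-scoring one."""
    return max(answers, key=verifier_score)

random.seed(1)
samples = sample_answers(n=32)
print(majority_vote(samples))  # converges on the correct answer as n grows
```

Voting needs only a way to compare answers for equality; best-of-N additionally needs a verifier, which is why verifier quality becomes the bottleneck.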
For frontier benchmarks (AIME math, Codeforces-style coding), test-time compute is now the dominant lever.
When to use a reasoning model
| Use case | Reasoning model? |
|---|---|
| Math problem | Yes |
| Multi-step logic puzzle | Yes |
| Coding with non-trivial structure | Often |
| RAG over docs | Sometimes — depends on complexity |
| Simple Q&A | No (overkill, slow) |
| Extraction / classification | No |
| Chat / casual | No |
| Tool use with branching decisions | Usually yes |
| Agent loops | Often yes |
Reasoning models are slower and more expensive. They’re the right tool when the difficulty justifies the cost.
Visible vs hidden reasoning
Different providers handle the CoT differently:
- OpenAI o-series: hides the CoT entirely; you pay for the reasoning tokens but don't see them.
- DeepSeek R1: shows the CoT.
- Claude with extended thinking: shows reasoning trace (configurable).
- Gemini Thinking: shows reasoning.
For products, decide whether to expose reasoning to end users — it builds trust but can confuse, leak information, or bore.
Cost model
Reasoning models charge for the internal reasoning tokens, not just the visible output. A model that “thinks” for 10k tokens before producing 200 visible tokens charges for 10,200 tokens.
This makes reasoning model economics very different:
- Single hard problem: $0.10–$1.00 per query (vs $0.001 for a normal LLM call).
- Latency: 10–60s per query (vs 1–5s).
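The billing arithmetic is worth making explicit. A minimal sketch with a placeholder price (real per-token rates vary by provider and model):

```python
def query_cost_usd(thinking_tokens, visible_tokens, usd_per_mtok=15.0):
    """Bill hidden thinking tokens at the output rate.
    The $15/Mtok rate is a placeholder, not any provider's real price."""
    billed_tokens = thinking_tokens + visible_tokens
    return billed_tokens * usd_per_mtok / 1_000_000

# 10,000 thinking tokens + 200 visible tokens -> 10,200 billed tokens
print(query_cost_usd(10_000, 200))
```

The visible output is a rounding error here; the thinking budget dominates the bill, which is why capping it matters.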
Open-source reasoning models
By early 2026:
- DeepSeek R1 / R1-Distill family — open-weights, runs locally.
- Qwen3 with reasoning mode.
- Llama-4 reasoning variants.
- (For comparison, closed-weight models such as Claude Sonnet 4.6 / Opus 4.6 / 4.7 ship with extended thinking built in, but they are not open-source.)
- Marco-o1, OpenThinker, etc. — open reasoning research models.
The gap between open and closed reasoning models is small and shrinking.
Common patterns when integrating
- Two-tier routing: a fast model decides whether the question needs reasoning; route accordingly.
- Time budget: cap thinking tokens; fall back to a quick answer if exceeded.
- Human-readable summary: even when CoT is hidden, ask the model to summarize its reasoning at the end for the user.
- Cache aggressively: reasoning queries often repeat; cache by exact-match or semantic similarity.
- Eval differently: reasoning models often need eval prompts that allow long output.
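The first and fourth patterns (two-tier routing plus an exact-match cache) can be sketched together. Everything here is a stand-in: `needs_reasoning` would be a small classifier or fast LLM in practice, and the two `call_*` functions are hypothetical endpoints.

```python
from functools import lru_cache

# Hypothetical stand-ins for the two model endpoints.
def call_reasoning_model(q: str) -> str:
    return f"[reasoning] {q}"

def call_fast_model(q: str) -> str:
    return f"[fast] {q}"

HARD_MARKERS = ("prove", "optimize", "why does", "step by step", "complexity")

def needs_reasoning(question: str) -> bool:
    """Cheap triage heuristic; in production this would be a fast model."""
    q = question.lower()
    return any(m in q for m in HARD_MARKERS) or len(q.split()) > 40

@lru_cache(maxsize=4096)  # exact-match cache for repeated queries
def answer(question: str) -> str:
    if needs_reasoning(question):
        return call_reasoning_model(question)  # slow, expensive path
    return call_fast_model(question)           # fast path
```

A semantic-similarity cache would replace `lru_cache` with an embedding lookup, at the cost of occasional false hits.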
Quirks and pitfalls
- Overthinking: simple questions can derail into long unproductive CoT. Use a fast model for triage.
- CoT hallucination: a confident-sounding chain of thought can still produce wrong answers. Verify externally where possible.
- Format drift: a reasoning model's output style differs from a non-reasoning model's, so few-shot examples tuned for one may behave oddly with the other.
- Latency variance: a 10s p50 might be a 90s p99 on hard problems.
What’s next
- Reasoning + agents: o3-style models doing multi-step tool use natively.
- Better verifiers: process-level reward models that score each reasoning step, not just the answer.
- Self-play: models that reason against themselves to improve.
- Reasoning over code execution: tighter integration of computation with thinking.
Real-world case studies
- Field report: DeepSeek-R1 — pure-RL reasoning training with GRPO, the published “aha moment” emergence, and the response-distillation recipe into six smaller bases. The most concrete public reference for everything in this article.
See also
- Scaling laws — and test-time scaling
- Frontier architectures
- Stage 08 — Chain-of-thought
- Stage 11 — Agent planning