Reasoning Models

A reasoning model is a language model trained (and often inference-tuned) to spend extended compute on internal thinking before answering. Reasoning models were the breakthrough of late 2024 (OpenAI’s o1, then o3; DeepSeek R1; Claude with extended thinking; Gemini 2 Thinking) and remain a major axis of capability gains in 2025–2026.

What changed

Pre-reasoning models (GPT-4, Claude 3, etc.) answered immediately, token by token, with chain-of-thought (CoT) available only as an optional prompting trick.

Reasoning models:

  1. Are trained to produce long internal CoT — sometimes thousands of tokens — before the final answer.
  2. Often hide the internal CoT from users; only the final answer is shown.
  3. Expose test-time compute (how long the model “thinks”) as a tunable knob.
  4. Jumped categorically in quality on hard reasoning tasks (math, code, science), often crossing pass@1 thresholds that years of pretraining gains hadn’t.

How they’re trained

Open methods (the most public picture is from DeepSeek-R1):

  1. Cold-start data: handcrafted CoT examples to bootstrap the format.
  2. Reinforcement learning with a reward on final answers (e.g. verifiable correctness on math problems); for R1 this was rule-based checking rather than a learned reward model.
  3. Self-improvement: the model generates many candidate solutions; correct ones are kept and trained on.
  4. Distillation: the reasoning ability is distilled into smaller models via supervised training on reasoning traces.

The key trick: rewarding correct final answers, not the intermediate trace, lets the model discover its own CoT style. R1 famously developed “aha” moments and self-correction during training.

Test-time compute scaling

A second axis of scaling beyond pretraining. Roughly:

Doubling test-time compute on a hard task can match a 10× increase in pretraining compute.

Mechanisms:

  • Longer CoT: think for more tokens.
  • Multiple samples + voting: generate n solutions, take the majority answer.
  • Best-of-N with verifier: generate n candidates, score each, pick the highest (both sketched after this list).
  • Tree search / MCTS: explore reasoning branches, prune.

For frontier benchmarks (AIME math, Codeforces), test-time compute is now the dominant lever.

When to use a reasoning model

| Use case | Reasoning model? |
| --- | --- |
| Math problem | Yes |
| Multi-step logic puzzle | Yes |
| Coding with non-trivial structure | Often |
| RAG over docs | Sometimes (depends on complexity) |
| Simple Q&A | No (overkill, slow) |
| Extraction / classification | No |
| Chat / casual | No |
| Tool use with branching decisions | Usually yes |
| Agent loops | Often yes |

Reasoning models are slower and more expensive. They’re the right tool when the difficulty justifies the cost.

Visible vs hidden reasoning

Different providers handle the CoT differently:

  • OpenAI o-series: hides the CoT entirely; you pay tokens for it but don’t see it.
  • DeepSeek R1: shows the CoT.
  • Claude with extended thinking: shows the reasoning trace (configurable; see the example after this list).
  • Gemini Thinking: shows reasoning.

For products, decide whether to expose reasoning to end users — it builds trust but can confuse, leak information, or bore.

Cost model

Reasoning models charge for the internal reasoning tokens, not just the visible output. A model that “thinks” for 10k tokens before producing 200 visible tokens charges for 10,200 tokens.

This makes reasoning model economics very different:

  • Single hard problem: $0.10–$1.00 per query (vs $0.001 for a normal LLM call).
  • Latency: 10–60s per query (vs 1–5s).

Open-source reasoning models

By early 2026:

  • DeepSeek R1 / R1-Distill family — open-weights, runs locally.
  • Qwen3 with reasoning mode.
  • Llama-4 reasoning variants.
  • Claude Sonnet 4.6 / Opus 4.6 / 4.7 — reasoning is built in (extended thinking); closed weights, listed here as the baseline the open models chase.
  • Marco-o1, OpenThinker, etc. — open reasoning research models.

The gap between open and closed reasoning models is small and shrinking.

Common patterns when integrating

  1. Two-tier routing: a fast model decides whether the question needs reasoning; route accordingly (sketched after this list).
  2. Time budget: cap thinking tokens; fall back to a quick answer if exceeded.
  3. Human-readable summary: even when CoT is hidden, ask the model to summarize its reasoning at the end for the user.
  4. Cache aggressively: reasoning queries often repeat; cache by exact-match or semantic similarity.
  5. Eval differently: reasoning models often need eval prompts that allow long output.

Quirks and pitfalls

  • Overthinking: simple questions can derail into long unproductive CoT. Use a fast model for triage.
  • CoT hallucination: a confident-sounding chain of thought can still produce wrong answers. Verify externally where possible.
  • Format drift: the model’s output style during reasoning differs from non-reasoning. Few-shot examples may behave oddly.
  • Latency variance: a 10s p50 might be a 90s p99 on hard problems.

What’s next

  • Reasoning + agents: o3-style models doing multi-step tool use natively.
  • Better verifiers: process-level reward models that score each reasoning step, not just the answer.
  • Self-play: models that reason against themselves to improve.
  • Reasoning over code execution: tighter integration of computation with thinking.

Real-world case studies

  • Field report: DeepSeek-R1 — pure-RL reasoning training with GRPO, the published “aha moment” emergence, and the distillation recipe into six smaller base models. The most concrete public reference for everything in this article.

See also