Field report: DeepSeek-R1 — reasoning from pure RL, in the open

Field report. Observational study based on published sources. All claims cite the original paper or official releases. Inference is marked explicitly. As of 2026-05-01.

DeepSeek-R1 (January 2025) is the most consequential open-frontier-tier release of the reasoning era. It’s the closest thing to a public recipe for o1-class reasoning behavior, and it sits next to the curriculum’s /articles/07-modern-llms/reasoning-models article as the real-world worked example.

This is a field report, not a tutorial. We extract what the paper actually says, mark what’s still confidential, and stop where the paper stops.

What was released

  • DeepSeek-R1-Zero — base model (DeepSeek-V3-base, MoE, 671B total params, 37B active per token) trained with reinforcement learning only, no supervised fine-tuning at all.
  • DeepSeek-R1 — adds a small amount of cold-start SFT data, then multi-stage RL training. The shipping product.
  • Distilled variants — R1’s reasoning behavior distilled into six smaller bases from the Qwen2.5 (1.5B, 7B, 14B, 32B) and Llama (3.1-8B, 3.3-70B) families. Open weights, MIT-style license.
  • DeepSeek-R1 paper (arxiv:2501.12948, “Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”).

The headline claim: a base model trained with RL alone (R1-Zero) develops emergent reasoning capabilities — chain-of-thought, self-correction, longer thinking on harder problems — without ever being shown a reasoning trace. The paper documents a moment they call the “aha moment”: mid-rollout, the model spontaneously starts revising its earlier reasoning.

What the paper actually says

Five pieces of the paper map directly onto the curriculum:

1. GRPO instead of PPO

Per Section 2.2, training uses Group Relative Policy Optimization (GRPO), introduced in DeepSeek’s earlier DeepSeekMath paper. The simplification vs PPO:

  • PPO: needs a learned value model to estimate the baseline. Two networks to train, more memory, more compute.
  • GRPO: sample N responses per prompt, compute a reward for each, advantage = (reward − group mean) / group standard deviation. No value model.

In curriculum vocabulary (see /articles/10-fine-tuning/rlhf-dpo-grpo): GRPO trades a learned baseline for a sampled baseline. Cheaper, simpler, works well when rewards are verifiable (math/code where you can grade objectively).
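
A minimal sketch of that sampled baseline, assuming the normalized form from the DeepSeekMath paper (reward minus group mean, divided by group standard deviation); the function name and the degenerate-group handling are choices made here for illustration, not DeepSeek's code:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: each sampled response is scored against its own group.

    rewards: one scalar reward per response sampled for the same prompt.
    There is no learned value model; the group itself supplies the baseline.
    """
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = (sum((r - mu) ** 2 for r in rewards) / n) ** 0.5
    if sigma == 0.0:
        # Degenerate group: every response earned the same reward, so no signal.
        return [0.0] * n
    return [(r - mu) / sigma for r in rewards]

# Four sampled answers to one math prompt, graded 1.0 when the final answer matched:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

The per-group baseline is what removes the value network; the cost is that you need several samples per prompt to get a usable advantage estimate.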

2. Pure RL on a base model

R1-Zero is trained with no SFT at all. The base model is DeepSeek-V3-base. RL pushes it directly toward higher reward on math and code problems, where reward is computed from verified answers (numeric match for math, test-pass rate for code).
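
The reward itself is described only in prose (the grader code is not released; see “What’s still confidential” below), but the paper names two rule-based components: an accuracy reward and a format reward for keeping the reasoning inside <think> tags. A sketch of that shape, where the answer-extraction regex and the way the components are combined are assumptions:

```python
import re

def accuracy_reward(response, reference_answer):
    """1.0 if the extracted final answer matches the reference, else 0.0.

    For math the paper asks for the answer in a fixed format (e.g. boxed); for code
    it grades by running tests. The boxed-answer extraction here is illustrative only.
    """
    match = re.search(r"\\boxed\{(.+?)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(response):
    """1.0 if the reasoning is wrapped in <think>...</think> tags, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def total_reward(response, reference_answer):
    # How the two components are weighted is not specified; an even sum is a guess.
    return accuracy_reward(response, reference_answer) + format_reward(response)
```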

Per Section 2.2.4, the paper reports that during training, R1-Zero’s average response length grows substantially — from ~1k tokens early in training to ~10k tokens at convergence. This is the model learning to think longer on its own, not because anyone told it to. The paper presents this as primary evidence of emergent reasoning.

3. The “aha moment”

Section 2.2.4 of the paper documents a specific behavior they observed mid-training: in some rollouts, the model writes out a reasoning chain, then writes some variant of “wait — let me reconsider,” and revises. The paper calls this the “aha moment” and notes it emerged without explicit training signal for self-correction.

Inference: this is the same shape of behavior o1-class models exhibit. We can confirm it from the paper for R1; we cannot confirm the mechanism is identical in the closed o-series models. It is plausible the underlying training signal is similar, but unconfirmed.

4. Cold-start SFT + multi-stage RL (the shipping R1)

R1-Zero’s outputs are hard to read — pure RL produces reasoning that’s effective but not formatted for humans. Per Section 2.3, the shipping R1 adds:

  1. A small amount of “cold-start” SFT data — a few thousand reasoning traces, written or curated by humans, to nudge the format.
  2. RL stage 1 — same as R1-Zero, focused on reasoning.
  3. SFT stage 2 — rejection-sampled outputs from the RL checkpoint combined with general task data.
  4. RL stage 2 — broaden beyond math/code.

The paper gives approximate sample counts and a description of the data mix. The exact data is not released.
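
As a reading aid only, the four stages can be written down as data; the stage names and one-line summaries below paraphrase the paper and are not DeepSeek's configuration:

```python
# Hypothetical outline of the shipping-R1 pipeline; names and summaries are ours.
R1_PIPELINE = [
    ("cold_start_sft", "a few thousand curated reasoning traces to fix the output format"),
    ("reasoning_rl", "GRPO with verifiable rewards on math/code, as in R1-Zero"),
    ("sft_round_2", "rejection-sampled RL outputs combined with general task data"),
    ("rl_round_2", "a second RL pass that broadens beyond math/code"),
]

for name, summary in R1_PIPELINE:
    print(f"{name}: {summary}")
```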

5. Distillation into smaller students

Per Section 2.4, DeepSeek distilled R1 into six smaller students — Qwen and Llama variants. They generated ~800K training samples with R1 (the bulk of them reasoning traces), ran SFT on each base, and shipped the resulting models.

This is a published recipe — the most concrete one in the field for “take a reasoning model, distill it cheaply.” It maps directly onto /ship/17 and /case-studies/05. They use response distillation (SFT on generated traces), not logit distillation, for the same reasons Phi-3 does.
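
Because response distillation is just SFT on teacher-generated text, the whole recipe fits in a few lines. The sketch below uses Hugging Face trl-style names; the inline toy traces, the student choice, and the data format are assumptions, not DeepSeek's code:

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Step 1 (done offline with the teacher): sample reasoning traces from R1, keep only
# those whose final answer passes the verifiable grader, and store them as plain text.
# Two hand-written rows stand in for that corpus here.
traces = [
    {"text": "Problem: 2 + 2 = ?\n<think>2 plus 2 is 4.</think>\nAnswer: 4"},
    {"text": "Problem: 3 * 5 = ?\n<think>3 times 5 is 15.</think>\nAnswer: 15"},
]

# Step 2: ordinary supervised fine-tuning of a smaller open base on those traces.
# No teacher logits are used anywhere; that is what makes this response distillation.
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B",  # one open student base as an example; DeepSeek used six
    train_dataset=Dataset.from_list(traces),
    args=SFTConfig(output_dir="r1-distill-sketch"),
)
trainer.train()
```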

The distilled models, per the paper’s eval tables, retain the majority of R1’s reasoning ability. The 32B and 70B distilled models are within striking distance of R1 itself on AIME and MATH-500; the smaller ones lag more.

The recipe in curriculum language

| Step in /articles/10-fine-tuning | DeepSeek-R1 equivalent |
| --- | --- |
| Base model | DeepSeek-V3-base (MoE, 671B total / 37B active) |
| RL algorithm | GRPO (no value model) |
| Reward function | Verifiable: numeric match (math), test-pass (code) |
| Cold-start SFT | A few thousand hand-curated traces |
| Multi-stage post-training | RL → SFT → RL after cold start (4 stages total) |
| Distillation | Response distillation, ~800K traces, six student bases |
| Distillation flavor | SFT on generated traces (not logit KL) |

Verifiable rewards are the move that makes pure RL viable. The paper doesn’t claim GRPO would work for tasks without an objective grader; it explicitly scopes the result to math and code.

Reproducibility status

  • Compute. R1’s training is built on top of V3, whose published cost is approximately $5.6M at hourly rental rates (V3 technical report, December 2024). R1’s incremental RL training is reported at smaller scale; the paper does not state a single dollar figure, but the compute description corresponds to weeks on a several-hundred-H800 cluster.
  • Data. The cold-start data and full RL training prompts are not released.
  • Tooling. Standard. The DeepSeek codebase for inference is open; training code is partially documented.
  • Realistic for a frontier lab? Done.
  • Realistic for a well-funded startup? The full R1 reproduction, no — needs the V3-class base model. The distillation step (R1 → smaller open base), absolutely yes; the recipe is published.
  • Realistic for an academic group? The GRPO-on-small-model variant has been reproduced by several open-source efforts (Open-R1, others). Replicating R1 itself is not yet feasible at academic compute budgets.
  • Realistic for a hobbyist? Run the distilled 1.5B or 7B locally — yes, immediately. Train one yourself — no.
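
For the last point, a minimal sketch of running the smallest distill locally with the transformers pipeline API; the model id is the official Hub checkpoint, everything else (prompt, token budget) is arbitrary:

```python
from transformers import pipeline

# The 1.5B distill runs on a single consumer GPU, or slowly on CPU.
generate = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

messages = [{"role": "user", "content": "What is 17 * 24? Show your reasoning."}]
out = generate(messages, max_new_tokens=1024)  # leave room for the <think> block
print(out[0]["generated_text"])
```

Quantized GGUF ports of the same checkpoints also exist for llama.cpp and Ollama if you would rather avoid Python.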

What’s still confidential

  • Full RL training data (the math/code problem set).
  • The specific cold-start SFT examples.
  • Reward-function code (only the high-level shape is described).
  • Infrastructure details beyond rough cluster scale.
  • Many hyperparameters for the multi-stage pipeline.

The paper is one of the most open frontier-tier releases ever. It’s still not a recipe you can run end-to-end without reconstruction.

What’s changed since

  • Open replications — Open-R1 (HuggingFace) and others have reproduced the GRPO-on-small-model recipe, validating the algorithm. Full R1 reproduction is still ongoing as of this writing.
  • DeepSeek-R1-Distill family adoption — the distilled checkpoints (especially Qwen-32B-distill and Llama-70B-distill) have become standard baselines for “reasoning at home.”
  • GRPO is now mainstream — TRL and other RLHF libraries have first-class GRPO support; it’s the default for verifiable-reward training in 2026.
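
For scale, a minimal sketch of what GRPO training looks like in TRL; the API shape below matches recent TRL releases but may drift across versions, and the toy reward and dataset are stand-ins, not R1's setup:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy verifiable-style reward: favor completions that contain a boxed final answer.
# A real setup grades against a reference answer or a test suite instead.
def reward_boxed_answer(completions, **kwargs):
    return [1.0 if "\\boxed{" in completion else 0.0 for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works for a demo

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_boxed_answer,
    args=GRPOConfig(output_dir="grpo-demo"),
    train_dataset=dataset,
)
trainer.train()
```

The reward_funcs argument is where the verifiable grader plugs in; swap the toy function for a real answer checker and the rest of the loop stays the same.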

What this teaches you

Read this with /articles/07-modern-llms/reasoning-models, /articles/10-fine-tuning/rlhf-dpo-grpo, and /articles/10-fine-tuning/distillation:

  1. Reasoning behavior is trainable from RL alone, when the reward is verifiable. R1-Zero is the existence proof.
  2. GRPO is now the default for verifiable-reward RL. Simpler than PPO, no value model, works at scale.
  3. The “aha moment” is real and reproducible. It’s documented in a published paper with traces, not a marketing claim.
  4. Distillation of reasoning works. The published recipe (~800K traces, SFT on a smaller base) is the most concrete reasoning-distillation evidence in the field.
  5. You can use the distilled R1 students today. They’re MIT-licensed open weights. The 32B variant is competitive with the closed o-series on math benchmarks.

Further reading

Reasoning models are a 2024–2025 phenomenon, so book-length references on them specifically don’t exist yet. These are the foundations under the DeepSeek-R1 paper.

  • “Reinforcement Learning: An Introduction” by Richard Sutton and Andrew Barto (MIT Press, 2nd ed., 2018) — the canonical RL textbook. Free PDF authorized by the authors. Chapter 13 (policy gradient) and the surrounding material are the math behind PPO and GRPO. Without this, GRPO is a black box.
  • “Deep Reinforcement Learning Hands-On” by Maxim Lapan (Packt, 3rd ed., 2024) — the practical companion. PyTorch implementations of policy-gradient methods, including PPO. GRPO is a recent enough variant that it likely isn’t in book form anywhere yet.
  • “AI Engineering” by Chip Huyen (O’Reilly, 2024) — the production-deployment angle: how to evaluate and serve reasoning models, how to budget for their token cost.
  • The companion paper DeepSeekMath (arxiv:2402.03300) — where GRPO was introduced. Read it before R1 if you want to understand the algorithm in its original setting.
  • “Spinning Up in Deep RL” (free, OpenAI) — older, but the policy-gradient explanations are still some of the clearest available.

See also