RLHF, DPO, GRPO — Preference and Reward Training

SFT teaches the model to imitate. Preference and reward methods teach the model to prefer certain outputs over others, or to score high on a reward signal. This is what turns a competent text generator into a usable assistant.

RLHF (Reinforcement Learning from Human Feedback)

The original recipe (Christiano et al. 2017, Stiennon et al. 2020, Ouyang et al. 2022 / InstructGPT).

Three stages

Stage 1 — SFT. Train a model on (prompt, ideal_response) pairs.

Stage 2 — Reward model. Collect pairs of responses to the same prompt; humans (or an LLM) say which is better. Train a model R(prompt, response) → ℝ to score responses.
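
The reward model in stage 2 is typically trained with a Bradley–Terry pairwise loss: push the chosen response's score above the rejected one's. A minimal sketch (function and variable names are illustrative, not from a specific library):

import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen / r_rejected: scalar scores R(prompt, response) for each pair in the batch.
    # Bradley–Terry: maximize log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage: scores come from a scalar-head model run on prompt + response text, e.g.
# loss = reward_model_loss(rm(prompts, chosen), rm(prompts, rejected))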

Stage 3 — RL fine-tuning (PPO). Use PPO to maximize reward R(prompt, model(prompt)) while staying close to the SFT model (KL penalty):

maximize  E[R(p, y) − β · KL(π_θ(·|p) || π_SFT(·|p))]

The KL penalty is critical — without it, the model “reward-hacks” by producing degenerate text that scores high.
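
In practice the KL term is usually folded into the per-token reward that PPO sees. A rough sketch of that shaping step (variable names are assumptions, not tied to a specific library):

import torch

def kl_shaped_rewards(rm_score: torch.Tensor,
                      logprobs_policy: torch.Tensor,
                      logprobs_sft: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    # logprobs_*: per-token log-probs of the sampled response, shape (T,).
    # Each token is penalized by its contribution to KL(pi_theta || pi_SFT) ...
    per_token = -beta * (logprobs_policy - logprobs_sft)
    # ... and the scalar reward-model score is added on the final token.
    per_token[-1] = per_token[-1] + rm_score
    return per_token  # PPO consumes this as the per-token reward sequence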

The pain of PPO

PPO is:

  • Sample-inefficient (lots of rollouts).
  • Memory-heavy (need value model, reward model, reference model in addition to the policy).
  • Hyperparameter-sensitive.
  • Hard to debug.

Doing classic RLHF requires a small ML team. This is why DPO became popular.

DPO (Direct Preference Optimization)

Rafailov et al. (2023). The clever trick: you can skip the explicit reward model. Under the KL-constrained RLHF objective, the reward is recoverable in closed form as a log-ratio between the policy and the reference model, so you can train directly on preference pairs:

Loss = −E_(p,y_w,y_l)~D [ log σ( β · (log π_θ(y_w|p) − log π_ref(y_w|p))
                                  − β · (log π_θ(y_l|p) − log π_ref(y_l|p)) ) ]

In English: push up the log-probability of the preferred response and push down that of the rejected one, where each response is measured by its log-ratio against a frozen reference model and β controls how hard the margin is pushed.
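
The loss translates almost line-for-line into code. A minimal sketch, assuming each input is the sum of per-token log-probs of a response given its prompt (not tied to any particular library):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument: tensor of shape (batch,) with summed response log-probs.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_l
    # Maximize the margin between the chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()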

Why DPO works

  • No reward model.
  • No PPO.
  • No rollouts — just supervised-style training on preference pairs.
  • Closed-form, stable, easy.

DPO recipe

Data: {(prompt, chosen, rejected)} pairs. Often constructed from:

  • Human preferences.
  • LLM-generated preferences (RLAIF).
  • Heuristic rules (“the response that includes citations wins”).
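
However the pairs are constructed, a single record ends up looking like this (content is illustrative; the prompt/chosen/rejected field names are what TRL's DPOTrainer expects):

dpo_record = {
    "prompt": "Explain what a KL penalty does in RLHF.",
    "chosen": "The KL penalty keeps the fine-tuned policy close to the reference model, which limits reward hacking...",
    "rejected": "KL penalty good. Model stay same. Reward go up.",
}

The minimal TRL training loop below consumes a dataset of such records (the model name and dataset variable are placeholders):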

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# Start from your SFT checkpoint; the reference copy stays frozen.
model_name = "your-org/your-sft-model"          # placeholder
base_model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config = DPOConfig(
    output_dir="./dpo_out",
    beta=0.1,                    # KL strength
    learning_rate=5e-7,          # very low!
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=base_model,
    ref_model=ref_model,         # frozen reference (often base or SFT)
    args=config,
    train_dataset=dpo_dataset,   # dataset with prompt/chosen/rejected columns
    tokenizer=tokenizer,         # recent TRL versions take processing_class= instead
)
trainer.train()

Hyperparameters:

  • β (beta): KL strength. Higher = stays closer to ref model. Common: 0.1–0.5.
  • Learning rate: very low, 1e-7 to 1e-6 for full DPO; higher for LoRA-DPO (see the sketch after this list).
  • Epochs: usually 1. More = overfit.
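
For the LoRA-DPO case above, TRL accepts a peft_config so you don't fine-tune the full model; with a PEFT adapter, ref_model can be left as None and the frozen base weights serve as the reference. A sketch (rank, target modules, and learning rate are illustrative):

from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adjust to your architecture
    task_type="CAUSAL_LM",
)

lora_config = DPOConfig(
    output_dir="./dpo_lora_out",
    beta=0.1,
    learning_rate=5e-6,          # noticeably higher than the full-fine-tune rate
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=base_model,            # same base model as before
    ref_model=None,              # with PEFT, the adapter-free base acts as the reference
    args=lora_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()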

Variants

  • IPO: Identity Preference Optimization. Variation of DPO with a different loss.
  • KTO: Kahneman-Tversky Optimization. Uses single positive/negative labels instead of pairs.
  • ORPO: Odds Ratio Preference Optimization. Combines SFT and preference into one loss.
  • SimPO: Simple Preference Optimization. Drops the reference model.

For most teams: start with DPO. Move to ORPO if you want to skip the SFT step. KTO if your data is pointwise rather than pairwise.

GRPO (Group Relative Policy Optimization)

Introduced by DeepSeek in the DeepSeekMath paper (2024); later the engine behind R1’s reasoning training.

The idea

For a given prompt, generate N candidate responses at high temperature. Score each (e.g. with a reward model or a verifier — for math, that’s “is the answer correct?”). Compute the relative advantage of each response within the group:

A_i = (R_i − mean(R)) / std(R)

Update the policy to increase the probability of high-advantage responses, decrease low-advantage ones.

Loss = −E_(p, y_i)~rollouts [ A_i · log π_θ(y_i | p) − β · KL(π_θ || π_ref) ]
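
A stripped-down sketch of the group-relative advantage and the resulting update (this omits GRPO's PPO-style clipping and treats each response's log-prob as a single summed value; names are illustrative):

import torch

def grpo_loss(rewards, logps_policy, logps_ref, beta=0.04, eps=1e-6):
    # rewards: shape (N,) - one scalar per sampled response for the same prompt.
    # logps_*: shape (N,) - summed log-prob of each response under policy / reference.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)   # A_i
    policy_term = -(advantages * logps_policy).mean()                 # push up high-advantage samples
    kl_term = (logps_policy - logps_ref).mean()                       # crude KL(pi_theta || pi_ref) estimate
    return policy_term + beta * kl_term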

Why it’s a big deal

  • No separate value model (PPO needs one).
  • Group baseline = built-in variance reduction.
  • Works beautifully with verifiable rewards (math correct/incorrect, code passes tests).
  • The training that produced o1-style reasoning behavior in R1.

For tasks with clear correctness signals, GRPO is the go-to RL method in 2026.

When to use GRPO

  • Math problem-solving (AIME, MATH benchmarks).
  • Code (correctness checked against tests).
  • Logic/reasoning tasks with verifiable answers.
  • Any task with a programmable reward function (a toy verifier is sketched below).

For free-form chat or alignment, DPO/RLHF still wins because human-style preferences don’t reduce to a verifier.
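
A "programmable reward function" can be as small as an exact-match check on a parsed final answer. A toy verifier for math-style tasks (the "Answer: ..." output convention is an assumption):

import re

def math_reward(response: str, gold_answer: str) -> float:
    # Assumes the model is prompted to finish with a line like "Answer: 42".
    match = re.search(r"Answer:\s*(-?[\d.,/]+)", response)
    if match is None:
        return 0.0                                   # unparseable -> no reward
    predicted = match.group(1).strip().rstrip(".")
    return 1.0 if predicted == gold_answer.strip() else 0.0

# Scores like these can be stacked into the rewards tensor used by the GRPO sketch above.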

RLAIF (RL from AI Feedback)

Use a strong LLM (Claude, GPT-4) as the labeler instead of humans. Cheaper, faster, scales. Quality often comparable to RLHF for many tasks.

The Constitutional AI variant (Anthropic): the LLM critiques its own outputs against a set of principles, generating preference data. Used to align Claude.
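
A minimal sketch of the RLAIF labeling step: wrap any strong LLM behind a judge callable and emit records in the same prompt/chosen/rejected format used for DPO (the judge prompt and callable are illustrative, not a specific vendor API):

def label_preference(judge, prompt, response_a, response_b):
    # judge: any callable str -> str, e.g. a call to a strong LLM.
    question = (
        "Which response better answers the user?\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Reply with exactly 'A' or 'B'."
    )
    verdict = judge(question).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}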

Reward hacking

Any RL-style method risks reward hacking — the model finds ways to score high without doing what you actually wanted.

Examples:

  • Model adds “I am a helpful assistant” to every response (the reward model loves it).
  • Model writes long responses (reward correlates with length, accidentally).
  • Model uses specific buzzwords learned to please the reward model.

Mitigations:

  • KL penalty (built into all these methods) keeps the model close to a reference.
  • Eval beyond the reward model: check actual user-facing quality (a quick length-bias check is sketched after this list).
  • Reward models with calibration regularization to stay grounded.
  • Periodic retraining of the reward model on fresh preferences.
  • For GRPO with verifiers: choose verifiers carefully so they can’t be gamed (e.g. test cases must be hidden).
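
One cheap diagnostic for the length-correlation failure mode above: measure how strongly reward-model scores track response length on a held-out set (a sketch; what counts as "too correlated" is a judgment call):

import numpy as np

def length_reward_correlation(responses, rewards):
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Pearson correlation; values close to 1.0 suggest the reward model is mostly scoring length.
    return float(np.corrcoef(lengths, rewards)[0, 1])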

Process reward models (PRMs)

For multi-step reasoning, score each intermediate step, not just the final answer. Used to train models to follow good reasoning paths, not just produce correct final answers.
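
A sketch of how per-step PRM scores might be aggregated into one score for a reasoning trace; taking the minimum over steps (a trace is only as good as its weakest step) is one common choice. The prm callable is an assumption:

def score_trace(prm, prompt, steps):
    # prm: callable (prompt, steps_so_far) -> float in [0, 1], scoring the latest step.
    step_scores = [prm(prompt, steps[: i + 1]) for i in range(len(steps))]
    return min(step_scores), step_scores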

OpenAI’s o1 and other frontier reasoning efforts are believed to draw on PRM-style ideas to varying degrees; DeepSeek’s R1 report experimented with PRMs but ultimately leaned on verifiable outcome rewards.

Comparison table

Method      | Training data                | Compute    | Use case
SFT         | (prompt, response)           | Low        | Format, voice, baseline behavior
RLHF (PPO)  | reward model + rollouts      | High       | Alignment, helpfulness
DPO         | (prompt, chosen, rejected)   | Low        | Alignment, simpler than RLHF
KTO         | (prompt, response, label)    | Low        | Pointwise feedback data
ORPO        | (prompt, chosen, rejected)   | Low        | Combine SFT + DPO in one stage
GRPO        | reward function or verifier  | Medium     | Reasoning with verifiable rewards
RLAIF       | LLM-judged preferences       | Low–medium | Cheaper alignment when data scarce

Practical recipe (early 2026)

For most teams wanting to align an open-source model:

  1. SFT on 1k–100k high-quality instruction examples.
  2. DPO on 1k–10k preference pairs (LLM-judged or human-labeled).
  3. (Optional) GRPO if you have verifiable tasks.
  4. Eval, ship.

Skip stage 1 if you start from a chat-tuned model. Skip stage 2 if your task doesn’t need preference alignment. GRPO is heavy and only worth it for hard reasoning tasks.

Real-world case studies

  • Field report: Llama 3 — Meta’s published 92-page post-training paper; iterative DPO with rejection sampling at frontier scale. The real-world worked example for everything in this article.
  • Field report: DeepSeek-R1 — GRPO at frontier scale with verifiable rewards; pure-RL reasoning training. The published recipe for what o1-class models look like from the open side.

See also