RLHF, DPO, GRPO — Preference and Reward Training

SFT teaches the model to imitate. Preference and reward methods teach the model to prefer certain outputs over others, or to score high on a reward signal. This is what turns a competent text generator into a usable assistant.

RLHF (Reinforcement Learning from Human Feedback)

The original recipe (Christiano et al. 2017, Stiennon et al. 2020, Ouyang et al. 2022 / InstructGPT).

Three stages

Stage 1 — SFT. Train a model on (prompt, ideal_response) pairs.

Stage 2 — Reward model. Collect pairs of responses to the same prompt; humans (or an LLM) say which is better. Train a model R(prompt, response) → ℝ to score responses.
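
The reward model in stage 2 is typically trained with a Bradley–Terry pairwise loss: push the chosen response's score above the rejected one's. A minimal sketch (function and variable names are illustrative, not from a specific library):

import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen / r_rejected: scalar scores R(prompt, response) for each pair in the batch.
    # Bradley–Terry: maximize log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage: scores come from a scalar-head model run on prompt + response text, e.g.
# loss = reward_model_loss(rm(prompts, chosen), rm(prompts, rejected))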

Stage 3 — RL fine-tuning (PPO). Use PPO to maximize reward R(prompt, model(prompt)) while staying close to the SFT model (KL penalty):

maximize  E[R(p, y) − β · KL(π_θ(·|p) || π_SFT(·|p))]

The KL penalty is critical — without it, the model “reward-hacks” by producing degenerate text that scores high.
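
In practice the KL term is usually folded into the per-token reward that PPO sees. A rough sketch of that shaping step (variable names are assumptions, not tied to a specific library):

import torch

def kl_shaped_rewards(rm_score: torch.Tensor,
                      logprobs_policy: torch.Tensor,
                      logprobs_sft: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    # logprobs_*: per-token log-probs of the sampled response, shape (T,).
    # Each token is penalized by its contribution to KL(pi_theta || pi_SFT) ...
    per_token = -beta * (logprobs_policy - logprobs_sft)
    # ... and the scalar reward-model score is added on the final token.
    per_token[-1] = per_token[-1] + rm_score
    return per_token  # PPO consumes this as the per-token reward sequence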

The pain of PPO

PPO is:

  • Sample-inefficient (lots of rollouts).
  • Memory-heavy (need value model, reward model, reference model in addition to the policy).
  • Hyperparameter-sensitive.
  • Hard to debug.

Doing classic RLHF requires a small ML team. This is why DPO became popular.

DPO (Direct Preference Optimization)

Rafailov et al. (2023). The clever trick: you can skip the explicit reward model. Under the KL-constrained RLHF objective, the reward is recoverable in closed form as a log-ratio between the policy and the reference model, so you can train directly on preference pairs:

Loss = −E_(p,y_w,y_l)~D [ log σ( β · (log π_θ(y_w|p) − log π_ref(y_w|p))
                                  − β · (log π_θ(y_l|p) − log π_ref(y_l|p)) ) ]

In English: push up the log-probability of the preferred response and push down that of the rejected one, where each response is measured by its log-ratio against a frozen reference model and β controls how hard the margin is pushed.
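
The loss translates almost line-for-line into code. A minimal sketch, assuming each input is the sum of per-token log-probs of a response given its prompt (not tied to any particular library):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument: tensor of shape (batch,) with summed response log-probs.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_l
    # Maximize the margin between the chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()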

Why DPO works

  • No reward model.
  • No PPO.
  • No rollouts — just supervised-style training on preference pairs.
  • Closed-form, stable, easy.

DPO recipe

Data: {(prompt, chosen, rejected)} pairs. Often constructed from:

  • Human preferences.
  • LLM-generated preferences (RLAIF).
  • Heuristic rules (“the response that includes citations wins”).
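
However the pairs are constructed, a single record ends up looking like this (content is illustrative; the prompt/chosen/rejected field names are what TRL's DPOTrainer expects):

dpo_record = {
    "prompt": "Explain what a KL penalty does in RLHF.",
    "chosen": "The KL penalty keeps the fine-tuned policy close to the reference model, which limits reward hacking...",
    "rejected": "KL penalty good. Model stay same. Reward go up.",
}

The minimal TRL training loop below consumes a dataset of such records (the model name and dataset variable are placeholders):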

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# Start from your SFT checkpoint; the reference copy stays frozen.
model_name = "your-org/your-sft-model"          # placeholder
base_model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config = DPOConfig(
    output_dir="./dpo_out",
    beta=0.1,                    # KL strength
    learning_rate=5e-7,          # very low!
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=base_model,
    ref_model=ref_model,         # frozen reference (often base or SFT)
    args=config,
    train_dataset=dpo_dataset,   # dataset with prompt/chosen/rejected columns
    tokenizer=tokenizer,         # recent TRL versions take processing_class= instead
)
trainer.train()

Hyperparameters:

  • β (beta): KL strength. Higher = stays closer to ref model. Common: 0.1–0.5.
  • Learning rate: very low, 1e-7 to 1e-6 for full DPO; higher for LoRA-DPO (see the sketch after this list).
  • Epochs: usually 1. More = overfit.
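
For the LoRA-DPO case above, TRL accepts a peft_config so you don't fine-tune the full model; with a PEFT adapter, ref_model can be left as None and the frozen base weights serve as the reference. A sketch (rank, target modules, and learning rate are illustrative):

from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adjust to your architecture
    task_type="CAUSAL_LM",
)

lora_config = DPOConfig(
    output_dir="./dpo_lora_out",
    beta=0.1,
    learning_rate=5e-6,          # noticeably higher than the full-fine-tune rate
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=base_model,            # same base model as before
    ref_model=None,              # with PEFT, the adapter-free base acts as the reference
    args=lora_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()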

Variants

  • IPO: Identity Preference Optimization. Variation of DPO with a different loss.
  • KTO: Kahneman-Tversky Optimization. Uses single positive/negative labels instead of pairs.
  • ORPO: Odds Ratio Preference Optimization. Combines SFT and preference into one loss.
  • SimPO: Simple Preference Optimization. Drops the reference model.

For most teams: start with DPO. Move to ORPO if you want to skip the SFT step. KTO if your data is pointwise rather than pairwise.

GRPO (Group Relative Policy Optimization)

Introduced by DeepSeek in the DeepSeekMath paper (2024); later the engine behind R1’s reasoning training.

The idea

For a given prompt, generate N candidate responses at high temperature. Score each (e.g. with a reward model or a verifier — for math, that’s “is the answer correct?”). Compute the relative advantage of each response within the group:

A_i = (R_i − mean(R)) / std(R)

Update the policy to increase the probability of high-advantage responses, decrease low-advantage ones.

Loss = −E_(p, y_i)~rollouts [ A_i · log π_θ(y_i | p) − β · KL(π_θ || π_ref) ]
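
A stripped-down sketch of the group-relative advantage and the resulting update (this omits GRPO's PPO-style clipping and treats each response's log-prob as a single summed value; names are illustrative):

import torch

def grpo_loss(rewards, logps_policy, logps_ref, beta=0.04, eps=1e-6):
    # rewards: shape (N,) - one scalar per sampled response for the same prompt.
    # logps_*: shape (N,) - summed log-prob of each response under policy / reference.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)   # A_i
    policy_term = -(advantages * logps_policy).mean()                 # push up high-advantage samples
    kl_term = (logps_policy - logps_ref).mean()                       # crude KL(pi_theta || pi_ref) estimate
    return policy_term + beta * kl_term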

Why it’s a big deal

  • No separate value model (PPO needs one).
  • Group baseline = built-in variance reduction.
  • Works beautifully with verifiable rewards (math correct/incorrect, code passes tests).
  • The training that produced o1-style reasoning behavior in R1.

For tasks with clear correctness signals, GRPO is the go-to RL method in 2026.

When to use GRPO

  • Math problem-solving (AIME, MATH benchmarks).
  • Code (correctness checked against tests).
  • Logic/reasoning tasks with verifiable answers.
  • Any task with a programmable reward function (a toy verifier is sketched below).

For free-form chat or alignment, DPO/RLHF still wins because human-style preferences don’t reduce to a verifier.
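
A "programmable reward function" can be as small as an exact-match check on a parsed final answer. A toy verifier for math-style tasks (the "Answer: ..." output convention is an assumption):

import re

def math_reward(response: str, gold_answer: str) -> float:
    # Assumes the model is prompted to finish with a line like "Answer: 42".
    match = re.search(r"Answer:\s*(-?[\d.,/]+)", response)
    if match is None:
        return 0.0                                   # unparseable -> no reward
    predicted = match.group(1).strip().rstrip(".")
    return 1.0 if predicted == gold_answer.strip() else 0.0

# Scores like these can be stacked into the rewards tensor used by the GRPO sketch above.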

RLAIF (RL from AI Feedback)

Use a strong LLM (Claude, GPT-4) as the labeler instead of humans. Cheaper, faster, scales. Quality often comparable to RLHF for many tasks.

The Constitutional AI variant (Anthropic): the LLM critiques its own outputs against a set of principles, generating preference data. Used to align Claude.
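
A minimal sketch of the RLAIF labeling step: wrap any strong LLM behind a judge callable and emit records in the same prompt/chosen/rejected format used for DPO (the judge prompt and callable are illustrative, not a specific vendor API):

def label_preference(judge, prompt, response_a, response_b):
    # judge: any callable str -> str, e.g. a call to a strong LLM.
    question = (
        "Which response better answers the user?\n\n"
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Reply with exactly 'A' or 'B'."
    )
    verdict = judge(question).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}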

Reward hacking

Any RL-style method risks reward hacking — the model finds ways to score high without doing what you actually wanted.

Examples:

  • Model adds “I am a helpful assistant” to every response (the reward model loves it).
  • Model writes long responses (reward correlates with length, accidentally).
  • Model uses specific buzzwords learned to please the reward model.

Mitigations:

  • KL penalty (built into all these methods) keeps the model close to a reference.
  • Eval beyond the reward model: check actual user-facing quality (a quick length-bias check is sketched after this list).
  • Reward models with calibration regularization to stay grounded.
  • Periodic retraining of the reward model on fresh preferences.
  • For GRPO with verifiers: choose verifiers carefully so they can’t be gamed (e.g. test cases must be hidden).
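
One cheap diagnostic for the length-correlation failure mode above: measure how strongly reward-model scores track response length on a held-out set (a sketch; what counts as "too correlated" is a judgment call):

import numpy as np

def length_reward_correlation(responses, rewards):
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Pearson correlation; values close to 1.0 suggest the reward model is mostly scoring length.
    return float(np.corrcoef(lengths, rewards)[0, 1])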

Process reward models (PRMs)

For multi-step reasoning, score each intermediate step, not just the final answer. Used to train models to follow good reasoning paths, not just produce correct final answers.
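
A sketch of how per-step PRM scores might be aggregated into one score for a reasoning trace; taking the minimum over steps (a trace is only as good as its weakest step) is one common choice. The prm callable is an assumption:

def score_trace(prm, prompt, steps):
    # prm: callable (prompt, steps_so_far) -> float in [0, 1], scoring the latest step.
    step_scores = [prm(prompt, steps[: i + 1]) for i in range(len(steps))]
    return min(step_scores), step_scores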

OpenAI’s o1 and other frontier reasoning efforts are believed to draw on PRM-style ideas to varying degrees; DeepSeek’s R1 report experimented with PRMs but ultimately leaned on verifiable outcome rewards.

Comparison table

Method      | Training data                | Compute    | Use case
SFT         | (prompt, response)           | Low        | Format, voice, baseline behavior
RLHF (PPO)  | reward model + rollouts      | High       | Alignment, helpfulness
DPO         | (prompt, chosen, rejected)   | Low        | Alignment, simpler than RLHF
KTO         | (prompt, response, label)    | Low        | Pointwise feedback data
ORPO        | (prompt, chosen, rejected)   | Low        | Combine SFT + DPO in one stage
GRPO        | reward function or verifier  | Medium     | Reasoning with verifiable rewards
RLAIF       | LLM-judged preferences       | Low–medium | Cheaper alignment when data scarce

Practical recipe (early 2026)

For most teams wanting to align an open-source model:

  1. SFT on 1k–100k high-quality instruction examples.
  2. DPO on 1k–10k preference pairs (LLM-judged or human-labeled).
  3. (Optional) GRPO if you have verifiable tasks.
  4. Eval, ship.

Skip stage 1 if you start from a chat-tuned model. Skip stage 2 if your task doesn’t need preference alignment. GRPO is heavy and only worth it for hard reasoning tasks.

Real-world case studies

  • Field report: Llama 3 — Meta’s published 92-page post-training paper; iterative DPO with rejection sampling at frontier scale. The real-world worked example for everything in this article.
  • Field report: DeepSeek-R1 — GRPO at frontier scale with verifiable rewards; pure-RL reasoning training. The published recipe for what o1-class models look like from the open side.

See also