Field report: Llama 3 — frontier post-training, in 92 pages

Field report. Observational study based on published sources. All claims cite the original paper or official Meta releases. Inference is marked explicitly. As of 2026-05-01.

“The Llama 3 Herd of Models” (Meta, July 2024) is the most detailed frontier-model paper ever published. 92 pages, full pretraining, full post-training, full safety stack, with ablations. It’s the closest thing to a public post-training cookbook the field has, and it sits next to the curriculum’s /articles/10-fine-tuning/rlhf-dpo-grpo as the real-world worked example.

This is a field report, not a tutorial. The paper is sprawling; this article is the “what to actually pay attention to” guide for someone reading it after the curriculum.

What was released

  • Llama 3.1 family: 8B, 70B, 405B open-weight models. Llama 3 license (open-weight with use restrictions). July 2024.
  • Llama 3.2 (vision + smaller) and Llama 3.3 (70B updated) — incremental releases on the same family.
  • “The Llama 3 Herd of Models” (arxiv:2407.21783) — 92 pages, the canonical reference.

The headline claim: a 405B open-weight model competitive with closed frontier models of the era (GPT-4o, Claude 3.5 Sonnet) on standard benchmarks, released with the recipe — pretraining data shape, post-training pipeline, safety stack, ablations.

What the paper actually says

The paper is huge. Five things are worth pulling out for curriculum readers:

1. Pretraining at curriculum scale

Per Section 3, Llama 3 pretrains on 15.6T tokens of text, mostly English. The data composition is discussed in broad strokes — proportions for code, multilingual, math, reasoning — but the exact corpus is not released. Per /articles/07-modern-llms/scaling-laws, this is roughly double the classic Chinchilla-optimal token budget for a 405B model (and far beyond it for the 8B and 70B); Llama 3 deliberately overtrains for inference-time efficiency.
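
A quick back-of-the-envelope version of that claim, using the rough 20-tokens-per-parameter Chinchilla heuristic (the heuristic comes from the curriculum article, not from the paper's own scaling-law fit, so treat it as an approximation):

```python
# Back-of-the-envelope check, assuming the classic Chinchilla heuristic of
# ~20 training tokens per parameter (an approximation, not Meta's calculation).
params = 405e9                      # Llama 3 405B parameters
tokens = 15.6e12                    # reported pretraining tokens

chinchilla_budget = 20 * params     # ~8.1e12 tokens under the heuristic
print(tokens / params)              # ~38.5 tokens per parameter
print(tokens / chinchilla_budget)   # ~1.9x the heuristic's optimum for 405B
print(tokens / (20 * 8e9))          # ~97x for the 8B model: heavy overtraining
```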

2. The full post-training pipeline

Per Section 4, post-training is multi-stage and iterative:

reward model training → rejection sampling → SFT → DPO → repeat

The paper describes six rounds of this loop, each refining the previous one. This is the curriculum’s iterative DPO pattern at scale, with published numbers on the data mix at each stage.

The critical detail: after each round, they use the current model to generate candidate completions, score them with a reward model plus rule-based filters, and feed the winners back into the next round’s training data. Llama eats its own cooking.
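
A minimal sketch of that loop's shape, under heavy simplification: every helper name and signature below is a placeholder, not Meta's code, and in the paper the rejection-sampled winners largely feed the SFT mix while preference pairs also involve human annotation.

```python
from typing import Callable, List, Tuple

Prompt = str
Completion = str
PreferencePair = Tuple[Prompt, Completion, Completion]  # (prompt, chosen, rejected)

def build_preference_data(
    generate: Callable[[Prompt, int], List[Completion]],   # current model: sample n candidates
    reward: Callable[[Prompt, Completion], float],         # reward model score
    passes_filters: Callable[[Prompt, Completion], bool],  # rule-based quality/safety filters
    prompts: List[Prompt],
    n: int = 8,
) -> List[PreferencePair]:
    """One 'generate -> score -> filter -> pair up' pass; all callables are hypothetical."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = [c for c in generate(prompt, n) if passes_filters(prompt, c)]
        if len(candidates) < 2:
            continue
        ranked = sorted(candidates, key=lambda c: reward(prompt, c))
        pairs.append((prompt, ranked[-1], ranked[0]))       # best vs worst surviving candidate
    return pairs

# Each round: SFT on filtered data, build pairs with the current model,
# run DPO on those pairs, then repeat with the improved model as generator.
```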

3. Synthetic data via rejection sampling

Per Section 4.3, a substantial fraction of the post-training data — especially for code, reasoning, and tool use — is synthetic, generated by prior Llama versions and filtered. The paper publishes:

  • The rejection-sampling procedure (sample N completions per prompt, keep the ones that pass execution / verification / reward-model thresholds).
  • Ablations showing synthetic-data quality vs quantity tradeoffs.
  • The specific filters used per data type (code: test pass; math: numeric match; reasoning: CoT verification).

In curriculum vocabulary, this is the /ship/17 pipeline scaled and turned inward — the teacher and student are the same model family, just different generations.
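
To make the filter list concrete, here is an illustrative sketch of type-specific verifiers in that spirit. The actual filters, thresholds, and harnesses are Meta's and not published, so everything below (the test runner, the 0.8 reward cutoff) is an assumption:

```python
import re
import subprocess
import sys
import tempfile

def code_passes(completion: str, test_code: str) -> bool:
    """Keep a code completion only if its paired tests run and pass (illustrative harness)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def math_passes(completion: str, reference_answer: str) -> bool:
    """Keep a math completion only if its final number matches the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return bool(numbers) and numbers[-1] == reference_answer

def keep(sample: dict) -> bool:
    """Dispatch each synthetic sample to the matching verifier; the cutoff is a made-up value."""
    if sample["type"] == "code":
        return code_passes(sample["completion"], sample["tests"])
    if sample["type"] == "math":
        return math_passes(sample["completion"], sample["answer"])
    return sample["reward_score"] >= 0.8   # fallback: reward-model threshold (assumed)
```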

4. Llama Guard and the safety stack

Per Section 5, safety is operational, not bolted on:

  • Llama Guard 2/3 — separate small classifiers trained to flag unsafe prompts and responses. Open-weight, deployable independently.
  • Rejection sampling at training time — unsafe candidate completions are filtered out before reaching DPO.
  • Prompt-level safety training — safety prompts in the SFT data teach the model itself to refuse.
  • Iterative red-teaming — humans probe for jailbreaks, jailbreaks become next round’s training data.

The paper publishes the safety eval methodology and the false-positive / false-negative rates. This is the most concrete public reference for “what does a real production safety stack look like.”
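
At deployment time, the classifier half of that stack amounts to wrapping the assistant with input and output checks. A minimal sketch, with the guard left as a placeholder callable (it could be Llama Guard behind any serving setup; the wiring, not the exact interface, is the point):

```python
from typing import Callable, Optional

def guarded_generate(
    guard: Callable[[str, Optional[str]], bool],  # True if the content is flagged unsafe
    generate: Callable[[str], str],               # the assistant model
    prompt: str,
    refusal: str = "Sorry, I can't help with that.",
) -> str:
    """Classifier-in-front, classifier-behind wrapping around a single generation."""
    if guard(prompt, None):        # prompt-level check before any generation happens
        return refusal
    response = generate(prompt)
    if guard(prompt, response):    # response-level check before anything reaches the user
        return refusal
    return response
```

The training-time pieces (rejection-sampling filters, safety SFT prompts, the red-team loop) sit upstream of this; the runtime classifier is the last layer, not the whole stack.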

5. Ablations on data quality vs quantity

Section 4 includes ablation tables that resolve the “is more data always better?” question with receipts:

  • For pretraining, more high-quality data dominates.
  • For post-training, quality dominates quantity emphatically. A 10× larger SFT mix with mediocre filtering performed worse than a smaller, harder-filtered mix.
  • For synthetic data, the rejection-sampling threshold matters more than the sample count.

These ablations are why the “textbooks > web scrape” thesis (Phi-3) and the “filter ruthlessly” thesis (Llama 3) converged in 2024.

The recipe in curriculum language

How each step in /articles/10-fine-tuning maps onto the Llama 3 pipeline:

  • Pretraining: 15.6T tokens, deliberately over-Chinchilla
  • SFT: round 1, ~10M filtered examples
  • Reward model: trained on human + synthetic preference pairs
  • DPO: iterative, several rounds, each on filtered self-generated data
  • Synthetic data: rejection sampling from prior Llama versions, verifier-graded
  • Safety: Llama Guard + rejection sampling + prompt-level + red-team loop
  • Distillation: implicit — 8B and 70B benefit from 405B’s signal in the iterative loop

The thing the curriculum understates and the Llama 3 paper makes explicit: post-training is a loop, not a one-shot. The shipping model is round-N of “generate → filter → train → eval → repeat,” not a single fine-tune.

Reproducibility status

  • Compute. Pretraining 405B on 15.6T tokens is reported by the paper as the equivalent of approximately 30M H100-hours (sanity-checked in the sketch after this list). Post-training is much smaller but still substantial — multi-week jobs on hundreds of GPUs.
  • Data. Pretraining data is not released. Post-training synthetic data is described in mechanism but not released. Llama Guard data is partially released.
  • Tooling. Largely standard PyTorch / TorchTune. Meta has open-sourced training code for fine-tuning the released checkpoints; pretraining code is not directly released.
  • Realistic for a frontier lab? Yes — done. Several other labs have replicated the shape of this pipeline.
  • Realistic for a well-funded startup? Pretraining a 405B-class model: no. Post-training a Llama 3 base: absolutely yes, and many startups do.
  • Realistic for an academic group? Post-training experiments on the 8B base: feasible. The full 405B pipeline: out of reach.
  • Realistic for a hobbyist? Run the released checkpoints locally (8B easily, 70B with quantization). Replicate the pipeline: no.
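
A rough sanity check of the compute bullet above, using the standard 6ND approximation for dense-transformer training FLOPs; the peak-throughput figure is an assumed H100 spec, not a number from the paper:

```python
# FLOPs ~= 6 * N * D for dense transformer training (standard approximation).
n_params = 405e9
n_tokens = 15.6e12
train_flops = 6 * n_params * n_tokens          # ~3.8e25 FLOPs

gpu_hours = 30e6                               # reported order of magnitude
gpu_seconds = gpu_hours * 3600
per_gpu_flops = train_flops / gpu_seconds      # ~3.5e14 FLOP/s sustained per GPU

h100_bf16_peak = 989e12                        # ~989 TFLOPS dense BF16 (assumed spec)
print(per_gpu_flops / h100_bf16_peak)          # ~0.35, i.e. roughly 35% utilization
```

Mid-30s-percent utilization is in the range typical of large dense training runs, so the reported figure hangs together.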

What’s still confidential

The paper is unprecedented in detail, but several things are intentionally vague:

  • Pretraining data composition. Broad categories given; exact corpus not released.
  • The human-rater pool. The paper describes the annotation process; rater instructions and demographics aren’t published.
  • Some hyperparameters. Especially in post-training, where settings shift from round to round.
  • The full safety eval suite. Methodology is published; raw question sets are not.
  • Infrastructure beyond cluster scale. Networking, storage, scheduling — opaque.

These are the standard frontier-lab confidentiality boundaries. The training recipe is open; the training data and production infrastructure are not.

What’s changed since

  • Llama 3.2 (September 2024) — added vision-language variants, smaller models (1B, 3B). Same post-training shape.
  • Llama 3.3 (December 2024) — 70B refresh with continued post-training. Confirmed the iterative pipeline keeps yielding gains.
  • Llama 4 (April 2025) — MoE architecture, multimodal-native. Different enough that this field report still describes the canonical Llama 3 pipeline; a separate report would be needed for Llama 4.
  • Open replication — many post-Llama-3 open-weight models (Qwen, Mistral, others) have adopted the iterative-DPO + rejection-sampling pattern as standard.

What this teaches you

Read with /articles/10-fine-tuning/rlhf-dpo-grpo, /articles/10-fine-tuning/distillation, and /articles/11-agents/guardrails-and-safety:

  1. Post-training is iterative. Plan for several DPO rounds, not one. The first round teaches format; later rounds teach quality.
  2. The model trains itself, indirectly. Synthetic data from prior generations, rejection-sampled, becomes next-generation training data. This is the “self-improvement loop” at the heart of every frontier-lab recipe post-2024.
  3. Safety is a stack, not a setting. Classifier (Llama Guard) + filter (rejection sampling) + prompt training + red-team loop. Each piece catches what the others miss.
  4. Quality > quantity is now empirical. Llama 3’s ablations and Phi-3’s results converge on the same point: a 10× larger mediocre dataset loses to a careful smaller one.
  5. You can read the paper. It’s 92 pages. Sections 4 and 5 are the curriculum-relevant parts. Skip the pretraining sections unless you’re scaling-laws-curious.

Further reading

The Llama 3 paper assumes a lot of background. These books cover it.

  • “AI Engineering” by Chip Huyen (O’Reilly, 2024) — the operations companion. Eval, deployment, monitoring of post-trained models in production. Pairs naturally with Llama 3’s Section 5 (safety) and the iterative-DPO discussion.
  • “Designing Machine Learning Systems” by Chip Huyen (O’Reilly, 2022) — predates LLMs but the data-quality, eval, and feedback-loop discipline is exactly what Llama 3’s iterative pipeline assumes you have. The book everyone working on the post-training side should read once.
  • “Natural Language Processing with Transformers” by Lewis Tunstall, Leandro von Werra, and Thomas Wolf (O’Reilly, revised 2023) — the SFT, RLHF, and DPO chapters use HuggingFace’s TRL library, the same one Meta references for fine-tuning the released Llama 3 checkpoints.
  • “Reinforcement Learning: An Introduction” by Sutton and Barto (MIT Press, 2nd ed., 2018) — for understanding why DPO works without a learned reward model. Free PDF.
  • “Hands-On Large Language Models” by Jay Alammar and Maarten Grootendorst (O’Reilly, 2024) — visual, practical companion. The fine-tuning and evaluation chapters are good prep for reading the Llama 3 paper.
  • “Speech and Language Processing” by Daniel Jurafsky and James H. Martin (3rd ed., draft online: stanford.edu/~jurafsky/slp3) — the canonical NLP textbook. Free draft chapters cover transformer fundamentals carefully, useful background before reading any frontier-lab paper.

See also