Synthetic Data
Use AI to generate training data for AI. By 2026, synthetic data is a major lever in frontier model training, fine-tuning, and evaluation. Done well, it’s transformative; done poorly, it’s a slow-motion data poisoning event.
Why synthetic data
- Scarcity: real data for some tasks is rare or expensive (rare languages, edge cases, regulated domains).
- Privacy: real data may have PII; synthetic preserves structure without exposure.
- Coverage: generate data that covers cases real data underrepresents.
- Cost: cheaper than human labeling for many tasks.
- Speed: generate millions of examples in hours, not months.
Where it shows up
Pretraining
Frontier models include synthetic data in pretraining mixes:
- Phi-3 / Phi-4: heavily synthetic, “textbook quality” data.
- Math reasoning datasets: generate problems + step-by-step solutions.
- Code corpora: synthesized code verified by running test suites.
- Multilingual: translate / paraphrase to scale low-resource languages.
Fine-tuning data
For SFT and DPO:
- Instruction-tuning: a strong model generates (instruction, response) pairs.
- Distillation: capture traces of a frontier model on your tasks.
- Preference data: model A and model B each produce a response; a third model judges which is better.
- Domain-specific tuning: synthesize medical / legal / finance Q&A.
RAG and retrieval
- Synthetic queries for embedding fine-tuning (Stage 10); see the sketch after this list.
- Synthetic eval sets: generate diverse questions over your corpus.
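A minimal sketch of query synthesis over a corpus; `llm` and `chunks` are placeholders for your own completion function and passage list, and the prompt wording is illustrative:

# Generate (query, passage) pairs for embedding fine-tuning or retrieval eval.
pairs = []
for chunk in chunks:
    question = llm(
        "Write one question a real user might ask that is answered by this passage. "
        "Vary phrasing and avoid copying the passage's wording.\n\n" + chunk
    )
    pairs.append({"query": question.strip(), "positive_passage": chunk})

The (query, positive_passage) pairs feed contrastive fine-tuning directly, or double as a retrieval eval set if you hold them out.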
Evaluation
- Adversarial test sets: prompt a model to find failure cases.
- Edge case generation: synthesize unusual but realistic inputs.
- Persona-conditioned eval: simulate users of different demographics, expertise.
Quality matters
Garbage synthetic data is worse than no data. What “quality” means:
- Correctness: outputs are actually right.
- Diversity: not 1000 paraphrases of the same example.
- Difficulty: covers easy and hard.
- Distribution: matches the input distribution you’ll see at deployment.
- Format consistency: structurally clean.
Generation patterns
Distillation
Take a strong “teacher” model; collect (input, output) traces; train a smaller “student” on them.
dataset = []
for prompt in real_or_synthetic_prompts:
    response = teacher.complete(prompt)
    if quality_check(response):  # drop refusals and low-quality traces
        dataset.append({"prompt": prompt, "response": response})
train(student, dataset)
Many open-weight models (e.g. DeepSeek's R1-Distill series, OpenHermes) are distilled from stronger teacher models, open or closed.
Self-instruct
Wang et al. (2022). Bootstrap from a few seed examples; the model generates more.
seed = [...50 hand-written examples...]
for _ in range(N):
    sampled = sample(seed, 5)
    new_examples = llm("Generate 5 more like these:\n" + serialize(sampled))
    seed.extend(parse_examples(new_examples))  # split the completion into individual examples
The original Stanford Alpaca dataset followed this pattern.
Verified generation
For tasks with a verifier (math, code, logic):
for problem in problems:
    candidates = [llm.generate(problem) for _ in range(n)]  # sample n attempts
    valid = [c for c in candidates if verify(c)]            # keep only solutions the verifier accepts
    if valid:
        dataset.append((problem, valid[0]))
DeepSeek’s R1 used a variant for math: generate many solutions; keep ones with correct final answers.
LLM-as-judge filtering
Generate freely; filter aggressively:
for x in raw_synthetic_data:
    score = judge_llm(f"Rate the quality: {x}")  # judge_llm returns a numeric score parsed from the reply
    if score >= threshold:
        dataset.append(x)
Quality varies; the judge model is itself imperfect.
Persona / role-conditioned
Simulate diverse users:
Generate questions a [novice / expert / skeptic / journalist / 12-year-old]
might ask about [topic].
Useful for breadth.
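A sketch of looping that prompt over personas and topics; the persona list, `topics`, and the `llm` helper are placeholders:

personas = ["novice", "expert", "skeptic", "journalist", "12-year-old"]
questions = []
for persona in personas:
    for topic in topics:
        out = llm(f"Generate 5 questions a {persona} might ask about {topic}. One per line.")
        questions.extend(line.strip() for line in out.splitlines() if line.strip())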
Adversarial generation
Have one model try to break another:
Generate prompts that would cause a customer service bot to fail
(misunderstand, refuse incorrectly, hallucinate).
Good for stress tests.
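One adversarial round might look like the sketch below: an attacker model proposes prompts, the target answers, and a judge flags failures. `attacker`, `target`, and `judge` are placeholder callables, not a specific API:

failures = []
attack_prompts = attacker(
    "Generate 10 prompts likely to make a customer service bot misunderstand, "
    "refuse incorrectly, or hallucinate. One per line."
).splitlines()
for prompt in attack_prompts:
    answer = target(prompt)
    verdict = judge(f"Prompt: {prompt}\nAnswer: {answer}\nDid the bot fail? Answer yes or no.")
    if "yes" in verdict.lower():
        failures.append({"prompt": prompt, "answer": answer})

The collected failures become a regression set you re-run after every model or prompt change.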
Pitfalls
Mode collapse
Generated data clusters around common patterns; real diversity is lost. Mitigations:
- High temperature.
- Prompt-level diversity (“make these very different from one another”).
- Diverse seed examples.
- Verify output diversity (e.g. embedding distances).
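One concrete diversity check: embed the samples and look at mean pairwise cosine similarity; values close to 1 mean near-duplicates. A sketch assuming sentence-transformers and a list `synthetic_texts`:

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(synthetic_texts, normalize_embeddings=True)
sims = [float(util.cos_sim(embeddings[i], embeddings[j]))
        for i, j in combinations(range(len(embeddings)), 2)]
print(f"mean pairwise similarity: {sum(sims) / len(sims):.3f}")  # very high values suggest mode collapse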
Quality drift
The teacher model has biases / errors; the student inherits and may amplify them. Mitigations:
- Multiple teachers (model diversity).
- Human spot-checks on samples.
- Held-out human-labeled eval to catch drift.
Recursive degradation
Train model A on synthetic from model B; train C on synthetic from A; … Quality degrades fast. Avoid recursive cycles where synthetic data trains the same model that generated it without ground-truth grounding.
Some research (Shumailov et al. 2023, “model collapse”) shows pure recursive synthetic data leads to distribution collapse over generations.
Distribution mismatch
Synthetic data looks unlike production data. Symptoms:
- Synthetic prompts use stilted language; real users don’t.
- Synthetic data is “too perfect” — no typos, no shorthand, no implicit context.
Mitigations:
- Mix synthetic with real data.
- Sample real production traffic to drive synthesis.
- Validate on real-data eval sets.
Overconfident filters
Filtering synthetic data with an LLM-judge that’s biased can produce a biased dataset. The filter inherits its own training-data quirks.
Privacy-preserving synthetic data
For regulated domains:
- Differential privacy: train the generator with DP guarantees.
- Schema-only synthesis: preserve structure (column names, distributions) without copying records.
- Synthetic personae: generate plausible non-existent individuals from aggregate stats.
Tools: Mostly AI, Gretel, Tonic for tabular synthetic data.
Synthetic data isn’t automatically privacy-preserving. A synthetic dataset that closely mimics real outliers can leak.
Data curation pipelines
In practice, synthetic data generation = a pipeline:
Seed data
↓
Generate variants (LLM)
↓
Quality filter (LLM-judge or heuristics)
↓
Deduplicate (embedding similarity)
↓
Difficulty/topic balance (clustering, sampling)
↓
Format validation (schema check)
↓
Hold-out for eval (don't train on)
↓
Mix with real data → train
Each stage discards a lot of generated data. Don’t be surprised if 80% of raw synthetic gets filtered.
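As a skeleton in code, the pipeline is just function composition; every helper here (generate_variants, judge_filter, and so on) stands in for your own implementation:

def build_synthetic_dataset(seed_examples, real_examples):
    raw      = generate_variants(seed_examples)            # LLM generation
    filtered = judge_filter(raw)                           # LLM-judge or heuristics
    deduped  = dedupe_by_embedding(filtered, threshold=0.95)
    balanced = balance_by_cluster(deduped)                 # topic / difficulty sampling
    valid    = [x for x in balanced if matches_schema(x)]  # format validation
    eval_set, train_set = holdout_split(valid, eval_frac=0.05)
    return mix(train_set, real_examples), eval_set         # never train on eval_set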
Specific high-leverage cases
Long-tail languages
Translate English instruction data; verify with LLMs strong in the target language (and spot-check with native speakers).
Structured extraction
For each known schema, sample random field values and have a model write text that mentions them, yielding (text, extracted_fields) pairs by reverse generation.
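A sketch of that reverse generation; `schema`, `sample_values`, and `llm` are placeholders:

# The labels are fixed before the text is written, so each
# (text, extracted_fields) pair is correct by construction.
examples = []
for _ in range(num_examples):
    fields = sample_values(schema)  # e.g. {"invoice_no": "INV-4821", "total": "312.50"}
    text = llm(
        "Write a short, realistic email that naturally mentions these facts, "
        f"without listing them as key-value pairs: {fields}"
    )
    examples.append({"text": text, "extracted_fields": fields})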
Bug hunting
Generate edge-case inputs for a model to find failure modes.
Tool-use traces
Have a strong model perform tasks with tools; collect (state, action) pairs to train tool-use.
Coding
Generate programming problems + verified test cases + reference solutions.
Math
Generate problems by transforming known problems (change numbers, rephrase, add steps).
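A minimal sketch of number-substitution on a templated problem; the template is illustrative, and because the answer is computed, each pair is verified by construction:

import random

template = "A train travels {speed} km/h for {hours} hours. How far does it go?"
problems = []
for _ in range(100):
    speed, hours = random.randint(30, 120), random.randint(1, 9)
    problems.append({
        "question": template.format(speed=speed, hours=hours),
        "answer": speed * hours,
    })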
Evaluation
How do you know your synthetic dataset is good?
- Train a small model on it → does test performance improve?
- Diversity metrics: pairwise embedding distances, n-gram overlap (see the sketch after this list).
- Human spot-check: review 100 random samples.
- Sanity baselines: if synthetic replaces real data, it should at least match training on real data alone; if it augments real data, the mixed set should beat real-only.
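A rough n-gram overlap check, complementary to embedding distances: the share of repeated trigrams across examples. Standard library only; it assumes nothing beyond a list of strings:

from collections import Counter

def trigram_repeat_rate(texts):
    counts = Counter()
    for t in texts:
        tokens = t.lower().split()
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0  # 0 = every trigram unique, 1 = all repeated

Compare the rate on synthetic data against the same metric on real data; a much higher synthetic rate is a red flag.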
Practical advice
- Start with quality, not quantity. 1k high-quality synthetic examples beat 100k garbage.
- Mix synthetic with real. Synthetic alone risks distribution drift.
- Use a stronger model than the target. Distillation from a more capable model is the most reliable pattern.
- Verify with humans periodically. Sample 50 examples, hand-review.
- Don’t recursively train without ground-truth grounding. Cycles degrade.
- Track licensing. Output of commercial LLMs may have terms restricting use as training data; check.