Synthetic Data

Use AI to generate training data for AI. By 2026, synthetic data is a major lever in frontier model training, fine-tuning, and evaluation. Done well, it’s transformative; done poorly, it’s a slow-motion data poisoning event.

Why synthetic data

  • Scarcity: real data for some tasks is rare or expensive (low-resource languages, edge cases, regulated domains).
  • Privacy: real data may contain PII; synthetic data can preserve structure without exposing individuals.
  • Coverage: generate data that covers cases real data underrepresents.
  • Cost: cheaper than human labeling for many tasks.
  • Speed: generate millions of examples in hours, not months.

Where it shows up

Pretraining

Frontier models include synthetic data in pretraining mixes:

  • Phi-3 / Phi-4: heavily synthetic, “textbook quality” data.
  • Math reasoning datasets: generate problems + step-by-step solutions.
  • Code corpora: synthesized code verified by passing tests.
  • Multilingual: translate / paraphrase to scale low-resource languages.

Fine-tuning data

For SFT and DPO:

  • Instruction-tuning: a strong model generates (instruction, response) pairs.
  • Distillation: capture traces of a frontier model on your tasks.
  • Preference data: model A and model B each produce a response; a third model judges which is better (see the sketch after this list).
  • Domain-specific tuning: synthesize medical / legal / finance Q&A.
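
A minimal sketch of the preference-data pattern, assuming hypothetical model_a, model_b, and judge_llm callables that take a prompt string and return text:

# Sketch of synthetic preference-pair generation.
# model_a, model_b, judge_llm, and prompts are placeholders, not a specific API.
preference_pairs = []
for prompt in prompts:
    resp_a = model_a(prompt)
    resp_b = model_b(prompt)
    # Ask a third model which response is better; expect a bare "A" or "B".
    verdict = judge_llm(
        f"Prompt: {prompt}\n\nResponse A: {resp_a}\n\nResponse B: {resp_b}\n"
        "Which response is better? Answer with exactly 'A' or 'B'."
    ).strip().upper()
    if verdict in ("A", "B"):
        chosen, rejected = (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
        preference_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})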

RAG and retrieval

  • Synthetic queries for embedding fine-tuning (Stage 10).
  • Synthetic eval sets: generate diverse questions over your corpus (a sketch follows this list).
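
A minimal sketch of query synthesis over a corpus of text chunks; llm and chunks are placeholders for your own client and documents, and the resulting pairs can seed either embedding fine-tuning or an eval set:

# Sketch: generate synthetic (query, passage) pairs from corpus chunks.
# llm and chunks are placeholders; swap in your own client and corpus.
pairs = []
for chunk in chunks:
    questions = llm(
        "Write 3 distinct questions that this passage answers, one per line:\n\n" + chunk
    ).splitlines()
    for q in questions:
        q = q.strip()
        if q:
            pairs.append({"query": q, "positive_passage": chunk})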

Evaluation

  • Adversarial test sets: prompt a model to find failure cases.
  • Edge case generation: synthesize unusual but realistic inputs.
  • Persona-conditioned eval: simulate users of different demographics, expertise.

Quality matters

Garbage synthetic data is worse than no data. What “quality” means:

  • Correctness: outputs are actually right.
  • Diversity: not 1000 paraphrases of the same example.
  • Difficulty: covers easy and hard.
  • Distribution: matches the input distribution you’ll see at deployment.
  • Format consistency: structurally clean.

Generation patterns

Distillation

Take a strong “teacher” model; collect (input, output) traces; train a smaller “student” on them.

dataset = []
for prompt in real_or_synthetic_prompts:
    # Collect the teacher's output for each prompt.
    response = teacher.complete(prompt)
    # Keep only traces that pass a quality check (verifier, judge, or heuristics).
    if quality_check(response):
        dataset.append({"prompt": prompt, "response": response})

# Fine-tune the smaller student model on the filtered traces.
train(student, dataset)

Many modern open-weight models (e.g. the R1-Distill series, OpenHermes) are distilled from stronger models’ outputs.

Self-instruct

Wang et al. (2022). Bootstrap from a few seed examples; the model generates more.

import random

seed = [...]  # ~50 hand-written seed examples
for _ in range(N):
    # Show the model a few random seeds and ask for new examples in the same style.
    sampled = random.sample(seed, 5)
    raw = llm("Generate 5 more examples like these:\n" + serialize(sampled))
    # parse_examples is a placeholder for splitting the output into individual examples.
    seed.extend(parse_examples(raw))

The original Stanford Alpaca dataset followed this pattern.

Verified generation

For tasks with a verifier (math, code, logic):

dataset = []
for problem in problems:
    # Sample n candidate solutions per problem.
    candidates = [llm.generate(problem) for _ in range(n)]
    # Keep only candidates that pass the verifier (unit tests, answer check, proof checker).
    valid = [c for c in candidates if verify(c)]
    if valid:
        dataset.append((problem, valid[0]))

DeepSeek’s R1 used a variant for math: generate many solutions; keep ones with correct final answers.

LLM-as-judge filtering

Generate freely; filter aggressively:

dataset = []
for x in raw_synthetic_data:
    # Ask the judge for a numeric score; assumes it replies with a bare number.
    score = float(judge_llm(f"Rate the quality from 1 to 10: {x}"))
    if score >= threshold:
        dataset.append(x)

Quality varies; the judge model is itself imperfect.

Persona / role-conditioned

Simulate diverse users:

Generate questions a [novice / expert / skeptic / journalist / 12-year-old]
might ask about [topic].

Useful for breadth.
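
A minimal sketch of persona-conditioned generation, looping personas over topics; llm and topics are placeholders:

# Sketch: personas x topics to broaden coverage of question styles.
personas = ["novice", "expert", "skeptic", "journalist", "12-year-old"]
questions = []
for topic in topics:
    for persona in personas:
        out = llm(
            f"Generate 5 questions a {persona} might ask about {topic}, one per line."
        )
        questions.extend(q.strip() for q in out.splitlines() if q.strip())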

Adversarial generation

Have one model try to break another:

Generate prompts that would cause a customer service bot to fail
(misunderstand, refuse incorrectly, hallucinate).

Good for stress tests.

Pitfalls

Mode collapse

Generated data clusters around common patterns; real diversity is lost. Mitigations:

  • High temperature.
  • Prompt-level diversity (“make these very different from one another”).
  • Diverse seed examples.
  • Verify output diversity, e.g. via embedding distances (see the sketch after this list).
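
A rough diversity check along the lines of the last item, assuming an embed() function that returns a sentence-embedding vector:

# Sketch: flag low diversity via mean pairwise cosine similarity of embeddings.
# embed() is a placeholder for any sentence-embedding call returning a vector.
import numpy as np

def mean_pairwise_similarity(texts, embed):
    vecs = np.array([embed(t) for t in texts])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    n = len(texts)
    # Average over off-diagonal entries only.
    return (sims.sum() - n) / (n * (n - 1))

# A value close to 1.0 suggests near-duplicates; tune the threshold on your own data.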

Quality drift

The teacher model has biases / errors; the student inherits and may amplify them. Mitigations:

  • Multiple teachers (model diversity).
  • Human spot-checks on samples.
  • Held-out human-labeled eval to catch drift.

Recursive degradation

Train model A on synthetic data from model B; train model C on synthetic data from A; and so on. Quality degrades fast. Avoid recursive cycles where synthetic data trains the same model that generated it without ground-truth grounding.

Some research (Shumailov et al. 2023, “model collapse”) shows that purely recursive training on synthetic data leads to distribution collapse over generations.

Distribution mismatch

Synthetic data looks unlike production data. Symptoms:

  • Synthetic prompts use stilted language; real users don’t.
  • Synthetic data is “too perfect” — no typos, no shorthand, no implicit context.

Mitigations:

  • Mix synthetic with real data.
  • Sample real production traffic to drive synthesis.
  • Validate on real-data eval sets.

Overconfident filters

Filtering synthetic data with a biased LLM judge produces a biased dataset; the filter imposes its own training-data quirks on what survives.

Privacy-preserving synthetic data

For regulated domains:

  • Differential privacy: train the generator with DP guarantees.
  • Schema-only synthesis: preserve structure (column names, distributions) without copying records (a sketch follows this list).
  • Synthetic personae: generate plausible non-existent individuals from aggregate stats.
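
A minimal sketch of the schema-only idea for tabular data, sampling each column independently from its observed values; real tools also model cross-column correlations and add privacy guarantees, which this does not:

# Sketch: schema-only tabular synthesis with independent per-column sampling.
# Preserves column names and marginal distributions, not cross-column correlations.
import random

def synthesize_rows(real_rows, n):
    columns = real_rows[0].keys()
    return [
        {col: random.choice([row[col] for row in real_rows]) for col in columns}
        for _ in range(n)
    ]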

Tools: Mostly AI, Gretel, Tonic for tabular synthetic data.

Synthetic data isn’t automatically privacy-preserving. A synthetic dataset that closely mimics real outliers can leak.

Data curation pipelines

In practice, synthetic data generation = a pipeline:

  1. Seed data
  2. Generate variants (LLM)
  3. Quality filter (LLM-judge or heuristics)
  4. Deduplicate (embedding similarity)
  5. Difficulty/topic balance (clustering, sampling)
  6. Format validation (schema check)
  7. Hold out for eval (don't train on it)
  8. Mix with real data → train

Each stage discards a lot of generated data. Don’t be surprised if 80% of raw synthetic gets filtered.
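
A skeleton of that pipeline; every helper (generate_variants, judge_score, dedupe_by_embedding, balance_by_cluster, matches_schema) and the MIN_SCORE threshold are placeholders for your own implementations:

# Skeleton of the curation pipeline; every helper and constant is a placeholder.
import random

def curate(seed_examples, target_size):
    raw = generate_variants(seed_examples, n=target_size * 5)      # over-generate
    scored = [x for x in raw if judge_score(x) >= MIN_SCORE]       # quality filter
    deduped = dedupe_by_embedding(scored, sim_threshold=0.9)       # near-duplicate removal
    balanced = balance_by_cluster(deduped, target_size)            # topic/difficulty balance
    valid = [x for x in balanced if matches_schema(x)]             # format validation
    random.shuffle(valid)
    eval_split, train_split = valid[:200], valid[200:]             # hold out an eval slice
    return train_split, eval_split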

Specific high-leverage cases

Long-tail languages

Translate English instruction data into the target language; verify with a model that is strong in that language or with native-speaker review.

Structured extraction

For each known schema, sample random field values and have a model write text that contains them, yielding (text, extracted_fields) pairs.
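
A minimal sketch of that reverse direction, with an illustrative invoice-like schema and llm as a placeholder:

# Sketch: sample field values first, then ask a model to write text containing them.
# llm is a placeholder; the schema and value pools are illustrative.
import random

schema = {
    "vendor": ["Acme Corp", "Globex", "Initech"],
    "amount": [125.00, 980.50, 42.99],
    "currency": ["USD", "EUR"],
}

examples = []
for _ in range(100):
    fields = {k: random.choice(v) for k, v in schema.items()}
    text = llm(
        "Write a short, realistic invoice email that mentions exactly these details: "
        + str(fields)
    )
    examples.append({"text": text, "extracted_fields": fields})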

Bug hunting

Generate edge-case inputs for a model to find failure modes.

Tool-use traces

Have a strong model perform tasks with tools; collect (state, action) pairs to train tool-use.

Coding

Generate programming problems + verified test cases + reference solutions.

Math

Generate problems by transforming known problems (change numbers, rephrase, add steps).
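
A minimal sketch of templated problem generation, where the answer is computed programmatically so correctness is guaranteed by construction:

# Sketch: generate math problems by perturbing a template and computing the answer.
import random

def make_problem():
    a, b = random.randint(12, 99), random.randint(12, 99)
    question = (
        f"A shop sells {a} apples in the morning and {b} apples in the afternoon. "
        "How many apples does it sell in total?"
    )
    return {"question": question, "answer": str(a + b)}

dataset = [make_problem() for _ in range(10_000)]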

Evaluation

How do you know your synthetic dataset is good?

  • Train a small model on it → does test performance improve?
  • Diversity metrics: pairwise embedding distances, n-gram overlap (see the sketch after this list).
  • Human spot-check: review 100 random samples.
  • Sanity baselines: if synthetic data replaces real data, training on it should at least beat training on real data alone; if it augments, the augmented set should beat real-only.
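
A cheap n-gram diversity metric (distinct-n), as a sketch; higher values mean more lexical diversity:

# Sketch: distinct-n = unique n-grams / total n-grams across the dataset.
def distinct_n(texts, n=3):
    total, unique = 0, set()
    for t in texts:
        tokens = t.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0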

Practical advice

  1. Start with quality, not quantity. 1k high-quality synthetic examples beat 100k garbage.
  2. Mix synthetic with real. Synthetic alone risks distribution drift.
  3. Use a stronger model than the target. Distillation from a more capable model is the most reliable pattern.
  4. Verify with humans periodically. Sample 50 examples, hand-review.
  5. Don’t recursively train without ground-truth grounding. Cycles degrade.
  6. Track licensing. Output of commercial LLMs may have terms restricting use as training data; check.

See also