Synthetic Data

Use AI to generate training data for AI. By 2026, synthetic data is a major lever in frontier model training, fine-tuning, and evaluation. Done well, it’s transformative; done poorly, it’s a slow-motion data poisoning event.

Why synthetic data

  • Scarcity: real data for some tasks is rare or expensive (low-resource languages, edge cases, regulated domains).
  • Privacy: real data may contain PII; synthetic data can preserve structure without exposing individuals.
  • Coverage: generate data that covers cases real data underrepresents.
  • Cost: cheaper than human labeling for many tasks.
  • Speed: generate millions of examples in hours, not months.

Where it shows up

Pretraining

Frontier models include synthetic data in pretraining mixes:

  • Phi-3 / Phi-4: heavily synthetic, “textbook quality” data.
  • Math reasoning datasets: generate problems + step-by-step solutions.
  • Code corpora: synthesized code verified by passing tests.
  • Multilingual: translate / paraphrase to scale low-resource languages.

Fine-tuning data

For SFT and DPO:

  • Instruction-tuning: a strong model generates (instruction, response) pairs.
  • Distillation: capture traces of a frontier model on your tasks.
  • Preference data: model A and model B each produce a response; a third model judges which is better (see the sketch after this list).
  • Domain-specific tuning: synthesize medical / legal / finance Q&A.
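
A minimal sketch of the preference-data pattern, assuming hypothetical model_a, model_b, and judge_llm callables that take a prompt string and return text:

# Sketch of synthetic preference-pair generation.
# model_a, model_b, judge_llm, and prompts are placeholders, not a specific API.
preference_pairs = []
for prompt in prompts:
    resp_a = model_a(prompt)
    resp_b = model_b(prompt)
    # Ask a third model which response is better; expect a bare "A" or "B".
    verdict = judge_llm(
        f"Prompt: {prompt}\n\nResponse A: {resp_a}\n\nResponse B: {resp_b}\n"
        "Which response is better? Answer with exactly 'A' or 'B'."
    ).strip().upper()
    if verdict in ("A", "B"):
        chosen, rejected = (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
        preference_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})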

RAG and retrieval

  • Synthetic queries for embedding fine-tuning (Stage 10).
  • Synthetic eval sets: generate diverse questions over your corpus (a sketch follows this list).
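
A minimal sketch of query synthesis over a corpus of text chunks; llm and chunks are placeholders for your own client and documents, and the resulting pairs can seed either embedding fine-tuning or an eval set:

# Sketch: generate synthetic (query, passage) pairs from corpus chunks.
# llm and chunks are placeholders; swap in your own client and corpus.
pairs = []
for chunk in chunks:
    questions = llm(
        "Write 3 distinct questions that this passage answers, one per line:\n\n" + chunk
    ).splitlines()
    for q in questions:
        q = q.strip()
        if q:
            pairs.append({"query": q, "positive_passage": chunk})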

Evaluation

  • Adversarial test sets: prompt a model to find failure cases.
  • Edge case generation: synthesize unusual but realistic inputs.
  • Persona-conditioned eval: simulate users of different demographics, expertise.

Quality matters

Garbage synthetic data is worse than no data. What “quality” means:

  • Correctness: outputs are actually right.
  • Diversity: not 1000 paraphrases of the same example.
  • Difficulty: covers easy and hard.
  • Distribution: matches the input distribution you’ll see at deployment.
  • Format consistency: structurally clean.

Generation patterns

Distillation

Take a strong “teacher” model; collect (input, output) traces; train a smaller “student” on them.

dataset = []
for prompt in real_or_synthetic_prompts:
    # Collect the teacher's output for each prompt.
    response = teacher.complete(prompt)
    # Keep only traces that pass a quality check (verifier, judge, or heuristics).
    if quality_check(response):
        dataset.append({"prompt": prompt, "response": response})

# Fine-tune the smaller student model on the filtered traces.
train(student, dataset)

Many modern open-weight models (e.g. the R1-Distill series, OpenHermes) are distilled from stronger models’ outputs.

Self-instruct

Wang et al. (2022). Bootstrap from a few seed examples; the model generates more.

import random

seed = [...]  # ~50 hand-written seed examples
for _ in range(N):
    # Show the model a few random seeds and ask for new examples in the same style.
    sampled = random.sample(seed, 5)
    raw = llm("Generate 5 more examples like these:\n" + serialize(sampled))
    # parse_examples is a placeholder for splitting the output into individual examples.
    seed.extend(parse_examples(raw))

The original Stanford Alpaca dataset followed this pattern.

Verified generation

For tasks with a verifier (math, code, logic):

dataset = []
for problem in problems:
    # Sample n candidate solutions per problem.
    candidates = [llm.generate(problem) for _ in range(n)]
    # Keep only candidates that pass the verifier (unit tests, answer check, proof checker).
    valid = [c for c in candidates if verify(c)]
    if valid:
        dataset.append((problem, valid[0]))

DeepSeek’s R1 used a variant for math: generate many solutions; keep ones with correct final answers.

LLM-as-judge filtering

Generate freely; filter aggressively:

dataset = []
for x in raw_synthetic_data:
    # Ask the judge for a numeric score; assumes it replies with a bare number.
    score = float(judge_llm(f"Rate the quality from 1 to 10: {x}"))
    if score >= threshold:
        dataset.append(x)

Quality varies; the judge model is itself imperfect.

Persona / role-conditioned

Simulate diverse users:

Generate questions a [novice / expert / skeptic / journalist / 12-year-old]
might ask about [topic].

Useful for breadth.
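
A minimal sketch of persona-conditioned generation, looping personas over topics; llm and topics are placeholders:

# Sketch: personas x topics to broaden coverage of question styles.
personas = ["novice", "expert", "skeptic", "journalist", "12-year-old"]
questions = []
for topic in topics:
    for persona in personas:
        out = llm(
            f"Generate 5 questions a {persona} might ask about {topic}, one per line."
        )
        questions.extend(q.strip() for q in out.splitlines() if q.strip())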

Adversarial generation

Have one model try to break another:

Generate prompts that would cause a customer service bot to fail
(misunderstand, refuse incorrectly, hallucinate).

Good for stress tests.

Pitfalls

Mode collapse

Generated data clusters around common patterns; real diversity is lost. Mitigations:

  • High temperature.
  • Prompt-level diversity (“make these very different from one another”).
  • Diverse seed examples.
  • Verify output diversity, e.g. via embedding distances (see the sketch after this list).
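
A rough diversity check along the lines of the last item, assuming an embed() function that returns a sentence-embedding vector:

# Sketch: flag low diversity via mean pairwise cosine similarity of embeddings.
# embed() is a placeholder for any sentence-embedding call returning a vector.
import numpy as np

def mean_pairwise_similarity(texts, embed):
    vecs = np.array([embed(t) for t in texts])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    n = len(texts)
    # Average over off-diagonal entries only.
    return (sims.sum() - n) / (n * (n - 1))

# A value close to 1.0 suggests near-duplicates; tune the threshold on your own data.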

Quality drift

The teacher model has biases / errors; the student inherits and may amplify them. Mitigations:

  • Multiple teachers (model diversity).
  • Human spot-checks on samples.
  • Held-out human-labeled eval to catch drift.

Recursive degradation

Train model A on synthetic data from model B; train model C on synthetic data from A; and so on. Quality degrades fast. Avoid recursive cycles where synthetic data trains the same model that generated it without ground-truth grounding.

Some research (Shumailov et al. 2023, “model collapse”) shows that purely recursive training on synthetic data leads to distribution collapse over generations.

Distribution mismatch

Synthetic data looks unlike production data. Symptoms:

  • Synthetic prompts use stilted language; real users don’t.
  • Synthetic data is “too perfect” — no typos, no shorthand, no implicit context.

Mitigations:

  • Mix synthetic with real data.
  • Sample real production traffic to drive synthesis.
  • Validate on real-data eval sets.

Overconfident filters

Filtering synthetic data with a biased LLM judge produces a biased dataset; the filter imposes its own training-data quirks on what survives.

Privacy-preserving synthetic data

For regulated domains:

  • Differential privacy: train the generator with DP guarantees.
  • Schema-only synthesis: preserve structure (column names, distributions) without copying records (a sketch follows this list).
  • Synthetic personae: generate plausible non-existent individuals from aggregate stats.
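
A minimal sketch of the schema-only idea for tabular data, sampling each column independently from its observed values; real tools also model cross-column correlations and add privacy guarantees, which this does not:

# Sketch: schema-only tabular synthesis with independent per-column sampling.
# Preserves column names and marginal distributions, not cross-column correlations.
import random

def synthesize_rows(real_rows, n):
    columns = real_rows[0].keys()
    return [
        {col: random.choice([row[col] for row in real_rows]) for col in columns}
        for _ in range(n)
    ]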

Tools: Mostly AI, Gretel, Tonic for tabular synthetic data.

Synthetic data isn’t automatically privacy-preserving. A synthetic dataset that closely mimics real outliers can leak.

Data curation pipelines

In practice, synthetic data generation = a pipeline:

  1. Seed data
  2. Generate variants (LLM)
  3. Quality filter (LLM-judge or heuristics)
  4. Deduplicate (embedding similarity)
  5. Difficulty/topic balance (clustering, sampling)
  6. Format validation (schema check)
  7. Hold out for eval (don't train on it)
  8. Mix with real data → train

Each stage discards a lot of generated data. Don’t be surprised if 80% of raw synthetic gets filtered.
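
A skeleton of that pipeline; every helper (generate_variants, judge_score, dedupe_by_embedding, balance_by_cluster, matches_schema) and the MIN_SCORE threshold are placeholders for your own implementations:

# Skeleton of the curation pipeline; every helper and constant is a placeholder.
import random

def curate(seed_examples, target_size):
    raw = generate_variants(seed_examples, n=target_size * 5)      # over-generate
    scored = [x for x in raw if judge_score(x) >= MIN_SCORE]       # quality filter
    deduped = dedupe_by_embedding(scored, sim_threshold=0.9)       # near-duplicate removal
    balanced = balance_by_cluster(deduped, target_size)            # topic/difficulty balance
    valid = [x for x in balanced if matches_schema(x)]             # format validation
    random.shuffle(valid)
    eval_split, train_split = valid[:200], valid[200:]             # hold out an eval slice
    return train_split, eval_split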

Specific high-leverage cases

Long-tail languages

Translate English instruction data into the target language; verify with a model that is strong in that language or with native-speaker review.

Structured extraction

For each known schema, sample random field values and have a model write text that contains them, yielding (text, extracted_fields) pairs.
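
A minimal sketch of that reverse direction, with an illustrative invoice-like schema and llm as a placeholder:

# Sketch: sample field values first, then ask a model to write text containing them.
# llm is a placeholder; the schema and value pools are illustrative.
import random

schema = {
    "vendor": ["Acme Corp", "Globex", "Initech"],
    "amount": [125.00, 980.50, 42.99],
    "currency": ["USD", "EUR"],
}

examples = []
for _ in range(100):
    fields = {k: random.choice(v) for k, v in schema.items()}
    text = llm(
        "Write a short, realistic invoice email that mentions exactly these details: "
        + str(fields)
    )
    examples.append({"text": text, "extracted_fields": fields})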

Bug hunting

Generate edge-case inputs for a model to find failure modes.

Tool-use traces

Have a strong model perform tasks with tools; collect (state, action) pairs to train tool-use.

Coding

Generate programming problems + verified test cases + reference solutions.

Math

Generate problems by transforming known problems (change numbers, rephrase, add steps).
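
A minimal sketch of templated problem generation, where the answer is computed programmatically so correctness is guaranteed by construction:

# Sketch: generate math problems by perturbing a template and computing the answer.
import random

def make_problem():
    a, b = random.randint(12, 99), random.randint(12, 99)
    question = (
        f"A shop sells {a} apples in the morning and {b} apples in the afternoon. "
        "How many apples does it sell in total?"
    )
    return {"question": question, "answer": str(a + b)}

dataset = [make_problem() for _ in range(10_000)]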

Evaluation

How do you know your synthetic dataset is good?

  • Train a small model on it → does test performance improve?
  • Diversity metrics: pairwise embedding distances, n-gram overlap (see the sketch after this list).
  • Human spot-check: review 100 random samples.
  • Sanity baselines: if synthetic data replaces real data, training on it should at least beat training on real data alone; if it augments, the augmented set should beat real-only.
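
A cheap n-gram diversity metric (distinct-n), as a sketch; higher values mean more lexical diversity:

# Sketch: distinct-n = unique n-grams / total n-grams across the dataset.
def distinct_n(texts, n=3):
    total, unique = 0, set()
    for t in texts:
        tokens = t.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0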

Practical advice

  1. Start with quality, not quantity. 1k high-quality synthetic examples beat 100k garbage.
  2. Mix synthetic with real. Synthetic alone risks distribution drift.
  3. Use a stronger model than the target. Distillation from a more capable model is the most reliable pattern.
  4. Verify with humans periodically. Sample 50 examples, hand-review.
  5. Don’t recursively train without ground-truth grounding. Cycles degrade.
  6. Track licensing. Output of commercial LLMs may have terms restricting use as training data; check.

See also