tiny-llm · 13 / 16 · 22 min read · 25 min hands-on

step 13 · build

Evaluate honestly

Three lenses on model quality — perplexity, generation samples, LLM-as-judge. None is complete on its own; together they let you make decisions instead of relying on vibes.

evaluation

You’ve trained a model. You’ve fine-tuned it. The training-loss curve looks reasonable. Is the model any good? That’s a different question, and one that takes more than a single number to answer.

This step builds a tiny eval harness with three independent lenses:

  1. Perplexity on a held-out set — the metric every paper opens with
  2. Generation samples — qualitative read-through of what the model actually produces
  3. Lightweight LLM-as-judge — programmatic grading using a stronger model (or a stricter rubric)

Each one is misleading in isolation. Perplexity is only comparable between models that share a tokenizer and a test set. Generation samples are subjective. LLM-as-judge has its own biases. The trick is combining them so that when all three agree, you trust the result; when they disagree, you investigate.

Lens 1: Perplexity

We covered the formula in step 09’s training loop:

PPL = exp(cross_entropy_loss)

Or equivalently, 2^(loss in bits/token). PPL is the model’s branching factor — roughly, how many equally-likely next tokens the model is hesitating between. PPL = 1 means perfect prediction; PPL = vocab_size means uniform random.
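The two forms give the same number, which is worth checking once by hand. A quick standalone sanity check, using the step 09 validation loss you'll see in the eval report further down:

import math

loss_nats = 1.713                    # step 09 validation loss, nats/token
loss_bits = loss_nats / math.log(2)  # ≈ 2.471 bits/token
print(math.exp(loss_nats))           # ≈ 5.55, PPL as e^(nats/token)
print(2 ** loss_bits)                # ≈ 5.55, same PPL as 2^(bits/token)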

For our setup:

Stage                           Approx PPL
Untrained model                 4096 (uniform over our 4k vocab)
After step 09 (~5K steps)       ~5.5
After step 12 LoRA fine-tune    depends on task
GPT-2 small on TinyStories      ~3.0
GPT-4 on TinyStories            ~2.4

The PPL gap from 5.5 → 3 is what scaling buys you (step 11 numbers); the gap from 3 → 2.4 is the diminishing-returns territory at the top.

The eval helper from step 09 already computes this:

# tiny_llm/eval.py
import math
from pathlib import Path
import torch
import torch.nn.functional as F
from tiny_llm.gpt import GPT
from tiny_llm.data import load_token_array, get_batch, DATA_DIR


@torch.no_grad()
def perplexity(model: GPT, data, batch_size: int = 32, n_iters: int = 200) -> tuple[float, float]:
    """Average cross-entropy and perplexity over `n_iters` batches.

    Returns (loss_in_nats, perplexity).
    """
    model.eval()
    losses = []
    for _ in range(n_iters):
        x, y = get_batch(
            data,
            batch_size=batch_size,
            seq_len=model.config.max_seq_len,
            device=str(next(model.parameters()).device),
        )
        logits = model(x)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            y.view(-1),
        )
        losses.append(loss.item())
    avg_loss = sum(losses) / len(losses)
    return avg_loss, math.exp(avg_loss)

200 batches × 32 sequences × 256 tokens ≈ 1.6M tokens — enough for a stable estimate. Smaller and you’ll see ±5% noise between runs.
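To quantify that noise instead of eyeballing it, one option (a sketch, not part of eval.py; it assumes you also keep the per-batch losses list from perplexity()) is to report a rough error band on the loss and convert it to a PPL range:

import math, statistics

def ppl_with_band(losses: list[float]) -> tuple[float, float, float]:
    """Return (ppl, ppl_low, ppl_high) from per-batch losses.

    Uses mean ± 2 standard errors. Batches are drawn randomly and may
    overlap, so treat this as a rough band, not a strict confidence interval.
    """
    mean = statistics.fmean(losses)
    se = statistics.stdev(losses) / math.sqrt(len(losses))
    return math.exp(mean), math.exp(mean - 2 * se), math.exp(mean + 2 * se)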

The big caveat with perplexity: it’s only meaningful within the same tokenizer and same test set. A model with a smaller vocab will always look better on PPL because each token decision has fewer alternatives. If you compare your model’s PPL to GPT-2’s PPL on TinyStories, you’re comparing apples to oranges unless both use the same tokenization.

The Perplexity Calculator demo shows this concretely — same text, three corpora, three completely different PPL numbers. The relative ranking still tells you something useful, but the absolute values don’t transfer.
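If you genuinely need a number that travels across tokenizers, one common workaround is to normalize by something tokenizer-independent, such as raw bytes of text instead of tokens. A minimal sketch, assuming you know the token count and byte count of the same validation text:

import math

def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Total information in bits divided by the raw byte count of the text.

    The denominator no longer depends on how the text was split,
    so the result is comparable across tokenizers.
    """
    total_bits = loss_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes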

Lens 2: Generation samples

PPL doesn’t tell you whether the model can write a coherent story. For that, you read.

# tiny_llm/eval.py
@torch.no_grad()
def sample_generations(
    model: GPT,
    tokenizer,
    prompts: list[str],
    max_new_tokens: int = 80,
    temperature: float = 0.8,
    top_p: float = 0.95,
) -> list[str]:
    """Generate completions for each prompt; return as strings."""
    model.eval()
    outputs = []
    for prompt in prompts:
        device = next(model.parameters()).device  # keep inputs on the model's device
        ids = torch.tensor([tokenizer.encode(prompt)], device=device)
        out_ids = model.sample(
            ids, max_new_tokens=max_new_tokens,
            temperature=temperature, top_p=top_p,
        )
        outputs.append(tokenizer.decode(out_ids[0].tolist()))
    return outputs


EVAL_PROMPTS = [
    "Once upon a time there was a little girl who",
    "The dog and the cat were best friends. One day",
    "In the forest, a small mouse found",
    "Tim wanted to learn how to",
    "The sun was setting over the hill when",
]

Five prompts that cover the “shape” of TinyStories — narrative openings the model should be able to continue if it’s learned anything.

What you’re looking for, in roughly increasing strictness:

  1. Words. Output is real English words, not character soup. Step 0 of training fails this; step 100 passes.
  2. Local grammar. Subject-verb agreement, basic pronoun consistency. Step 500-ish passes this on TinyStories.
  3. Coherent sentences. Each sentence on its own makes sense. Step 1000-2000 passes.
  4. Story structure. Beginning, middle, end. Characters introduced and referred back to. Step 5000+ on the SMALL config passes this.
  5. Adherence to the prompt. Continues what was asked, not whatever the model’s distribution prefers. Often the last to develop and the first to fail under scaling pressure.

Reading 5 generations takes 30 seconds and tells you more than another decimal of PPL.
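To keep that read-through fair from one checkpoint to the next, pin the sampling seed so a single checkpoint's generations are reproducible run to run. A tiny wrapper around the helper above (the seed value is arbitrary; different checkpoints still produce different draws because their distributions differ):

def sample_generations_seeded(model, tokenizer, prompts, seed: int = 1234, **kwargs):
    """Same as sample_generations, but with a fixed RNG state for reproducibility."""
    torch.manual_seed(seed)  # removes run-to-run sampling variance for one checkpoint
    return sample_generations(model, tokenizer, prompts, **kwargs)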

Lens 3: LLM-as-judge

The hard problem with sample reading: it doesn’t scale, isn’t reproducible, and your judgment shifts day-to-day. LLM-as-judge is the standard fix — use a stronger model to grade outputs against a rubric.

For our setup we have two options:

  1. Use the model itself as the judge (cheap, biased — the model can’t grade better than it can write)
  2. Use an external API (OpenAI, Anthropic, etc. — costs money, but more reliable)

For learning purposes, the self-judge version is fine. We hand the model a generation and a rubric, ask it to score, and see whether the scores correlate with our own reading.

# tiny_llm/eval.py
JUDGE_RUBRIC = """\
Rate the following generated story on three criteria, on a 1-5 scale:

1. Grammaticality: are the sentences well-formed?
2. Coherence: does the story flow logically?
3. Story structure: does it have a beginning, middle, and end?

Generated story:
\"\"\"
{generation}
\"\"\"

Respond with three numbers separated by spaces, like: 4 3 4
"""


def llm_as_judge(judge_model, judge_tokenizer, generation: str) -> dict[str, int]:
    """Use a model to score one generation. Returns dict of {criterion: score}.

    For the toy version, we pass our own model as judge. Better: use
    a stronger external model via API.
    """
    prompt = JUDGE_RUBRIC.format(generation=generation)
    device = next(judge_model.parameters()).device  # keep the prompt on the judge's device
    ids = torch.tensor([judge_tokenizer.encode(prompt)], device=device)
    out = judge_model.sample(ids, max_new_tokens=20, temperature=0)  # greedy for stability
    text = judge_tokenizer.decode(out[0].tolist())
    after_prompt = text[len(prompt):].strip().split()[:3]

    try:
        scores = [int(s) for s in after_prompt]
    except ValueError:
        return {"grammaticality": 0, "coherence": 0, "structure": 0}

    return {
        "grammaticality": scores[0] if len(scores) > 0 else 0,
        "coherence":      scores[1] if len(scores) > 1 else 0,
        "structure":      scores[2] if len(scores) > 2 else 0,
    }

The TinyStories-trained model is too small to be a useful judge of anything subtle — but it can recognize gibberish vs. real text, and you’ll see scores rise as the model improves. For real eval, swap judge_model and judge_tokenizer for an OpenAI/Anthropic API client.
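For reference, an external-judge variant could look like the sketch below. It assumes the openai Python package (v1+ client) with an API key in the environment; the model name is a placeholder, and the scoring keys mirror llm_as_judge above.

from openai import OpenAI

def api_judge(generation: str, model_name: str = "gpt-4o-mini") -> dict[str, int]:
    """Sketch: grade one generation with an external model using the same rubric."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": JUDGE_RUBRIC.format(generation=generation)}],
        temperature=0,
    )
    parts = resp.choices[0].message.content.strip().split()[:3]
    scores = [int(p) for p in parts if p.isdigit()]
    keys = ["grammaticality", "coherence", "structure"]
    return {k: (scores[i] if i < len(scores) else 0) for i, k in enumerate(keys)}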

The LLM-as-Judge demo on this site shows what the rubric step looks like in detail, including how to weight criteria differently.

Putting it together: a one-shot eval

# tiny_llm/eval.py
def evaluate(model: GPT, tokenizer, valid_data) -> dict:
    """Run all three lenses, return a summary dict."""
    print("=" * 60)
    print(" eval report")
    print("=" * 60)

    # Lens 1: perplexity
    print("\n[1/3] perplexity on 1.6M tokens of validation...")
    loss, ppl = perplexity(model, valid_data)
    print(f"      val loss: {loss:.3f} nats/tok = {loss/math.log(2):.3f} bits/tok")
    print(f"      val perplexity: {ppl:.2f}")

    # Lens 2: generation samples
    print("\n[2/3] generation samples (5 prompts)...")
    gens = sample_generations(model, tokenizer, EVAL_PROMPTS)
    for i, (p, g) in enumerate(zip(EVAL_PROMPTS, gens)):
        completion = g[len(p):].strip()
        print(f"\n  ── prompt {i+1} ──")
        print(f"  > {p}")
        print(f"    {completion[:200]}{'...' if len(completion) > 200 else ''}")

    # Lens 3: LLM-as-judge (using the model itself for the toy version)
    print("\n[3/3] self-judged scores on the 5 generations...")
    avg_scores = {"grammaticality": 0.0, "coherence": 0.0, "structure": 0.0}
    for g in gens:
        scores = llm_as_judge(model, tokenizer, g)
        for k, v in scores.items():
            avg_scores[k] += v / len(gens)
    print(f"      avg grammar:   {avg_scores['grammaticality']:.1f} / 5")
    print(f"      avg coherence: {avg_scores['coherence']:.1f} / 5")
    print(f"      avg structure: {avg_scores['structure']:.1f} / 5")

    return {
        "loss": loss,
        "perplexity": ppl,
        "samples": gens,
        "judge_scores": avg_scores,
    }

Run after training:

if __name__ == "__main__":
    from tiny_llm.train import GPTConfig
    from tiny_llm.data import prepare

    ckpt = torch.load("checkpoints/best.pt", weights_only=False)
    model = GPT(ckpt["gpt_config"])
    model.load_state_dict(ckpt["model"])
    tok = prepare()
    valid = load_token_array(DATA_DIR / "valid.bin")

    evaluate(model, tok, valid)

Expected output (numbers depend on your training run):

============================================================
 eval report
============================================================

[1/3] perplexity on 1.6M tokens of validation...
      val loss: 1.713 nats/tok = 2.471 bits/tok
      val perplexity: 5.55

[2/3] generation samples (5 prompts)...

  ── prompt 1 ──
  > Once upon a time there was a little girl who
    loved to play with her dog. One day she went to the park and met...

  ── prompt 2 ──
  > The dog and the cat were best friends. One day
    they decided to play hide and seek in the garden...

  ...

[3/3] self-judged scores on the 5 generations...
      avg grammar:   3.4 / 5
      avg coherence: 2.8 / 5
      avg structure: 2.6 / 5

Now you have a baseline. After step 12’s LoRA fine-tune, run the same eval — the perplexity may not move much (LoRA preserves base behavior) but the structure score should rise on instruction-formatted prompts.
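One way to use the baseline, sketched under the assumption that base_model and lora_model are loaded the same way as in the __main__ block above:

# sketch: same eval on both checkpoints, then a side-by-side diff
base = evaluate(base_model, tok, valid)
lora = evaluate(lora_model, tok, valid)
print(f"perplexity      {base['perplexity']:.2f} -> {lora['perplexity']:.2f}")
for k in base["judge_scores"]:
    print(f"{k:15s} {base['judge_scores'][k]:.1f} -> {lora['judge_scores'][k]:.1f}")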

Three traps to avoid

Trap 1: Goodharting the loss. Your training loss can drop while the actual model gets worse — you’re just memorizing the training data. The PPL on a held-out validation set is the real signal. If train loss falls and val loss flattens or rises, you’re overfitting.
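A cheap guard is to run perplexity() on both splits and watch the gap. A sketch, assuming the training tokens were written to a train.bin alongside valid.bin:

# sketch: overfitting check -- a growing train/val gap is the warning sign
train_data = load_token_array(DATA_DIR / "train.bin")  # assumption: same layout as valid.bin
train_loss, train_ppl = perplexity(model, train_data)
val_loss, val_ppl = perplexity(model, valid_data)
print(f"train ppl {train_ppl:.2f}  val ppl {val_ppl:.2f}  gap {val_loss - train_loss:+.3f} nats")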

Trap 2: Cherry-picking generations. Run the same prompt 10 times with random sampling and report the median. One lucky generation per change isn't progress; it's temperature noise. Step 10 is relevant here — temperature=0.8, top_p=0.95 is the rough “honest” sampler for evaluating writing quality.
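In code, that might look like the sketch below: sample several completions of one prompt, judge each, and keep the median rather than the best-looking one. The choice of n=10 and the coherence criterion are illustrative.

import statistics

def median_coherence(model, tokenizer, prompt: str, n: int = 10) -> float:
    """Sketch: judge n sampled completions of one prompt, report the median coherence."""
    gens = sample_generations(model, tokenizer, [prompt] * n)
    return statistics.median(llm_as_judge(model, tokenizer, g)["coherence"] for g in gens)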

Trap 3: Trusting a single judge. LLM-as-judge has known biases — preferring longer outputs, favoring its own model family, being too lenient. Production eval pipelines use multiple judges and take the median, or compare to human ratings on a held-out set. The LLM-as-Judge demo walks through some of these.
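If you do have several judges, the aggregation is the easy part. A sketch, where judges is an assumed list of (model, tokenizer) pairs rather than something built in this step:

import statistics

def multi_judge(judges, generation: str) -> dict[str, float]:
    """Sketch: per-criterion median across several judges."""
    all_scores = [llm_as_judge(m, t, generation) for m, t in judges]
    return {k: statistics.median(s[k] for s in all_scores) for k in all_scores[0]}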

What we did and didn’t do

What we did:

  • A perplexity() function with 1.6M-token coverage for stable estimates
  • A sample_generations() helper with the same prompts you can use across training runs
  • A toy llm_as_judge() that uses the model itself as judge
  • An evaluate() one-shot that runs all three and prints a report

What we didn’t:

  • Standardized benchmarks like MMLU, HellaSwag, GSM8K, or TruthfulQA. Useful for comparing across models, but they target capabilities our 5M-param model doesn’t have. Once you scale to GPT-2-small territory, plug in lm-evaluation-harness — it runs all of these as a one-liner.
  • Calibration metrics (ECE, reliability diagrams). The Calibration Lab demo is the conceptual primer; this is the metric production deployments actually monitor.
  • Per-token logprob analysis. The Perplexity Calculator demo shows surprisal per token; super useful for debugging what the model finds confusing in a specific input.
  • Pairwise preference eval (Arena-style). Take two models, generate from each on the same prompt, ask a judge which is better. The LLM-as-Judge demo is the framework; it’s how Chatbot Arena ranks frontier models.


Next

Step 14 takes the model and makes inference fast. The trick: KV caching. During autoregressive generation, every new token re-attends to all previous tokens — but K and V for those previous tokens don’t change once they’re computed. We cache them and skip the recomputation, so each new token costs work that grows linearly with context length instead of quadratically. We’ll also export the model to ONNX for portable deployment.