step 13 · build
Evaluate honestly
Three lenses on model quality — perplexity, generation samples, LLM-as-judge. None is complete on its own; together they let you make decisions instead of going on vibes.
You’ve trained a model. You’ve fine-tuned it. The training-loss curve looks reasonable. Is the model any good? That’s a different question, and one that takes more than a single number to answer.
This step builds a tiny eval harness with three independent lenses:
- Perplexity on a held-out set — the metric every paper opens with
- Generation samples — qualitative read-through of what the model actually produces
- Lightweight LLM-as-judge — programmatic grading using a stronger model (or a stricter rubric)
Each one is misleading in isolation. Perplexity is comparable across model sizes only within the same tokenizer. Generation samples are subjective. LLM-as-judge has its own biases. The trick is combining them so that when all three agree, you trust the result; when they disagree, you investigate.
Lens 1: Perplexity
We covered the formula in step 09’s training loop:
PPL = exp(cross_entropy_loss)
Or equivalently, 2^(loss in bits/token). PPL is the model’s effective branching factor — roughly, how many equally likely next tokens the model is hesitating between. PPL = 1 means perfect prediction; PPL = vocab_size means the model is guessing uniformly at random over the vocabulary.
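As a quick arithmetic check of the two forms, using the validation loss from the example run shown later in this step:

import math

loss_nats = 1.713                 # example validation loss from the run shown later
bits = loss_nats / math.log(2)    # ≈ 2.471 bits/token
ppl = math.exp(loss_nats)         # ≈ 5.55
print(bits, ppl, 2 ** bits)       # exp(nats) and 2^(bits) give the same perplexity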
For our setup:
| Stage | Approx PPL |
|---|---|
| Untrained model | 4096 (uniform over our 4k vocab) |
| After step 09 (~5K steps) | ~5.5 |
| After step 12 LoRA fine-tune | depends on task |
| GPT-2 small on TinyStories | ~3.0 |
| GPT-4 on TinyStories | ~2.4 |
The PPL gap from 5.5 → 3 is what scaling buys you (step 11 numbers); the gap from 3 → 2.4 is the diminishing-returns territory at the top.
The eval helper from step 09 already computes this:
# tiny_llm/eval.py
import math
from pathlib import Path
import torch
import torch.nn.functional as F
from tiny_llm.gpt import GPT
from tiny_llm.data import load_token_array, get_batch, DATA_DIR
@torch.no_grad()
def perplexity(model: GPT, data, batch_size: int = 32, n_iters: int = 200) -> tuple[float, float]:
    """Average cross-entropy and perplexity over `n_iters` batches.

    Returns (loss_in_nats, perplexity).
    """
    model.eval()
    losses = []
    for _ in range(n_iters):
        x, y = get_batch(
            data,
            batch_size=batch_size,
            seq_len=model.config.max_seq_len,
            device=str(next(model.parameters()).device),
        )
        logits = model(x)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            y.view(-1),
        )
        losses.append(loss.item())
    avg_loss = sum(losses) / len(losses)
    return avg_loss, math.exp(avg_loss)
200 batches × 32 sequences × 256 tokens ≈ 1.6M tokens — enough for a stable estimate. Smaller and you’ll see ±5% noise between runs.
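If you want to see that noise directly, here is a minimal sketch. It assumes `model` and `valid` are already loaded (as in the __main__ runner at the end of this step) and that get_batch draws random offsets on each call:

# Rough run-to-run noise check; assumes `model` and `valid` are already loaded
# as in the __main__ runner below, and that get_batch samples random offsets.
for n in (20, 200):
    estimates = [perplexity(model, valid, n_iters=n)[1] for _ in range(3)]
    spread = (max(estimates) - min(estimates)) / min(estimates)
    print(f"n_iters={n}: ppl ≈ {estimates[0]:.2f}, run-to-run spread ≈ {spread:.1%}")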
The big caveat with perplexity: it’s only meaningful within the same tokenizer and same test set. A model with a smaller vocab will always look better on PPL because each token decision has fewer alternatives. If you compare your model’s PPL to GPT-2’s PPL on TinyStories, you’re comparing apples to oranges unless both use the same tokenization.
The Perplexity Calculator demo shows this concretely — same text, three corpora, three completely different PPL numbers. The relative ranking still tells you something useful, but the absolute values don’t transfer.
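One way to make cross-tokenizer comparisons fair is to normalize to bits per character of the raw text instead of per token. A hedged sketch follows; the helper name, and the assumption that you can count characters in the raw validation text, are mine rather than part of the repo:

import math

# Hypothetical helper: convert nats-per-token into bits-per-character, which is
# tokenizer-independent because the character count does not depend on the vocab.
def bits_per_char(loss_nats: float, n_tokens: int, n_chars: int) -> float:
    total_bits = loss_nats * n_tokens / math.log(2)
    return total_bits / n_chars

Two models with different vocabularies can then be ranked on the same raw text, because the denominator no longer depends on the tokenizer.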
Lens 2: Generation samples
PPL doesn’t tell you whether the model can write a coherent story. For that, you read.
# tiny_llm/eval.py
@torch.no_grad()
def sample_generations(
    model: GPT,
    tokenizer,
    prompts: list[str],
    max_new_tokens: int = 80,
    temperature: float = 0.8,
    top_p: float = 0.95,
) -> list[str]:
    """Generate completions for each prompt; return as strings."""
    model.eval()
    device = next(model.parameters()).device  # keep prompt ids on the model's device
    outputs = []
    for prompt in prompts:
        ids = torch.tensor([tokenizer.encode(prompt)], device=device)
        out_ids = model.sample(
            ids, max_new_tokens=max_new_tokens,
            temperature=temperature, top_p=top_p,
        )
        outputs.append(tokenizer.decode(out_ids[0].tolist()))
    return outputs


EVAL_PROMPTS = [
    "Once upon a time there was a little girl who",
    "The dog and the cat were best friends. One day",
    "In the forest, a small mouse found",
    "Tim wanted to learn how to",
    "The sun was setting over the hill when",
]
Five prompts that cover the “shape” of TinyStories — narrative openings the model should be able to continue if it’s learned anything.
What you’re looking for, in roughly increasing strictness:
- Words. Output is real English words, not character soup. Step 0 of training fails this; step 100 passes.
- Local grammar. Subject-verb agreement, basic pronoun consistency. Step 500-ish passes this on TinyStories.
- Coherent sentences. Each sentence on its own makes sense. Step 1000-2000 passes.
- Story structure. Beginning, middle, end. Characters introduced and referred back to. Step 5000+ on the SMALL config passes this.
- Adherence to the prompt. Continues what was asked, not whatever the model’s distribution prefers. Often the last to develop and the first to fail under scaling pressure.
Reading 5 generations takes 30 seconds and tells you more than another decimal of PPL.
Lens 3: LLM-as-judge
The hard problem with sample reading: it doesn’t scale, isn’t reproducible, and your judgment shifts day-to-day. LLM-as-judge is the standard fix — use a stronger model to grade outputs against a rubric.
For our setup we have two options:
- Use the model itself as the judge (cheap, biased — the model can’t grade better than it can write)
- Use an external API (OpenAI, Anthropic, etc. — costs money, but more reliable)
For learning purposes, the self-judge version is fine. We hand the model a generation and a rubric, ask it to score, and see whether the scores correlate with our own reading.
# tiny_llm/eval.py
JUDGE_RUBRIC = """\
Rate the following generated story on three criteria, on a 1-5 scale:
1. Grammaticality: are the sentences well-formed?
2. Coherence: does the story flow logically?
3. Story structure: does it have a beginning, middle, and end?
Generated story:
\"\"\"
{generation}
\"\"\"
Respond with three numbers separated by spaces, like: 4 3 4
"""
def llm_as_judge(judge_model, judge_tokenizer, generation: str) -> dict[str, int]:
    """Use a model to score one generation. Returns dict of {criterion: score}.

    For the toy version, we pass our own model as judge. Better: use
    a stronger external model via API.
    """
    prompt = JUDGE_RUBRIC.format(generation=generation)
    device = next(judge_model.parameters()).device
    ids = torch.tensor([judge_tokenizer.encode(prompt)], device=device)
    out = judge_model.sample(ids, max_new_tokens=20, temperature=0)  # greedy for stability
    text = judge_tokenizer.decode(out[0].tolist())
    after_prompt = text[len(prompt):].strip().split()[:3]
    try:
        scores = [int(s) for s in after_prompt]
    except ValueError:
        return {"grammaticality": 0, "coherence": 0, "structure": 0}
    return {
        "grammaticality": scores[0] if len(scores) > 0 else 0,
        "coherence": scores[1] if len(scores) > 1 else 0,
        "structure": scores[2] if len(scores) > 2 else 0,
    }
The TinyStories-trained model is too small to be a useful judge of anything subtle — but it can recognize gibberish vs. real text, and you’ll see scores rise as the model improves. For real eval, swap judge_model and judge_tokenizer for an OpenAI/Anthropic API client.
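If you go the external route, the swap might look roughly like this. It is only a sketch: it assumes the openai package and an OPENAI_API_KEY in your environment, and the model name is a placeholder rather than a recommendation.

# Sketch of an API-backed judge. Assumes the `openai` package and an
# OPENAI_API_KEY in the environment; the model name is a placeholder.
from openai import OpenAI

def api_judge(generation: str, model_name: str = "gpt-4o-mini") -> dict[str, int]:
    client = OpenAI()
    reply = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": JUDGE_RUBRIC.format(generation=generation)}],
        temperature=0,
        max_tokens=10,
    )
    parts = reply.choices[0].message.content.strip().split()[:3]
    scores = {"grammaticality": 0, "coherence": 0, "structure": 0}
    for key, p in zip(scores, parts):
        if p.isdigit():
            scores[key] = int(p)
    return scores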
The LLM-as-Judge demo on this site shows what the rubric step looks like in detail, including how to weight criteria differently.
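If you want a single headline number, one common pattern is a weighted sum of the criteria. The weights below are illustrative examples, not taken from the demo:

# Illustrative: collapse the three criteria into one 1-5 score.
# These weights are arbitrary examples, not part of the demo or the repo.
WEIGHTS = {"grammaticality": 0.25, "coherence": 0.45, "structure": 0.30}

def weighted_score(scores: dict[str, int]) -> float:
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)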
Putting it together: a one-shot eval
# tiny_llm/eval.py
def evaluate(model: GPT, tokenizer, valid_data) -> dict:
    """Run all three lenses, return a summary dict."""
    print("=" * 60)
    print(" eval report")
    print("=" * 60)

    # Lens 1: perplexity
    print("\n[1/3] perplexity on 1.6M tokens of validation...")
    loss, ppl = perplexity(model, valid_data)
    print(f"  val loss: {loss:.3f} nats/tok = {loss / math.log(2):.3f} bits/tok")
    print(f"  val perplexity: {ppl:.2f}")

    # Lens 2: generation samples
    print("\n[2/3] generation samples (5 prompts)...")
    gens = sample_generations(model, tokenizer, EVAL_PROMPTS)
    for i, (p, g) in enumerate(zip(EVAL_PROMPTS, gens)):
        completion = g[len(p):].strip()
        print(f"\n  ── prompt {i + 1} ──")
        print(f"  > {p}")
        print(f"  {completion[:200]}{'...' if len(completion) > 200 else ''}")

    # Lens 3: LLM-as-judge (using the model itself for the toy version)
    print("\n[3/3] self-judged scores on the 5 generations...")
    avg_scores = {"grammaticality": 0.0, "coherence": 0.0, "structure": 0.0}
    for g in gens:
        scores = llm_as_judge(model, tokenizer, g)
        for k, v in scores.items():
            avg_scores[k] += v / len(gens)
    print(f"  avg grammar:   {avg_scores['grammaticality']:.1f} / 5")
    print(f"  avg coherence: {avg_scores['coherence']:.1f} / 5")
    print(f"  avg structure: {avg_scores['structure']:.1f} / 5")

    return {
        "loss": loss,
        "perplexity": ppl,
        "samples": gens,
        "judge_scores": avg_scores,
    }
Run after training:
if __name__ == "__main__":
    from tiny_llm.train import GPTConfig
    from tiny_llm.data import prepare

    ckpt = torch.load("checkpoints/best.pt", weights_only=False)
    model = GPT(ckpt["gpt_config"])
    model.load_state_dict(ckpt["model"])
    tok = prepare()
    valid = load_token_array(DATA_DIR / "valid.bin")
    evaluate(model, tok, valid)
Expected output (numbers depend on your training run):
============================================================
eval report
============================================================
[1/3] perplexity on 1.6M tokens of validation...
val loss: 1.713 nats/tok = 2.471 bits/tok
val perplexity: 5.55
[2/3] generation samples (5 prompts)...
── prompt 1 ──
> Once upon a time there was a little girl who
loved to play with her dog. One day she went to the park and met...
── prompt 2 ──
> The dog and the cat were best friends. One day
they decided to play hide and seek in the garden...
...
[3/3] self-judged scores on the 5 generations...
avg grammar: 3.4 / 5
avg coherence: 2.8 / 5
avg structure: 2.6 / 5
Now you have a baseline. After step 12’s LoRA fine-tune, run the same eval — the perplexity may not move much (LoRA preserves base behavior) but the structure score should rise on instruction-formatted prompts.
Three traps to avoid
Trap 1: Goodharting the loss. Your training loss can drop while the actual model gets worse — you’re just memorizing the training data. The PPL on a held-out validation set is the real signal. If train loss falls and val loss flattens or rises, you’re overfitting.
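A concrete way to watch for this is to run the same PPL estimate on both splits. A sketch, under the assumption that the data step wrote a train.bin alongside valid.bin (adjust the filename if yours differs):

# Overfitting check: compare PPL on training data vs. held-out data.
# Assumes a train.bin alongside valid.bin; adjust to your data step's filenames.
train = load_token_array(DATA_DIR / "train.bin")
valid = load_token_array(DATA_DIR / "valid.bin")
_, train_ppl = perplexity(model, train, n_iters=50)
_, val_ppl = perplexity(model, valid, n_iters=50)
print(f"train ppl {train_ppl:.2f} vs val ppl {val_ppl:.2f}")
# A val PPL that keeps climbing above train PPL is the memorization signature.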
Trap 2: Cherry-picking generations. Run the same prompt 10 times with random sampling and report the median quality, not the best run. One lucky generation per change isn’t progress; it’s temperature noise. Step 10 is relevant here — temperature=0.8, top_p=0.95 is the rough “honest” sampler for evaluating writing quality.
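In code, the honest version looks something like this sketch (it assumes model and tok are loaded as in the runner above):

# Sketch: sample the same prompt 10 times and report the median judge score,
# not the best one. Assumes `model` and `tok` are loaded as in the runner above.
import statistics

prompt = EVAL_PROMPTS[0]
gens = sample_generations(model, tok, [prompt] * 10)
coherence = [llm_as_judge(model, tok, g)["coherence"] for g in gens]
print(f"median coherence over 10 samples: {statistics.median(coherence)}")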
Trap 3: Trusting a single judge. LLM-as-judge has known biases — preferring longer outputs, favoring its own model family, being too lenient. Production eval pipelines use multiple judges and take the median, or compare to human ratings on a held-out set. The LLM-as-Judge demo walks through some of these.
What we did and didn’t do
What we did:
- A perplexity() function with 1.6M-token coverage for stable estimates
- A sample_generations() helper with the same prompts you can use across training runs
- A toy llm_as_judge() that uses the model itself as judge
- An evaluate() one-shot that runs all three and prints a report
What we didn’t:
- Standardized benchmarks like MMLU, HellaSwag, GSM8K, or TruthfulQA. Useful for comparing across models, but they target capabilities our 5M-param model doesn’t have. Once you scale to GPT-2-small territory, plug in lm-evaluation-harness — it runs all of these as a one-liner.
- Calibration metrics (ECE, reliability diagrams). The Calibration Lab demo is the conceptual primer; this is the metric production deployments actually monitor.
- Per-token logprob analysis. The Perplexity Calculator demo shows surprisal per token; super useful for debugging what the model finds confusing in a specific input.
- Pairwise preference eval (Arena-style). Take two models, generate from each on the same prompt, ask a judge which is better. The LLM-as-Judge demo is the framework; it’s how Chatbot Arena ranks frontier models. A minimal sketch of the setup follows this list.
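For reference, here is a minimal sketch of the pairwise setup using the same self-judge machinery as above. The rubric text and the A/B parsing are illustrative, not part of this step’s harness:

# Illustrative pairwise comparison: generate from two models on the same prompt,
# then ask a judge which completion is better. Rubric and parsing are examples only.
PAIRWISE_RUBRIC = """\
Two completions of the same prompt follow. Answer with a single letter,
A or B, for whichever is the better story.

Completion A:
{a}

Completion B:
{b}
"""

def pairwise_prefer(judge_model, judge_tokenizer, gen_a: str, gen_b: str) -> str:
    prompt = PAIRWISE_RUBRIC.format(a=gen_a, b=gen_b)
    ids = torch.tensor([judge_tokenizer.encode(prompt)])
    out = judge_model.sample(ids, max_new_tokens=2, temperature=0)
    answer = judge_tokenizer.decode(out[0].tolist())[len(prompt):].strip()
    return "A" if answer.startswith("A") else "B"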
Cross-references
- Perplexity Calculator — interactive PPL on n-gram models, makes the metric concrete
- LLM-as-Judge — rubric-driven scoring with weighted criteria
- Calibration Lab — reliability diagrams and ECE
- Confusion Matrix Lab — for classification-style fine-tunes, the precision/recall tradeoffs
Next
Step 14 takes the model and makes inference fast. The trick: KV caching. During autoregressive generation, every new token re-attends to all previous tokens — but K and V for those previous tokens don’t change once they’re computed. We cache them and skip the recomputation, so per-token cost grows linearly with context length instead of quadratically. We’ll also export the model to ONNX for portable deployment.