demo
Question in, answer out
The whole pipeline, top to bottom: a question becomes tokens, tokens become vectors, vectors flow through twelve transformer blocks, and out the other side comes a probability distribution over 50,257 candidate next tokens. Pick one, append it, repeat. That loop is the entire model.
What you can see in this demo
- Tokenization is the entry point. Whatever you type, the model never sees characters — it sees a list of vocabulary ids. Try the same prompt in Tokenizer Surgery for a closer look, or see the first sketch after this list.
- The residual stream evolves through depth. The per-layer strips show the L2 norm of the residual at each block. Early layers track surface features; later layers consolidate semantics. Same number of dimensions throughout — the stream just gets denser.
- The whole transformer collapses to one distribution. After 12 blocks of attention + MLP, all that fanout collapses back into a single vector of length 50,257 (the vocab) — one score per candidate next token.
- Sampling is where the chatbot feel comes from. Slide temperature: under 1, the distribution sharpens; over 1, it flattens. Slide top-p: the kept tokens are the smallest set whose probabilities cumulatively reach p. This is how products tune diversity vs. determinism without retraining (see the second sketch after this list).
- It's autoregressive — one token at a time. Step through the generation. Each step picks one token, appends it to the context, and the whole forward pass runs again. There is no separate "answer" — there's just predict the next token, looped.
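A minimal sketch of that tokenization step, assuming the Hugging Face transformers package and the stock gpt2 tokenizer (the demo renders the tokenizer's leading-space marker Ġ as ·):

```python
# Sketch: what the model actually receives is a list of vocabulary ids, not characters.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "The sky is blue because"          # illustrative prompt, not one of the demo's five
ids = tok.encode(prompt)                    # ints in [0, 50257)
pieces = tok.convert_ids_to_tokens(ids)     # BPE pieces; 'Ġ' marks a leading space

for i, piece in zip(ids, pieces):
    print(f"{i:>6}  {piece!r}")
```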
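And a toy sketch of what the temperature and top-p sliders do to a distribution; the logits below are made-up numbers, not taken from the demo:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Made-up logits for five candidate tokens, just for illustration.
logits = np.array([4.0, 3.2, 2.5, 1.0, 0.1])

for temperature in (0.5, 1.0, 1.5):
    probs = softmax(logits / temperature)      # temperature scales logits before softmax
    print(temperature, np.round(probs, 3))     # under 1 sharpens, over 1 flattens

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose probabilities cumulatively reach p, renormalised."""
    order = np.argsort(probs)[::-1]                               # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

print(top_p_filter(softmax(logits), p=0.9))
```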
What's actually happening at each box
- Tokens. BPE chunks your prompt into vocabulary ids. Whitespace is part of the token (the leading · renders a leading space).
- Embeddings. Each token id indexes into a [50,257 × 768] matrix to retrieve a 768-dim vector. A separate positional embedding for each position is added on top, so the model knows which token came first. The result is the initial residual stream (see the first sketch after this list).
- 12 Transformer blocks. Each one: layer-norm, self-attention (12 heads), residual add, layer-norm, MLP, residual add. Information mixes across positions in attention, gets transformed locally in the MLP, and accumulates in the residual stream. See Attention Inspector for what one of these blocks does in detail.
- Unembedding. The final residual stream is multiplied by a [768 × 50,257] matrix (the transpose of the input embedding; GPT-2 ties these weights) to produce a score per vocabulary token. These scores are logits.
- Softmax → probabilities. Logits become a probability distribution. With temperature 1, this is the model's "honest" prediction. Temperature scales the logits before softmax; top-p keeps only the smallest set of tokens whose cumulative probability reaches p.
- Pick one token. Greedy = always argmax. With temperature/top-p, sample from the reshaped distribution. Append the picked token to the context and run the whole thing again (the second sketch after this list walks through this loop).
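A sketch of the embedding lookup, with randomly initialised stand-ins for GPT-2's trained tables just to make the shapes concrete; the token ids are illustrative, not the demo's:

```python
import numpy as np

vocab_size, d_model, n_ctx = 50257, 768, 1024

# Random stand-ins for GPT-2's trained tables (wte and wpe); real values come from the checkpoint.
rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model)).astype(np.float32)    # [50,257 x 768]
position_embedding = rng.normal(size=(n_ctx, d_model)).astype(np.float32)      # [1,024 x 768]

token_ids = [464, 6766, 318, 4171, 780]   # illustrative ids for a five-token prompt

# Initial residual stream: token vector + positional vector, one 768-dim row per token.
residual = token_embedding[token_ids] + position_embedding[: len(token_ids)]
print(residual.shape)                     # (5, 768)
```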
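And a sketch of the whole loop, forward pass to logits to softmax to pick to append, assuming the Hugging Face transformers GPT-2 checkpoint; the demo pre-computes its results, but the steps are the same:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok.encode("The sky is blue because", return_tensors="pt")   # illustrative prompt
temperature, top_p = 0.8, 0.9

for _ in range(20):                                        # 20 generation steps
    with torch.no_grad():
        logits = model(ids).logits[0, -1]                  # unembedded scores for the next token: [50257]
    probs = torch.softmax(logits / temperature, dim=-1)    # temperature, then softmax

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches p.
    sorted_probs, order = probs.sort(descending=True)
    keep = (sorted_probs.cumsum(0) - sorted_probs) < top_p # always keeps at least the top token
    kept = torch.zeros_like(probs).scatter(0, order[keep], sorted_probs[keep])
    kept = kept / kept.sum()

    next_id = torch.multinomial(kept, 1)                   # greedy decoding would be probs.argmax()
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)      # append and run the whole pass again

print(tok.decode(ids[0]))
```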
How it's powered
The pipeline data is a pre-computed forward pass of real GPT-2 small (124M params, 12 layers × 12 heads, d=768) on five curated prompts. For each prompt, we record the input tokenization, a per-layer residual-norm summary, and the top-30 next-token candidates at every greedy generation step. The total payload is ~150 KB.
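A hedged sketch of how such a payload could be produced, using forward hooks to read the per-block residual norms and recording the top-30 candidates at each greedy step; the field names (tokens, steps, residual_norms, candidates) are illustrative, not the demo's actual schema:

```python
import json
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def forward_with_norms(ids):
    """One forward pass; returns next-token logits plus the residual L2 norm after each of the 12 blocks."""
    norms = []
    hooks = [block.register_forward_hook(lambda m, inp, out: norms.append(out[0][0, -1].norm().item()))
             for block in model.transformer.h]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    for h in hooks:
        h.remove()
    return logits, norms

prompt = "The sky is blue because"           # illustrative; the demo ships five curated prompts
ids = tok.encode(prompt, return_tensors="pt")
record = {"prompt": prompt, "tokens": tok.convert_ids_to_tokens(ids[0].tolist()), "steps": []}

for _ in range(10):                          # a handful of greedy steps
    logits, norms = forward_with_norms(ids)
    top = torch.topk(torch.softmax(logits, dim=-1), 30)     # top-30 candidates shown in the UI
    record["steps"].append({
        "residual_norms": [round(n, 2) for n in norms],
        "candidates": [{"token": tok.decode([i]), "p": round(p, 5)}
                       for p, i in zip(top.values.tolist(), top.indices.tolist())],
    })
    ids = torch.cat([ids, logits.argmax().view(1, 1)], dim=1)   # greedy: append the argmax token

print(f"{len(json.dumps(record)):,} bytes for this prompt")
```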
Why pre-compute: running GPT-2 in the browser via transformers.js works but adds ~125 MB of model weights to the page load and several seconds of warm-up. For a teaching demo, the static JSON is better in every way except "let users type their own prompt" — which is the obvious v4.
Anchored to 06-transformers/transformer-block and 08-prompting/sampling-and-decoding from the learning path.