demo

Question in, answer out

The whole pipeline, top to bottom: a question becomes tokens, tokens become vectors, vectors flow through twelve transformer blocks, and out the other side comes a probability distribution over 50,257 candidate next tokens. Pick one, append it, repeat. That loop is the entire model.
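If you want to run that loop yourself, here is a minimal sketch using the Hugging Face transformers GPT-2 small checkpoint. It's greedy decoding only, and the demo's own pre-compute differs in details, but the predict-append-repeat cycle is the same:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(10):                      # generate 10 tokens, one at a time
    with torch.no_grad():
        logits = model(ids).logits       # [1, seq_len, 50257]
    next_id = logits[0, -1].argmax()     # greedy: take the single best token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append it and loop

print(tokenizer.decode(ids[0]))
```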

What you can see in this demo

  • Tokenization is the entry point. Whatever you type, the model never sees characters — it sees a list of vocabulary ids. Try the same prompt in Tokenizer Surgery to see this up close.
  • The residual stream evolves through depth. The per-layer strips show the L2 norm of the residual at each block. Early layers track surface features; later layers consolidate semantics. Same number of dimensions throughout — the stream just gets denser.
  • The whole transformer collapses to one distribution. After 12 blocks of attention + MLP, all that fanout collapses back into a single vector of length 50,257 (the vocab) — one score per candidate next token.
  • Sampling is where chatbot-feel comes from. Slide temperature: under 1, the distribution sharpens; over 1, it flattens. Slide top-p: kept tokens are the smallest set whose probabilities cumulatively reach p. This is how products tune diversity vs. determinism without retraining (see the sampling sketch just after this list).
  • It's autoregressive — one token at a time. Step through the generation. Each step picks one token, appends it to the context, and the whole forward pass runs again. There is no separate "answer" — there's just predict the next token, looped.
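Here is a minimal sketch of what those two sliders do to a distribution, on a toy five-token vocabulary instead of the real 50,257. The logits and function name are illustrative, not the demo's actual sampling code:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=1.0, rng=None):
    """Reshape a logit vector with temperature and top-p, then sample one token id."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature                 # < 1 sharpens, > 1 flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                          # softmax

    order = np.argsort(probs)[::-1]               # highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]                         # smallest set whose mass reaches p

    kept = probs[keep] / probs[keep].sum()        # renormalize over the survivors
    return rng.choice(keep, p=kept)

# Toy logits for a five-token vocabulary.
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])
print(sample_next(logits, temperature=0.7, top_p=0.9))
```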

What's actually happening at each box

  1. Tokens. BPE chunks your prompt into vocabulary ids. Whitespace belongs to the token: the leading · in the display marks a token that begins with a space. The tokenizer sketch after this list shows the same thing in code.
  2. Embeddings. Each token id indexes into a [50,257 × 768] matrix to retrieve a 768-dim vector. A separate positional embedding for each position is added on top, so the model knows which token came first. The result is the initial residual stream (the shape sketch after this list follows these dimensions through the rest of the pipeline).
  3. 12 Transformer blocks. Each one: layer-norm, self-attention (12 heads), residual add, layer-norm, MLP, residual add (GPT-2 normalizes before each sub-block, not after). Information mixes across positions in attention, gets transformed locally in the MLP, and accumulates in the residual stream. See Attention Inspector for what one of these blocks does in detail.
  4. Unembedding. The final residual stream is multiplied by a [768 × 50,257] matrix (GPT-2 ties this to the input embedding, so it is the same matrix transposed) to produce a score per vocabulary token. These scores are logits.
  5. Softmax → probabilities. Logits become a probability distribution. With temperature 1, this is the model's "honest" prediction. Temperature rescales the logits before softmax; top-p then truncates the distribution to the smallest set of tokens whose cumulative mass reaches p.
  6. Pick one token. Greedy = always argmax. With temperature/top-p, sample from the reshaped distribution. Append the picked token to the context and run the whole thing again.
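To see step 1 for yourself, the Hugging Face GPT-2 tokenizer exposes both the ids and the raw BPE strings. In the raw strings a leading space shows up as the byte-level marker Ġ, which the demo renders as ·. The prompt below is just an example:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "Transformers predict the next token"
ids = tokenizer.encode(prompt)

print(ids)                                   # the list of vocabulary ids the model sees
print(tokenizer.convert_ids_to_tokens(ids))  # raw BPE strings; 'Ġ' marks a leading space
```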
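And here is a rough numpy sketch of the shapes in steps 2 and 4 through 6, with random stand-in weights and the 12 blocks elided. It tracks the dimensions only, not the trained model:

```python
import numpy as np

vocab, d_model, seq = 50_257, 768, 6          # GPT-2 small sizes; toy 6-token prompt
rng = np.random.default_rng(0)

# Random stand-ins for the trained matrices (illustrative only).
W_embed = rng.standard_normal((vocab, d_model)).astype(np.float32) * 0.02  # [50,257 x 768]
W_pos   = rng.standard_normal((1024, d_model)).astype(np.float32) * 0.02   # one row per position

token_ids = rng.integers(0, vocab, size=seq)  # stand-in for a BPE-encoded prompt

x = W_embed[token_ids] + W_pos[:seq]          # initial residual stream, [6 x 768]

# ... the 12 transformer blocks would update x here (attention + MLP + residual adds) ...

logits = x @ W_embed.T                        # unembedding: [6 x 768] @ [768 x 50,257]
last = logits[-1]                             # scores for the token after the last position

probs = np.exp(last - last.max())
probs /= probs.sum()                          # softmax over 50,257 candidates
print(probs.shape, int(probs.argmax()))       # (50257,) and the greedy pick
```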

How it's powered

The pipeline data is a pre-computed forward pass of real GPT-2 small (124M params, 12 layers × 12 heads, d=768) on five curated prompts. For each prompt, we record the input tokenization, a per-layer residual-norm summary, and the top-30 next-token candidates at every greedy generation step. The total payload is ~150 KB.
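A payload like that could be produced with the Hugging Face transformers library roughly as follows. The field names, prompt, step count, and file layout here are illustrative assumptions, not the demo's actual schema:

```python
import json
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def record_prompt(prompt, steps=8, top_k=30):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    data = {"tokens": tokenizer.convert_ids_to_tokens(ids[0].tolist()), "steps": []}

    for _ in range(steps):
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        # hidden_states = embedding output plus one residual-stream snapshot per block (13 tensors)
        norms = [h[0, -1].norm().item() for h in out.hidden_states]
        top = out.logits[0, -1].softmax(-1).topk(top_k)
        data["steps"].append({
            "residual_norms": norms,
            "candidates": [
                {"token": tokenizer.decode([i]), "p": round(p, 5)}
                for i, p in zip(top.indices.tolist(), top.values.tolist())
            ],
        })
        ids = torch.cat([ids, top.indices[:1].view(1, 1)], dim=1)  # greedy: append the top candidate
    return data

with open("pipeline.json", "w") as f:
    json.dump(record_prompt("The capital of France is"), f)
```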

Why pre-compute: running GPT-2 in the browser via transformers.js works but adds ~125 MB of model weights to the page load and several seconds of warm-up. For a teaching demo, the static JSON is better in every way except "let users type their own prompt" — which is the obvious v4.

Anchored to 06-transformers/transformer-block and 08-prompting/sampling-and-decoding from the learning path.