step 15 · build
Run it in your browser
Capstone: load your trained ONNX model into onnxruntime-web and generate text from a single HTML page. No Python inference stack, no backend server.
This is the closing image of the curriculum. You've built every piece. Now we put your trained model into a handful of static files, hand them to a browser, and let it generate text with no Python in the loop. No Colab kernel, no inference backend. Just tiny_llm.onnx, a tokenizer JSON, and ~150 lines of HTML+JS.
By the end you'll have a static page you can serve with a one-line command. Type a prompt, click "Generate," and watch your model (the one you built in steps 02–09, fine-tuned in step 12, exported in step 14) produce text in real time.
That’s the experiential point of the whole curriculum: a working LLM, every line of which you wrote, running anywhere there’s a browser.
What we’re building
A directory like:
browser-demo/
├── index.html
├── app.js
├── tiny_llm.onnx # exported in step 14
└── tokenizer.json # exported in this step
Serve the folder, open index.html. Get a model.
The architecture: onnxruntime-web loads the ONNX file and runs inference via WebGPU (if available) or WebAssembly (fallback). A pure-JS port of our BPE tokenizer handles encoding/decoding. JS sampling logic mirrors model.sample() from step 10.
Step 1: serialize the tokenizer
The Python BPETokenizer from step 02 stores vocab and merges. We need both in a portable JSON the browser can load.
Add a serialization method to tiny_llm/tokenize.py:
# tiny_llm/tokenize.py — add to BPETokenizer class
import json
from pathlib import Path

    def save(self, path: Path) -> None:
        """Save vocab + merges as a single JSON file."""
        data = {
            "vocab": self.vocab,
            "merges": self.merges,
            "special_tokens": self.SPECIAL_TOKENS,
        }
        Path(path).write_text(json.dumps(data, ensure_ascii=False, indent=2))

    @classmethod
    def load(cls, path: Path) -> "BPETokenizer":
        """Inverse of save()."""
        data = json.loads(Path(path).read_text())
        tok = cls()
        tok.vocab = data["vocab"]
        tok.id_to_token = {i: t for t, i in tok.vocab.items()}
        tok.merges = [tuple(m) for m in data["merges"]]
        return tok
Now save the tokenizer alongside your ONNX export. Adjust tiny_llm/export.py:
# tiny_llm/export.py — add at the bottom of __main__
from tiny_llm.data import prepare
tok = prepare()
tok.save(Path("checkpoints/tokenizer.json"))
print(f"saved tokenizer.json ({Path('checkpoints/tokenizer.json').stat().st_size / 1024:.1f} KB)")
Re-run:
uv run python -m tiny_llm.export
You should now have:
checkpoints/tiny_llm.onnx (~21 MB)
checkpoints/tokenizer.json (~80 KB at 4096 vocab)
Both small enough to ship as static assets.
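Before moving to the browser, a quick round-trip check is cheap insurance. This is a throwaway snippet, not part of the package, and it assumes the encode/decode methods from step 02:

# check_tokenizer.py — confirm the JSON survives a save/load cycle
from pathlib import Path
from tiny_llm.tokenize import BPETokenizer

tok = BPETokenizer.load(Path("checkpoints/tokenizer.json"))
ids = tok.encode("Once upon a time")
print(ids)
print(tok.decode(ids))  # should round-trip back to something close to the prompt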
Step 2: the HTML scaffold
Create browser-demo/index.html:
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>tiny-llm in your browser</title>
  <style>
    body { font-family: system-ui, sans-serif; max-width: 720px; margin: 2rem auto; padding: 0 1rem; line-height: 1.5; }
    textarea { width: 100%; min-height: 80px; padding: 0.5rem; font: 14px/1.5 monospace; }
    button { padding: 0.5rem 1rem; font-size: 1rem; cursor: pointer; }
    pre { background: #f4f4f4; padding: 1rem; border-radius: 4px; white-space: pre-wrap; min-height: 4rem; }
    #status { font-size: 0.9rem; color: #666; }
  </style>
</head>
<body>
  <h1>tiny-llm</h1>
  <p>A 5M-parameter model, trained from scratch, running entirely in your browser via WebGPU.</p>
  <div id="status">loading model…</div>
  <textarea id="prompt" placeholder="Type a prompt, e.g. 'Once upon a time'"></textarea>
  <p>
    <label>tokens to generate: <input type="number" id="ntokens" value="80" min="1" max="200"></label>
    <label>temperature: <input type="number" id="temp" value="0.8" step="0.1" min="0" max="2"></label>
    <button id="generate" disabled>Generate</button>
  </p>
  <pre id="output"></pre>
  <!-- onnxruntime-web from a CDN. The webgpu bundle also ships the WASM backend. For production, vendor it locally. -->
  <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.19.0/dist/ort.webgpu.min.js"></script>
  <script type="module" src="./app.js"></script>
</body>
</html>
Plain HTML — no framework. The whole thing is one app.js.
Step 3: the JavaScript
Create browser-demo/app.js:
// app.js — load ONNX model, port BPE tokenizer to JS, sampling loop.
const status = document.getElementById('status');
const button = document.getElementById('generate');
const promptEl = document.getElementById('prompt');
const outputEl = document.getElementById('output');
// === Load ONNX model and tokenizer ===
async function loadModel() {
  status.textContent = 'loading tokenizer…';
  const tokResponse = await fetch('./tokenizer.json');
  const tokData = await tokResponse.json();

  status.textContent = 'loading model (~21 MB)…';
  const session = await ort.InferenceSession.create('./tiny_llm.onnx', {
    executionProviders: ['webgpu', 'wasm'], // try WebGPU first, fall back to WASM
    graphOptimizationLevel: 'all',
  });

  status.textContent = 'ready';
  button.disabled = false;
  return { session, tokenizer: makeTokenizer(tokData) };
}
// === BPE tokenizer in JavaScript ===
// Mirror of tiny_llm/tokenize.py's encode/decode logic.
function makeTokenizer(data) {
  const vocab = data.vocab; // { "the": 12, ... }
  const merges = data.merges.map(([a, b]) => [a, b]);
  const idToToken = {};
  for (const [tok, id] of Object.entries(vocab)) idToToken[id] = tok;

  function tokenizeWord(word) {
    let tokens = word.split('');
    tokens.push('</w>');
    for (const [a, b] of merges) {
      const merged = [];
      let i = 0;
      while (i < tokens.length) {
        if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
          merged.push(a + b);
          i += 2;
        } else {
          merged.push(tokens[i]);
          i += 1;
        }
      }
      tokens = merged;
    }
    return tokens;
  }

  function encode(text) {
    const ids = [];
    for (const line of text.split('\n')) {
      for (const word of line.split(/\s+/).filter(Boolean)) {
        for (const tok of tokenizeWord(word)) {
          ids.push(vocab[tok] ?? vocab['<|unk|>']);
        }
      }
    }
    return ids;
  }

  function decode(ids) {
    return ids
      .map((id) => idToToken[id] ?? '<|unk|>')
      .join('')
      .replace(/<\/w>/g, ' ')
      .trim();
  }

  return { encode, decode, vocabSize: Object.keys(vocab).length };
}
// === Sampling loop ===
function softmax(logits, temperature) {
  const scaled = logits.map((x) => x / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function sampleFromProbs(probs) {
  const r = Math.random();
  let cum = 0;
  for (let i = 0; i < probs.length; i++) {
    cum += probs[i];
    if (r < cum) return i;
  }
  return probs.length - 1;
}
async function generate(session, tokenizer, prompt, nTokens, temperature) {
  let ids = tokenizer.encode(prompt);
  const promptLen = ids.length; // capture now: ids grows as we generate
  const decoder = (idsArr) => prompt + ' ' + tokenizer.decode(idsArr.slice(promptLen));
  for (let step = 0; step < nTokens; step++) {
    // Build the input tensor: BigInt64Array, shape [1, ids.length].
    const inputArr = BigInt64Array.from(ids.map(BigInt));
    const inputTensor = new ort.Tensor('int64', inputArr, [1, ids.length]);

    // Run the model. Returns { logits: Tensor of shape [1, T, vocab_size] }.
    const results = await session.run({ token_ids: inputTensor });
    const logits = results.logits.data; // Float32Array, length 1 * T * V
    const V = tokenizer.vocabSize;
    const T = ids.length;

    // Take the last position's logits.
    const lastLogits = Array.from(logits.slice((T - 1) * V, T * V));
    const probs = softmax(lastLogits, temperature);
    const nextId = sampleFromProbs(probs);
    ids.push(nextId);

    // Live update — show the partial output as it streams.
    outputEl.textContent = decoder(ids);
  }
  return decoder(ids);
}
// === Wire it up ===
const { session, tokenizer } = await loadModel();
button.addEventListener('click', async () => {
  button.disabled = true;
  outputEl.textContent = '';
  try {
    const prompt = promptEl.value.trim() || 'Once upon a time';
    const nTokens = parseInt(document.getElementById('ntokens').value, 10);
    const temperature = parseFloat(document.getElementById('temp').value);
    await generate(session, tokenizer, prompt, nTokens, temperature);
  } catch (e) {
    outputEl.textContent = 'Error: ' + e.message;
    console.error(e);
  } finally {
    button.disabled = false;
  }
});
Some things worth understanding:
Execution providers. webgpu is the browser GPU backend; where it's available (Chrome, Edge) it is several times faster than WASM for transformer inference (see the performance table below for the gap at our scales). If WebGPU isn't available (Safari as of writing, or older browsers), onnxruntime-web falls back to wasm, which works but is slower. The session object doesn't report which provider it ended up using; check the DevTools console, where onnxruntime-web typically logs a warning before falling back from WebGPU to WASM.
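If you'd rather decide up front than rely on the silent fallback, you can check for WebGPU support before creating the session. A minimal sketch of the relevant part of loadModel(); note that navigator.gpu being present doesn't guarantee adapter creation will succeed:

// navigator.gpu exists only in WebGPU-capable browsers.
const providers = ('gpu' in navigator) ? ['webgpu', 'wasm'] : ['wasm'];
status.textContent = 'backend: ' + providers[0];
const session = await ort.InferenceSession.create('./tiny_llm.onnx', {
  executionProviders: providers,
  graphOptimizationLevel: 'all',
});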
BigInt64Array for token IDs. ONNX exports our model with int64 token IDs (PyTorch’s default). JavaScript represents int64 as BigInt, hence the cast. Slightly fiddly but unavoidable.
Per-step full forward pass. This is the uncached version. Each step re-runs the model on the entire prefix — quadratic in output length. For 80 tokens it’s tolerable; for 500 you’d want the KV-cached ONNX export from step 14’s “what we didn’t” section.
No top-p filter in this script. I left it out for clarity; it’s a 15-line addition (sort logits descending, find the cumulative-prob crossing, mask). Same logic as Python sample() in step 10.
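For reference, here is one way to write that addition. It's a sketch that works on the probabilities rather than the raw logits (sampleTopP is my name, not something defined in step 10):

// Nucleus (top-p) sampling: keep the most-probable tokens whose cumulative
// probability just exceeds p, then sample from that set, renormalized.
function sampleTopP(probs, p = 0.9) {
  const sorted = probs.map((prob, i) => [prob, i]).sort((a, b) => b[0] - a[0]);
  const kept = [];
  let cum = 0;
  for (const [prob, i] of sorted) {
    kept.push([prob, i]);
    cum += prob;
    if (cum >= p) break;
  }
  let r = Math.random() * cum; // cum is the total mass of the kept set
  for (const [prob, i] of kept) {
    r -= prob;
    if (r <= 0) return i;
  }
  return kept[kept.length - 1][1];
}

In generate(), `const nextId = sampleTopP(probs, 0.9);` would replace the sampleFromProbs call.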
Step 4: serve and run
Browsers won’t fetch() files from file:// URLs, so we need a local static server. The shortest path:
cd browser-demo
python -m http.server 8000
Then open http://localhost:8000/ in Chrome. The first load downloads the ONNX file (~21 MB, then cached); subsequent loads are instant.
Type “Once upon a time” and click Generate. You should see your model’s output stream in, token by token.
What it actually feels like
The first time it works, the visceral fact lands: this is your model running. Not GPT-4 on a server you're paying for. Not a HuggingFace demo with a Spaces backend. Your weights, in your browser, on your device, generating tokens you can copy and paste.
The output is bounded by the model's quality: 5M params on TinyStories isn't going to write War and Peace. But the model understands the prompt format, produces grammatical English, and stays on theme. That's a language model genuinely trained from scratch, shipping as a few static files you can open in a browser.
Performance notes
On a recent laptop running Chrome with WebGPU enabled:
| Config | First load | Per-token (WebGPU) | Per-token (WASM) |
|---|---|---|---|
| TINY (5M) | ~1s | ~50ms | ~200ms |
| SMALL (17M) | ~3s | ~80ms | ~600ms |
| MEDIUM (85M) | ~12s | ~250ms | ~3s |
WASM is fine for the smallest configs; for anything larger you really want WebGPU. Safari and older browsers will fall back to WASM and feel slow on the medium config.
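To reproduce these numbers on your own hardware, time the forward pass directly. A minimal sketch, dropped into the loop in generate() around the existing session.run call:

// Time just the ONNX forward pass for this step.
const t0 = performance.now();
const results = await session.run({ token_ids: inputTensor });
console.log(`step ${step}: ${(performance.now() - t0).toFixed(1)} ms`);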
To cut load time, quantize the ONNX file. Add to tiny_llm/export.py:
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    Path("checkpoints/tiny_llm.onnx"),
    Path("checkpoints/tiny_llm.q8.onnx"),
    weight_type=QuantType.QInt8,
)
Cuts file size by ~4× with negligible quality loss at our scale. Point app.js at the quantized version.
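One low-effort way to wire that in: have loadModel() try the quantized file first and fall back to the full-precision one if it isn't there. A sketch, using the file names exported above:

// Prefer the int8 model when it has been exported; otherwise use the fp32 one.
const opts = { executionProviders: ['webgpu', 'wasm'], graphOptimizationLevel: 'all' };
let session;
try {
  session = await ort.InferenceSession.create('./tiny_llm.q8.onnx', opts);
} catch (e) {
  session = await ort.InferenceSession.create('./tiny_llm.onnx', opts);
}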
What we did and didn’t do
What we did:
- Tokenizer JSON serialization
- A static HTML+JS app that loads the ONNX model and the tokenizer
- BPE encode/decode ported to JavaScript
- Sampling with temperature
- Live token streaming as output is generated
- WebGPU + WASM fallback
What we didn’t:
- In-browser training. What I'd love to ship: train a 1M-param model in the browser via WebGPU and watch it learn. The infrastructure exists (WebNN, onnxruntime-web's training mode, tinygrad) but it's still in flight. For now, training stays in PyTorch; only inference runs in the browser.
- Streaming via the KV cache. The current loop re-runs the full forward pass every step. A ~10× speedup is available if you export with cache support and update the JS to maintain the cache across calls.
- Top-p / top-k in JS. Only temperature in the demo above; the production sampler should match sample() from step 10. Easy to add (a top-p sketch appears earlier in this step).
- A nicer UI. Mine is intentionally barebones (system fonts, default elements). The point is that the model works; the framing is yours to design.
Cross-references
- Inference Pipeline demo — the full forward pass, instrumented step by step. The browser demo is a simplified version of what that demo runs.
- Quantization Lab demo — what int8 quantization actually does to weights.
- Cost & Latency Calculator demo — production-scale tradeoffs around batching, caching, and speculative decoding.
Next
Step 16 is the “where to go from here” article — the curriculum’s exit ramp. RoPE, MoE, RLHF/DPO, distillation, the actual papers behind production LLMs, and pointers to the next ~dozen things you might want to learn now that the base model is no longer mysterious.