
From 32 bits to 4 to 2

Quantization: cast model weights from FP32 to INT8 to INT4. Watch memory drop 8× while error stays manageable. The trick that lets a 7B model run comfortably on a laptop.

The math (this is what the demo computes)

# quantize (FP32 → integer):
scale     = (max(W) − min(W)) / (2^bits − 1)   # asymmetric
zero      = round(−min(W) / scale)             # so 0 maps to a valid int
W_int     = round(W / scale + zero)            # integers in [0, 2^bits − 1]

# dequantize (integer → FP32):
W_back    = (W_int − zero) · scale

# error per cell:
err       = W_back − W                          # bounded by ±scale/2
RMSE      = √(mean(err²))

# memory ratio:
ratio     = bits / 32                            # 4-bit = 1/8 the memory
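
The same arithmetic as runnable code, if you want to poke at it outside the demo. A minimal NumPy sketch, not the demo's source: the random matrix W stands in for real weights, and the explicit clip is an assumption (rounding can push an edge value one step out of range).

import numpy as np

bits = 4
W = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)  # stand-in weights

# quantize (FP32 → integer), asymmetric
qmax  = 2**bits - 1
scale = (W.max() - W.min()) / qmax
zero  = np.round(-W.min() / scale)                    # the integer that represents 0.0
W_int = np.clip(np.round(W / scale + zero), 0, qmax)  # integers in [0, qmax]

# dequantize (integer → FP32)
W_back = (W_int - zero) * scale

# error and memory ratio
err  = W_back - W                                     # each entry within ±scale/2
rmse = float(np.sqrt(np.mean(err ** 2)))
print(f"bits={bits}  RMSE={rmse:.4f}  memory ratio={bits / 32}")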

Try this — predict before you click

  1. Start with 16 bits (FP16 territory). Predict: error matrix is mostly black; RMSE ≪ 0.01; memory ratio is 0.5. This is "free" — most production models ship in FP16 by default.
  2. Drop to 8 bits (INT8). Predict: memory ratio = 0.25 (4× compression); RMSE rises but is still tiny; the error matrix shows tiny grid-like patterns from the rounding. This is what most "quantized" models ship as.
  3. Drop to 4 bits (the QLoRA / GPTQ / AWQ default). Predict: memory ratio = 0.125 (8×); RMSE roughly 16× higher than INT8, since the error roughly doubles for every bit removed; error matrix shows visible banding. Acceptable for inference, painful for fine-tuning without LoRA. (The sweep sketched after this list puts numbers on steps 1 through 4.)
  4. Drop to 2 bits. Predict: only 4 possible values per weight; the quantized matrix collapses to a coarse posterization; RMSE shoots up. This is the "research territory" tier: you need cleverer schemes (per-group scales as in HQQ, or training the model with low-bit weights from the start as in BitNet) to make it work.
  5. At 4 bits, toggle between linear and asymmetric schemes on the same data. Predict: asymmetric usually wins on RMSE because it shifts the zero-point to cover the actual [min, max] range. Linear (symmetric, zero-point pinned at 0) is simpler but wastes part of the range whenever the weights aren't centered at 0. (The scheme comparison sketched after this list checks this.)
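
A quick numeric check on predictions 1 through 4. The formulas are repeated inline so the snippet stands alone; the 64×64 random matrix and the seed are arbitrary choices, not what the demo uses.

import numpy as np

W = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)

for bits in (16, 8, 4, 2):
    qmax   = 2**bits - 1
    scale  = (W.max() - W.min()) / qmax
    zero   = np.round(-W.min() / scale)
    W_int  = np.clip(np.round(W / scale + zero), 0, qmax)
    W_back = (W_int - zero) * scale
    rmse   = np.sqrt(np.mean((W_back - W) ** 2))
    print(f"{bits:2d} bits  memory ratio={bits / 32:.4f}  RMSE={rmse:.5f}")
# expect the ratio to step 0.5 → 0.25 → 0.125 → 0.0625
# and the RMSE to roughly double for every bit removed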
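
And a check on prediction 5. This sketch pins the symmetric ("linear") zero-point at 0 and sets its scale from the largest absolute weight, which is one common convention; the demo's exact linear scheme may differ. The +0.3 shift is there only to make the weights visibly off-center.

import numpy as np

bits = 4
W = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32) + 0.3  # off-center on purpose

# linear / symmetric: zero-point fixed at 0, signed integers in [-qmax_s, qmax_s]
qmax_s  = 2**(bits - 1) - 1
scale_s = np.abs(W).max() / qmax_s
W_sym   = np.clip(np.round(W / scale_s), -qmax_s, qmax_s) * scale_s

# asymmetric: zero-point shifts so the whole [min, max] range is used
qmax_a  = 2**bits - 1
scale_a = (W.max() - W.min()) / qmax_a
zero    = np.round(-W.min() / scale_a)
W_asym  = (np.clip(np.round(W / scale_a + zero), 0, qmax_a) - zero) * scale_a

def rmse(A):
    return float(np.sqrt(np.mean((A - W) ** 2)))

print(f"symmetric RMSE={rmse(W_sym):.4f}   asymmetric RMSE={rmse(W_asym):.4f}")
# asymmetric usually wins when the weights are not centered on 0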

Anchored to 13-production/cost-and-latency and 10-fine-tuning/lora-and-qlora. Code-side: /ship/14 — cost and latency.