demo
From 32 bits to 4 to 2
Quantization: cast model weights from FP32 down to INT8, INT4, even INT2. Watch memory drop 8× at 4 bits while error stays manageable. The trick that lets you run a 7B model on a Raspberry Pi.
The math (this is what the demo computes)
# quantize (FP32 → integer):
scale = (max(W) − min(W)) / (2^bits − 1) # asymmetric
zero = round(−min(W) / scale) # so 0 maps to a valid int
W_int = round(W / scale + zero) # integers in [0, 2^bits − 1]
# dequantize (integer → FP32):
W_back = (W_int − zero) · scale
# error per cell:
err = W_back − W # bounded by ±scale/2
RMSE = √(mean(err²))
# memory ratio:
ratio = bits / 32 # 4-bit = 1/8 the memory
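The same math as a runnable sketch. This is a minimal NumPy version, not the demo's actual source; the asymmetric_quantize name and the Gaussian test matrix are just illustrative stand-ins.

import numpy as np

def asymmetric_quantize(W, bits):
    # Quantize an FP32 array to bits-bit integers, then dequantize and measure the error.
    levels = 2 ** bits - 1                       # number of integer steps
    scale = (W.max() - W.min()) / levels         # step size (asymmetric)
    zero = np.round(-W.min() / scale)            # integer that represents 0.0
    W_int = np.clip(np.round(W / scale + zero), 0, levels)
    W_back = (W_int - zero) * scale              # dequantize back to FP32
    rmse = np.sqrt(np.mean((W_back - W) ** 2))   # per-cell error, root-mean-square
    return W_int.astype(np.int32), W_back, rmse

W = np.random.default_rng(0).normal(0, 0.02, (64, 64)).astype(np.float32)
bits = 4
W_int, W_back, rmse = asymmetric_quantize(W, bits)
print(f"{bits}-bit RMSE: {rmse:.6f}, memory ratio: {bits / 32:.3f}")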
Try this — predict before you click
- Start with 16 bits (FP16 territory). Predict: error matrix is mostly black; RMSE ≪ 0.01; memory ratio is 0.5. This is "free" — most production models ship in FP16 by default.
- Drop to 8 bits (INT8). Predict: memory ratio = 0.25 (4× compression); RMSE rises but is still tiny; the error matrix shows tiny grid-like patterns from the rounding. This is what most "quantized" models ship as.
- Drop to 4 bits (the QLoRA / GPTQ / AWQ default). Predict: memory ratio = 0.125 (8×); RMSE roughly 16× higher than INT8, since each bit you drop about doubles the step size (the bit-width sweep after this list checks this); error matrix shows visible banding. Acceptable for inference, painful for fine-tuning without LoRA.
- Drop to 2 bits. Predict: only 4 possible values per weight; the quantized matrix collapses to a coarse posterization; RMSE shoots up. This is the "research territory" tier — a single per-tensor scale isn't enough here; you need per-group scales and zero-points (HQQ) or quantization-aware training (BitNet) to make it work.
- At 4 bits, toggle between linear and asymmetric schemes on the same data. Predict: asymmetric usually wins on RMSE because it adjusts the zero-point to fit the weights' actual min/max. Linear (symmetric around 0) is simpler but wastes part of the range when the weights aren't centered at 0 (see the comparison sketch after this list).
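To sanity-check the RMSE and memory-ratio predictions above, here is a small sweep over bit widths. Again a NumPy sketch on random Gaussian stand-in weights, so the exact numbers will differ from what the demo shows on its own matrix.

import numpy as np

W = np.random.default_rng(0).normal(0, 0.02, (64, 64)).astype(np.float32)

for bits in (16, 8, 4, 2):
    levels = 2 ** bits - 1
    scale = (W.max() - W.min()) / levels      # step size roughly doubles per bit dropped
    zero = np.round(-W.min() / scale)
    W_back = (np.clip(np.round(W / scale + zero), 0, levels) - zero) * scale
    rmse = np.sqrt(np.mean((W_back - W) ** 2))
    print(f"{bits:2d} bits  memory ratio {bits / 32:.4f}  RMSE {rmse:.2e}")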
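For the last item, a sketch that compares the two schemes at 4 bits. "Linear" is read here as symmetric quantization with the zero-point pinned to the middle of the integer range; that reading of the demo's naming is an assumption.

import numpy as np

def quant_rmse(W, bits, asymmetric):
    levels = 2 ** bits - 1
    if asymmetric:
        scale = (W.max() - W.min()) / levels   # use the actual [min, max] range
        zero = np.round(-W.min() / scale)      # zero-point adapts to the data
    else:
        scale = 2 * np.abs(W).max() / levels   # symmetric range around 0
        zero = (levels + 1) // 2               # zero-point fixed at mid-range
    W_back = (np.clip(np.round(W / scale + zero), 0, levels) - zero) * scale
    return np.sqrt(np.mean((W_back - W) ** 2))

# Weights deliberately not centered at 0, so the zero-point matters.
W = np.random.default_rng(1).normal(0.03, 0.02, (64, 64)).astype(np.float32)
print("linear (symmetric):", quant_rmse(W, bits=4, asymmetric=False))
print("asymmetric:        ", quant_rmse(W, bits=4, asymmetric=True))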
Anchored to 13-production/cost-and-latency and 10-fine-tuning/lora-and-qlora.
Code-side: /ship/14 — cost and latency.