Fine-tune a 7B model in 17 MB

LoRA: instead of updating a giant weight matrix, train two thin slices that multiply to approximate the update. Slide the rank, watch the approximation get sharper, see the parameter count drop by 800×.

The trick

Hu et al. (2021) noticed something surprising: when you fine-tune a large model, the weight updates have low intrinsic rank. Most of the change in a layer can be captured by a low-rank matrix. So instead of training the full d × d update, train two thin rectangles B (d × r) and A (r × d) whose product approximates it. With r ≪ d, that cuts the trainable parameters in a layer from d² to 2dr, while the pretrained weight W stays frozen.
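The idea fits in a few lines. A minimal NumPy sketch (dimensions and inits are illustrative, not taken from the paper): the frozen weight W is untouched, and the forward pass adds the low-rank correction B(Ax). Note B starts at zero, so at initialization the adapted model is exactly the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8  # layer width and LoRA rank (illustrative)

W = rng.standard_normal((d, d))         # frozen pretrained weight
B = np.zeros((d, r))                    # zero-init: the update BA starts at 0
A = 0.01 * rng.standard_normal((r, d))  # small random init; only A and B train

x = rng.standard_normal(d)

# Forward pass: frozen path plus low-rank correction
y = W @ x + B @ (A @ x)

# Trainable parameter count vs. a full d x d update
full_params = d * d          # 262,144
lora_params = 2 * d * r      # 8,192  -> 32x fewer at this toy size
```

At deployment time BA can be folded into W once (`W + B @ A`), so inference pays no extra cost.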

The numbers

  • Full fine-tuning of a 7B model: ~14 GB of trainable parameters.
  • LoRA at r=8 on the same model: ~17 MB. 800× reduction.
  • And the resulting model performs nearly identically on most tasks.
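The headline figures above follow from simple arithmetic. A back-of-the-envelope check, under illustrative assumptions (fp16 storage at 2 bytes per parameter, and LoRA r=8 applied to the four attention projection matrices of a Llama-7B-like architecture with 32 layers of width 4096):

```python
# Full fine-tuning: every parameter is trainable
full_bytes = 7e9 * 2  # 7B params x 2 bytes (fp16) ~= 14 GB

# LoRA r=8 on attention projections (assumed Llama-7B-like shapes)
layers, d, r = 32, 4096, 8
matrices_per_layer = 4  # q, k, v, o projections
lora_params = layers * matrices_per_layer * 2 * d * r  # 2*d*r per adapted matrix
lora_bytes = lora_params * 2

print(f"full fine-tune: {full_bytes / 1e9:.0f} GB")   # 14 GB
print(f"LoRA r=8:       {lora_bytes / 1e6:.0f} MB")   # 17 MB
print(f"reduction:      {full_bytes / lora_bytes:.0f}x")  # ~800x
```

The exact factor depends on which matrices get adapters; adapting only attention projections, as in the original paper's main experiments, lands near the 800× quoted above.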

Why it works

The singular value spectrum of fine-tuning updates is heavily front-loaded: the top few singular values dwarf the rest. Truncating after rank r therefore captures most of the spectral energy. This is LoRA's empirical foundation; the visualization above shows the spectrum for synthetic data with the same property.
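You can reproduce the effect with synthetic data much like the visualization's. The sketch below builds a fake "update" with low intrinsic rank plus noise (illustrative, not measured from a real model) and checks how much energy the top singular values hold:

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 256, 4

# Synthetic fine-tuning update: a rank-4 signal plus small full-rank noise,
# giving the front-loaded spectrum described above
delta = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))
delta += 0.01 * rng.standard_normal((d, d))

s = np.linalg.svd(delta, compute_uv=False)  # singular values, descending
r = 8
energy = (s[:r] ** 2).sum() / (s ** 2).sum()
print(f"top-{r} of {d} singular values capture {energy:.1%} of the energy")
```

For real fine-tuning updates the spectrum is less extreme than this toy case, but the same concentration is what lets small r work in practice.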

Anchored to 10-fine-tuning/lora-and-qlora.