Fine-tune a 7B model in 17 MB
LoRA: instead of updating a giant weight matrix, train two thin slices that multiply to approximate the update. Slide the rank, watch the approximation get sharper, see the parameter count drop by 800×.
The trick
Hu et al. (2021) noticed something surprising: when you fine-tune a large model, the weight updates have low intrinsic rank. Most of the change in a layer can be captured by a low-rank matrix. So instead of training the full d × d update, train two thin matrices B (d × r) and A (r × d), with r ≪ d, that multiply to approximate it.
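A minimal sketch of the idea in NumPy. The sizes d and r are illustrative, not from the text; the pretrained weight W stays frozen, and only B and A would receive gradients. Following the paper's initialization, B starts at zero so the adapted layer initially matches the base layer exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # illustrative sizes, r << d

W = rng.standard_normal((d, d))         # frozen pretrained weight (not trained)
B = np.zeros((d, r))                    # zero init: BA starts as the zero update
A = rng.standard_normal((r, d)) * 0.01  # small random init

def forward(x, alpha=1.0):
    # Base path plus the low-rank correction; only A and B are trainable.
    return x @ W.T + alpha * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
# With B = 0 the correction vanishes, so the output equals the base model's.
assert np.allclose(forward(x), x @ W.T)
```

Training updates only the 2·d·r parameters in B and A instead of the d² parameters in W.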
The numbers
- Full fine-tuning of a 7B model: ~14 GB of trainable weights (7B parameters in fp16).
- LoRA at r=8 on the same model: ~17 MB of adapter weights, an ~800× reduction.
- And the resulting model performs nearly identically on most tasks.
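A back-of-the-envelope check of those numbers. This sketch assumes a LLaMA-style 7B architecture (32 layers, hidden size 4096), LoRA applied to the attention query and value projections as in Hu et al., fp16 base weights, and fp32 adapters; none of these specifics appear in the text above.

```python
# Full fine-tuning: every parameter is trainable.
full_bytes = 7e9 * 2  # 7B params in fp16 -> ~14 GB

# LoRA at r=8 on q and v projections (assumed setup, see lead-in).
layers, d, r = 32, 4096, 8
lora_params = layers * 2 * (d * r + r * d)  # B (d x r) + A (r x d) per matrix
lora_bytes = lora_params * 4                # fp32 adapter weights

print(f"{lora_params/1e6:.1f}M params, {lora_bytes/1e6:.0f} MB, "
      f"{full_bytes/lora_bytes:.0f}x smaller")
```

Under these assumptions the adapters come to about 4.2M parameters (~17 MB), roughly 800× smaller than the full fine-tune's footprint.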
Why it works
The singular value spectrum of fine-tuning updates is heavily front-loaded — the top few singular values dwarf the rest. Truncating after rank r captures most of the energy. This is LoRA's empirical foundation; the visualization above shows the spectrum for synthetic data with the same property.
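A short sketch of that truncation argument on synthetic data (a hypothetical update built as a strong rank-r component plus small full-rank noise, mirroring the front-loaded spectrum described above):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 128, 8  # illustrative sizes

# Synthetic weight update: dominant rank-r structure + small dense noise.
dW = rng.standard_normal((d, r)) @ rng.standard_normal((r, d)) \
     + 0.05 * rng.standard_normal((d, d))

# Energy captured by keeping only the top-k singular values.
s = np.linalg.svd(dW, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
print(f"rank-{r} truncation captures {energy[r-1]:.1%} of the energy")
```

Because the spectrum is front-loaded, the rank-r truncation here retains well over 99% of the squared Frobenius norm; real fine-tuning updates are noisier, but the paper reports the same qualitative shape.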