LoRA & QLoRA
Fully fine-tuning a 70B model requires hundreds of GB of GPU memory. Most teams can’t do that. LoRA (Low-Rank Adaptation) and QLoRA make fine-tuning feasible on a single consumer GPU.
The idea behind LoRA
Hu et al. (2021). Instead of updating the full weight matrix W of size (d, d), freeze W and learn a low-rank update:
W_new = W + ΔW where ΔW = B · A
A is (r, d) and B is (d, r), with r << d (rank r typically 4–64). Total trained parameters: 2 · r · d instead of d².
For a 70B model, full fine-tuning means updating all 70B parameters. LoRA with r = 16 updates well under 0.1% of that: trainable params measured in millions, not billions.
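To make that concrete, a back-of-envelope count. The shapes below are illustrative assumptions (roughly Llama-70B-like: hidden size 8192, 80 layers, Q and V projections adapted in each layer), not figures from any model card:

```python
# LoRA trainable parameters vs. full fine-tuning, back of the envelope.
d, r, layers, adapted_per_layer = 8192, 16, 80, 2

# Each adapted matrix adds A (r, d) plus B (d, r) = 2 * r * d parameters.
lora_params = layers * adapted_per_layer * 2 * r * d
full_params = 70e9  # full fine-tuning updates everything

print(lora_params)               # 41,943,040: tens of millions
print(lora_params / full_params) # ~0.0006, well under 0.1%
```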
Why it works
The empirical claim, well-supported: the important changes during fine-tuning lie in a low-dimensional subspace. You don’t need to re-learn the model; you only need to nudge it.
Underlying intuition: pretrained models are already richly capable; fine-tuning adapts a small “direction” of weights. That direction can be approximated by a low-rank matrix.
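That intuition can be made concrete in a few lines of numpy: any rank-r matrix factors exactly into B · A, so 2 · r · d numbers suffice to represent a d × d update (illustrative sizes only):

```python
# A rank-r update factors exactly as B @ A; SVD recovers the factors.
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
delta_w = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))  # a rank-4 "update"

u, s, vt = np.linalg.svd(delta_w)
B = u[:, :r] * s[:r]  # (d, r), singular values folded into B
A = vt[:r, :]         # (r, d)

print(np.allclose(B @ A, delta_w))  # True: 2*r*d numbers reproduce the d*d update
```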
How it’s applied
LoRA targets specific weight matrices in the transformer. Most commonly:
- The attention Q and V projections.
- Sometimes also the K, O, and FFN matrices.
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
```
The base model’s weights stay frozen (no gradient flow); only A and B for each targeted matrix train.
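A minimal sketch of what “frozen” means in practice: a hand-rolled LoRA linear layer in PyTorch (illustrative, not PEFT’s actual implementation). Note that B starts at zero, so the adapter is a no-op before training:

```python
# Hand-rolled LoRA linear layer: frozen base weight, trainable A and B.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W: no gradient flow
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        # W x plus the scaled low-rank update (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64, bias=False))
out = layer(torch.randn(4, 64))
out.sum().backward()
print(layer.base.weight.grad)  # None: the frozen base receives no gradient
print(layer.A.grad is not None, layer.B.grad is not None)  # True True
```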
Hyperparameters
- rank `r`: 4–64. Higher = more capacity, more parameters. Default: 8 or 16.
- alpha: scaling factor for the LoRA update. Common: `alpha = 2 * r`. Effectively a learning-rate-style knob.
- target modules: which weights to adapt. Attention is the safe default; FFN can be added if quality plateaus.
- dropout: 0.0–0.1.
- bias: usually `"none"`. Adapting biases adds tiny benefit.
Inference modes
After training, you have two options:
1. Merge adapter into base weights
```python
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")
```
The result is a regular model file. No special inference code needed; works with any framework.
Cost: you lose the ability to swap adapters at runtime.
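A small numpy sketch of why merging is free at inference time: folding the scaled update into W produces identical outputs with a single matmul (illustrative sizes; `scale` stands in for alpha/r):

```python
# Adapter-at-runtime vs. merged weights: same outputs, fewer matmuls.
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 8
W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d))
B = rng.normal(size=(d, r))
scale = 2.0  # stands in for alpha / r
x = rng.normal(size=(d,))

adapter_out = W @ x + scale * (B @ (A @ x))  # base + adapter: extra matmuls
merged_out = (W + scale * B @ A) @ x         # merged: one matmul, no adapter code

print(np.allclose(adapter_out, merged_out))  # True
```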
2. Keep the adapter separate
Load base + adapter at inference time. Lets you serve multiple LoRAs from one base — switch behaviors per request without loading multiple full models.
Tools: vLLM has built-in LoRA serving; LoRAX / TGI / SGLang also support this.
For a B2B product where each customer has their own fine-tune, this pattern is gold.
QLoRA — quantized LoRA
Dettmers et al. (2023). Combine LoRA with 4-bit quantization of the base model.
Base weights stored in 4-bit (NF4 quantization)
Forward pass: dequantize on the fly
LoRA adapter trained in higher precision (fp16/bf16)
Net effect: a 70B model fits in ~40GB instead of 140GB. Training fits on a single 48GB or 80GB GPU.
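The memory arithmetic behind that claim is simple. A sketch counting weight storage only (activations, quantization constants, and the adapter’s optimizer state add overhead on top):

```python
# Weight storage for 70B parameters at fp16 vs. 4-bit.
params = 70e9
fp16_gb = params * 2 / 1e9   # 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9  # 4 bits per parameter

print(fp16_gb, nf4_gb)  # 140.0 35.0, i.e. "~40GB" once overheads are included
```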
Quality cost: surprisingly small. NF4 is a careful 4-bit quantization that preserves most signal.
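Rough intuition for why 4-bit storage keeps most of the signal, using a naive absmax int4 quantizer on a normally distributed weight block. This is NOT the actual NF4 codebook, which places its 16 levels at normal quantiles and uses per-block scaling to do noticeably better:

```python
# Naive absmax 4-bit quantization of a normally distributed weight block.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096)    # a typical weight block

scale = np.abs(w).max() / 7              # map to the signed 4-bit range [-7, 7]
q = np.round(w / scale).astype(np.int8)  # 4-bit codes (packed two-per-byte in practice)
w_hat = q * scale                        # dequantized on the fly in the forward pass

rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"{rel_err:.3f}")  # modest even for this naive scheme; NF4 does better
```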
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
)
```
QLoRA enabled the open-source fine-tuning explosion of 2023–2024. It’s now the default for cost-sensitive fine-tuning.
DoRA, rsLoRA, and friends
Variants that improve on plain LoRA:
- DoRA (Liu et al. 2024): decompose updates into magnitude and direction. Better than LoRA at the same rank.
- rsLoRA (Kalajdzievski 2023): rescaled LoRA. Helps high-rank training.
- LoRA+: different learning rates for the A and B matrices.
- PiSSA: initializes the adapter from an SVD of the base weights.
- LoftQ: quantization-aware initialization for quantized base models.
- NOLA: re-parameterizes the adapter as a linear combination of frozen random bases.
PEFT library (HuggingFace) supports most. For most teams: stick with vanilla LoRA or DoRA; the differences are second-order.
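The magnitude/direction split that DoRA builds on can be sketched in a few lines of numpy (illustrative only; real DoRA also attaches a LoRA update to the direction part):

```python
# DoRA-style decomposition: each column of W = magnitude * unit direction.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))

m = np.linalg.norm(W, axis=0, keepdims=True)  # per-column magnitudes
V = W / m                                     # unit-norm directions

print(np.allclose(m * V, W))  # True: exact decomposition, adapted separately
```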
When to use LoRA vs full fine-tuning
| Use case | Pick |
|---|---|
| You have <8 GPUs | LoRA / QLoRA |
| Small dataset (<10k examples) | LoRA |
| Need to train multiple variants (per-customer) | LoRA (swap adapters) |
| Major behavior change desired | full fine-tune if budget allows |
| Continued pretraining on lots of new domain data | full fine-tune |
| Tight latency budget (no LoRA inference overhead) | merge LoRA back into base |
For 90% of “I want to improve the model on my task” use cases: LoRA works fine. The difference between LoRA and full fine-tuning is often within evaluation noise.
Practical tips
- Start with `r=8`, `alpha=16`. Tune up if quality plateaus.
- Adapt attention only. Add FFN if you need more capacity.
- Use a higher learning rate than for full fine-tuning; try 1e-4 to 5e-4 as a starting point.
- Use QLoRA if memory is tight; the quality cost is usually negligible.
- Merge for production unless you specifically need adapter-swapping.
- Save adapter weights (`adapter_model.bin`, often <100MB); much cheaper to store than full models.
A complete training example
```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

dataset = load_dataset("your_data", split="train")

config = SFTConfig(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    optim="paged_adamw_32bit",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=config,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./adapter")
```
That’s a runnable QLoRA fine-tune on a 24GB+ GPU.