LoRA & QLoRA
Fully fine-tuning a 70B model requires hundreds of GB of GPU memory. Most teams can’t do that. LoRA (Low-Rank Adaptation) and QLoRA make fine-tuning feasible on a single consumer GPU.
The idea behind LoRA
Hu et al. (2021). Instead of updating the full weight matrix W of size (d, d), freeze W and learn a low-rank update:
W_new = W + ΔW where ΔW = B · A
A is (r, d) and B is (d, r), with r << d (rank r typically 4–64). Total trained parameters: 2 · r · d instead of d².
For a 70B model, full fine-tuning means updating all 70B parameters. LoRA with r = 16 updates well under 0.1% of that: trainable params measured in millions, not billions.
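To make that concrete, a back-of-envelope count. The shapes below are illustrative assumptions (roughly Llama-70B-like: hidden size 8192, 80 layers, Q and V projections adapted in each layer), not figures from any model card:

```python
# LoRA trainable parameters vs. full fine-tuning, back of the envelope.
d, r, layers, adapted_per_layer = 8192, 16, 80, 2

# Each adapted matrix adds A (r, d) plus B (d, r) = 2 * r * d parameters.
lora_params = layers * adapted_per_layer * 2 * r * d
full_params = 70e9  # full fine-tuning updates everything

print(lora_params)               # 41,943,040: tens of millions
print(lora_params / full_params) # ~0.0006, well under 0.1%
```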
Why it works
The empirical claim, well-supported: the important changes during fine-tuning lie in a low-dimensional subspace. You don’t need to re-learn the model; you only need to nudge it.
Underlying intuition: pretrained models are already richly capable; fine-tuning adapts a small “direction” of weights. That direction can be approximated by a low-rank matrix.
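That intuition can be made concrete in a few lines of numpy: any rank-r matrix factors exactly into B · A, so 2 · r · d numbers suffice to represent a d × d update (illustrative sizes only):

```python
# A rank-r update factors exactly as B @ A; SVD recovers the factors.
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
delta_w = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))  # a rank-4 "update"

u, s, vt = np.linalg.svd(delta_w)
B = u[:, :r] * s[:r]  # (d, r), singular values folded into B
A = vt[:r, :]         # (r, d)

print(np.allclose(B @ A, delta_w))  # True: 2*r*d numbers reproduce the d*d update
```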
How it’s applied
LoRA targets specific weight matrices in the transformer. Most commonly:
- The attention Q and V projections.
- Sometimes also the K, O, and FFN matrices.
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
```
The base model’s weights stay frozen (no gradient flow); only A and B for each targeted matrix train.
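A minimal sketch of what “frozen” means in practice: a hand-rolled LoRA linear layer in PyTorch (illustrative, not PEFT’s actual implementation). Note that B starts at zero, so the adapter is a no-op before training:

```python
# Hand-rolled LoRA linear layer: frozen base weight, trainable A and B.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W: no gradient flow
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        # W x plus the scaled low-rank update (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64, bias=False))
out = layer(torch.randn(4, 64))
out.sum().backward()
print(layer.base.weight.grad)  # None: the frozen base receives no gradient
print(layer.A.grad is not None, layer.B.grad is not None)  # True True
```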
Hyperparameters
- rank `r`: 4–64. Higher = more capacity, more parameters. Default: 8 or 16.
- alpha: scaling factor for the LoRA update. Common: `alpha = 2 * r`. Effectively a learning-rate-style knob.
- target modules: which weights to adapt. Attention is the safe default; FFN can be added if quality plateaus.
- dropout: 0.0–0.1.
- bias: usually `"none"`. Adapting biases adds tiny benefit.
Inference modes
After training, you have two options:
1. Merge adapter into base weights
```python
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")
```
The result is a regular model file. No special inference code needed; works with any framework.
Cost: you lose the ability to swap adapters at runtime.
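A small numpy sketch of why merging is free at inference time: folding the scaled update into W produces identical outputs with a single matmul (illustrative sizes; `scale` stands in for alpha/r):

```python
# Adapter-at-runtime vs. merged weights: same outputs, fewer matmuls.
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 8
W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d))
B = rng.normal(size=(d, r))
scale = 2.0  # stands in for alpha / r
x = rng.normal(size=(d,))

adapter_out = W @ x + scale * (B @ (A @ x))  # base + adapter: extra matmuls
merged_out = (W + scale * B @ A) @ x         # merged: one matmul, no adapter code

print(np.allclose(adapter_out, merged_out))  # True
```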
2. Keep the adapter separate
Load base + adapter at inference time. Lets you serve multiple LoRAs from one base — switch behaviors per request without loading multiple full models.
Tools: vLLM has built-in LoRA serving; LoRAX / TGI / SGLang also support this.
For a B2B product where each customer has their own fine-tune, this pattern is gold.
QLoRA — quantized LoRA
Dettmers et al. (2023). Combine LoRA with 4-bit quantization of the base model.
Base weights stored in 4-bit (NF4 quantization)
Forward pass: dequantize on the fly
LoRA adapter trained in higher precision (fp16/bf16)
Net effect: a 70B model fits in ~40GB instead of 140GB. Training fits on a single 48GB or 80GB GPU.
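The memory arithmetic behind that claim is simple. A sketch counting weight storage only (activations, quantization constants, and the adapter’s optimizer state add overhead on top):

```python
# Weight storage for 70B parameters at fp16 vs. 4-bit.
params = 70e9
fp16_gb = params * 2 / 1e9   # 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9  # 4 bits per parameter

print(fp16_gb, nf4_gb)  # 140.0 35.0, i.e. "~40GB" once overheads are included
```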
Quality cost: surprisingly small. NF4 is a careful 4-bit quantization that preserves most signal.
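Rough intuition for why 4-bit storage keeps most of the signal, using a naive absmax int4 quantizer on a normally distributed weight block. This is NOT the actual NF4 codebook, which places its 16 levels at normal quantiles and uses per-block scaling to do noticeably better:

```python
# Naive absmax 4-bit quantization of a normally distributed weight block.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096)    # a typical weight block

scale = np.abs(w).max() / 7              # map to the signed 4-bit range [-7, 7]
q = np.round(w / scale).astype(np.int8)  # 4-bit codes (packed two-per-byte in practice)
w_hat = q * scale                        # dequantized on the fly in the forward pass

rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"{rel_err:.3f}")  # modest even for this naive scheme; NF4 does better
```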
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
)
```
QLoRA enabled the open-source fine-tuning explosion of 2023–2024. It’s now the default for cost-sensitive fine-tuning.
DoRA, rsLoRA, and friends
Variants that improve on plain LoRA:
- DoRA (Liu et al. 2024): decompose updates into magnitude and direction. Better than LoRA at the same rank.
- rsLoRA (Kalajdzievski 2023): rescaled LoRA. Helps high-rank training.
- LoRA+: different learning rates for the A and B matrices.
- PiSSA: initializes the adapter from an SVD of the base weights.
- LoftQ: quantization-aware initialization for quantized base models.
- NOLA: re-parameterizes the adapter as a linear combination of frozen random bases.
PEFT library (HuggingFace) supports most. For most teams: stick with vanilla LoRA or DoRA; the differences are second-order.
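The magnitude/direction split that DoRA builds on can be sketched in a few lines of numpy (illustrative only; real DoRA also attaches a LoRA update to the direction part):

```python
# DoRA-style decomposition: each column of W = magnitude * unit direction.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))

m = np.linalg.norm(W, axis=0, keepdims=True)  # per-column magnitudes
V = W / m                                     # unit-norm directions

print(np.allclose(m * V, W))  # True: exact decomposition, adapted separately
```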
When to use LoRA vs full fine-tuning
| Use case | Pick |
|---|---|
| You have <8 GPUs | LoRA / QLoRA |
| Small dataset (<10k examples) | LoRA |
| Need to train multiple variants (per-customer) | LoRA (swap adapters) |
| Major behavior change desired | full fine-tune if budget allows |
| Continued pretraining on lots of new domain data | full fine-tune |
| Tight latency budget (no LoRA inference overhead) | merge LoRA back into base |
For 90% of “I want to improve the model on my task” use cases: LoRA works fine. The difference between LoRA and full fine-tuning is often within evaluation noise.
Practical tips
- Start with `r=8`, `alpha=16`. Tune up if quality plateaus.
- Adapt attention only. Add FFN if you need more capacity.
- Use a higher learning rate than for full fine-tuning; try 1e-4 to 5e-4 as a starting point.
- Use QLoRA if memory is tight; the quality cost is usually negligible.
- Merge for production unless you specifically need adapter-swapping.
- Save adapter weights (`adapter_model.bin`, often <100MB); much cheaper to store than full models.
A complete training example
```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

dataset = load_dataset("your_data", split="train")

config = SFTConfig(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    optim="paged_adamw_32bit",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=config,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./adapter")
```
That’s a runnable QLoRA fine-tune on a 24GB+ GPU.