Embedding Fine-Tuning
Generic embedding models are great. They’re also trained to be average across all tasks — not specialized to yours. Fine-tuning an embedding model on your data is one of the highest-ROI improvements available for retrieval and RAG.
When to fine-tune embeddings
- Your retrieval recall@10 is below 80% with a generic model.
- Your domain has specialized vocabulary (legal, medical, code, internal product).
- You have user behavior signals (clicks, dwell time) you can use as supervision.
- Your queries and documents differ significantly in style (queries are short questions, docs are long technical text).
A typical domain fine-tune: 5–15 percentage points improvement on recall@10.
The training signal
You need positive pairs: (query, relevant_passage). Optionally negatives.
Sources:
- Real user data: search logs where users clicked → click is a positive signal (see the sketch after this list).
- Synthetic queries: have an LLM generate a question for each document.
- Existing labels: if you have labeled (q, doc) pairs, use them directly.
- Knowledge base structure: titles ↔ summaries, sections ↔ subsections.
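For the click-log case, a minimal sketch of turning logs into training pairs; `click_logs` and its field names are assumptions about your own logging format:

# `click_logs` is an assumed list of {"query": ..., "clicked_passage": ...} records.
pairs = [(log["query"], log["clicked_passage"]) for log in click_logs]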
Contrastive training
The classic loss: pull positives close, push negatives apart.
L = −log( exp(sim(q, p+)/τ) / Σ_k exp(sim(q, p_k)/τ) )
Where p+ is the positive and {p_k} includes the positive plus negatives.
In-batch negatives: every other example in the batch becomes a negative. Cheap and effective. Bigger batch = more (and harder) negatives.
Hard negatives: explicitly mine confusable-but-irrelevant docs. Often the difference between “okay” and “great” fine-tuning.
Hard negatives can be mined with the base model: retrieve top-100 for each query, label any non-positive as a hard negative.
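To make the in-batch-negative setup concrete, here is a minimal PyTorch sketch of the loss above; `query_emb` and `passage_emb` are assumed to be L2-normalized (batch, dim) embeddings from your encoder, with row i of each forming a positive pair:

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    # Row i of passage_emb is the positive for row i of query_emb;
    # every other row in the batch acts as an in-batch negative.
    sims = query_emb @ passage_emb.T / temperature            # (B, B) similarity matrix
    labels = torch.arange(sims.size(0), device=sims.device)   # positives sit on the diagonal
    return F.cross_entropy(sims, labels)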
Triplet loss
Older but still common:
L = max(0, d(q, p+) − d(q, p−) + margin)
You provide (query, positive, negative) triplets.
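PyTorch ships this loss directly; a minimal sketch, assuming `q_emb`, `pos_emb`, and `neg_emb` are (batch, dim) embeddings produced by your encoder:

import torch

# q_emb, pos_emb, neg_emb are assumed encoder outputs for (query, positive, negative).
triplet_loss_fn = torch.nn.TripletMarginLoss(margin=0.5, p=2)  # p=2 means Euclidean distance d(·)
loss = triplet_loss_fn(q_emb, pos_emb, neg_emb)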
Cosine vs InfoNCE
- Cosine embedding loss: uses cosine similarity directly. Simple.
- InfoNCE: softmax over similarities. More principled; usually better with proper batch construction.
InfoNCE is the modern default.
Tools
sentence-transformers
The classic library:
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# `pairs` is your list of (query, positive_passage) tuples.
train_examples = [
    InputExample(texts=[query, positive_passage])
    for query, positive_passage in pairs
]
loader = DataLoader(train_examples, batch_size=16, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=100)
MultipleNegativesRankingLoss is the in-batch-negative InfoNCE variant.
LLM-based embedding models
For decoder-based embedders (NV-Embed, GritLM, E5-Mistral, Qwen3-Embedding):
- Same contrastive paradigm, but you usually use LoRA on top of the LLM (see the sketch after this list).
- Larger and more capable starting points.
- Tools: mteb for eval, custom scripts for training.
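A minimal sketch of attaching LoRA adapters to a decoder-style embedder with Hugging Face peft; the model name and target modules here are illustrative assumptions, and the contrastive objective on top is the same as before:

from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative base model; swap in whichever decoder embedder you actually use.
base = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct")
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")

lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; names are model-dependent
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # sanity check: only the LoRA weights should train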
Voyage / Cohere fine-tuning APIs
Some commercial embedders support fine-tuning via their APIs. Easy if you don’t want to manage GPUs.
Hyperparameters
- Batch size: as large as fits — 32 to 128 typically. Bigger = harder negatives.
- Learning rate: 2e-5 to 5e-5 for full FT; 1e-4 to 3e-4 for LoRA.
- Temperature τ (in InfoNCE): 0.05 to 0.1 typical.
- Epochs: 1–5.
- Warmup: 5–10%.
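Mapped onto the sentence-transformers call from above (reusing `model` and `train_examples`), a sketch with these defaults:

loader = DataLoader(train_examples, batch_size=64, shuffle=True)   # as large as memory allows
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)      # scale is 1/τ, so 20 ≈ τ of 0.05
total_steps = len(loader) * 3
model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=int(0.1 * total_steps),                           # ~10% warmup
    optimizer_params={"lr": 2e-5},                                 # full fine-tune; higher for LoRA
)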
Mining hard negatives
Iterate:
- Train v0 with random / in-batch negatives.
- Use v0 to retrieve top-100 for each query.
- Treat top-K (excluding positives) as hard negatives.
- Train v1 with those hard negatives.
- Repeat.
Each round typically yields a smaller boost. 1–2 rounds usually enough.
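A sketch of one mining round with sentence-transformers, assuming `model` is the current checkpoint, `queries` and `corpus` are lists of strings, and `positives[qid]` holds the known relevant corpus indices for each query:

import numpy as np

corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(queries, normalize_embeddings=True)

hard_negatives = {}
for qid, q_emb in enumerate(query_emb):
    scores = corpus_emb @ q_emb                    # cosine similarity (embeddings are normalized)
    top100 = np.argsort(-scores)[:100]             # retrieve top-100 with the current model
    # Highly ranked but not labeled relevant -> hard negative.
    hard_negatives[qid] = [int(i) for i in top100 if i not in positives[qid]][:10]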
Synthetic query generation
If you don’t have real query data, generate it:
pairs = []
for chunk in corpus:
    # `llm` is a placeholder for your LLM client; it is assumed to return a
    # list of generated questions for the prompt.
    questions = llm(f"""
Write 3 different questions a user might ask that would be answered by this passage:

{chunk}
""")
    for q in questions:
        pairs.append((q, chunk))
Use a strong LLM. Diversify question styles (short, long, multi-hop, paraphrase). Validate quality manually on a sample.
Multi-task fine-tuning
Train on multiple objectives at once:
- (query, passage) for retrieval.
- (passage, summary) for summary embedding.
- (passage_a, passage_b) where similar paraphrases pair together.
Often produces a more general-purpose model.
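sentence-transformers supports this directly: pass one (DataLoader, loss) pair per objective and model.fit alternates between them. A sketch, assuming retrieval_loader, summary_loader, and paraphrase_loader are built from InputExample pairs as in the earlier snippet:

loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[
        (retrieval_loader, loss),    # (query, passage)
        (summary_loader, loss),      # (passage, summary)
        (paraphrase_loader, loss),   # (passage_a, passage_b)
    ],
    epochs=1,
    warmup_steps=100,
)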
Eval matters more than training
Build a golden retrieval eval set before training. Run it on the base model. Train. Run again. Compare. Iterate.
If you can’t measure improvement, you can’t get it.
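sentence-transformers ships an evaluator that reports recall@k (plus MRR and NDCG) over such a golden set; a minimal sketch, where the ids and strings are placeholders for your own labeled data:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# query id -> text, doc id -> text, and the relevant doc ids per query.
queries = {"q1": "how do I rotate an API key?"}
corpus = {"d1": "To rotate an API key, open Settings > Security and ...", "d2": "Unrelated passage."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="golden_set")
print(evaluator(model))   # run on the base model first, then again after each fine-tune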
Common pitfalls
- Too few negatives → easy task → small gains.
- Negatives are too easy → model can’t learn what’s hard.
- Synthetic queries don’t match real distribution → improves on synthetic eval, fails in production.
- Forgetting to evaluate on out-of-domain queries → model overfits to your training distribution.
- Bigger model isn’t always better → pick the smallest that meets your quality bar.
Cost-benefit check
Before fine-tuning embeddings, try cheaper options:
- Try a stronger general embedder (e.g. Voyage-3-large vs MiniLM).
- Try a domain-specialized off-the-shelf model.
- Add hybrid search + reranking.
- Improve chunking and metadata.
If you’ve done these and recall is still inadequate, fine-tuning is justified.