Embedding Fine-Tuning
Generic embedding models are great. They’re also trained to be average across all tasks — not specialized to yours. Fine-tuning an embedding model on your data is one of the highest-ROI improvements available for retrieval and RAG.
When to fine-tune embeddings
- Your retrieval recall@10 is below 80% with a generic model.
- Your domain has specialized vocabulary (legal, medical, code, internal product).
- You have user behavior signals (clicks, dwell time) you can use as supervision.
- Your queries and documents differ significantly in style (queries are short questions, docs are long technical text).
A typical domain fine-tune: 5–15 percentage points improvement on recall@10.
The training signal
You need positive pairs: (query, relevant_passage). Optionally negatives.
Sources:
- Real user data: search logs where users clicked → click is a positive signal (see the sketch after this list).
- Synthetic queries: have an LLM generate a question for each document.
- Existing labels: if you have labeled (q, doc) pairs, use them directly.
- Knowledge base structure: titles ↔ summaries, sections ↔ subsections.
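For the click-log case, a minimal sketch of turning logs into training pairs; `click_logs` and its field names are assumptions about your own logging format:

# `click_logs` is an assumed list of {"query": ..., "clicked_passage": ...} records.
pairs = [(log["query"], log["clicked_passage"]) for log in click_logs]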
Contrastive training
The classic loss: pull positives close, push negatives apart.
L = −log( exp(sim(q, p+)/τ) / Σ_k exp(sim(q, p_k)/τ) )
Where p+ is the positive and {p_k} includes the positive plus negatives.
In-batch negatives: every other example in the batch becomes a negative. Cheap and effective. Bigger batch = more (and harder) negatives.
Hard negatives: explicitly mine confusable-but-irrelevant docs. Often the difference between “okay” and “great” fine-tuning.
Hard negatives can be mined with the base model: retrieve top-100 for each query, label any non-positive as a hard negative.
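To make the in-batch-negative setup concrete, here is a minimal PyTorch sketch of the loss above; `query_emb` and `passage_emb` are assumed to be L2-normalized (batch, dim) embeddings from your encoder, with row i of each forming a positive pair:

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    # Row i of passage_emb is the positive for row i of query_emb;
    # every other row in the batch acts as an in-batch negative.
    sims = query_emb @ passage_emb.T / temperature            # (B, B) similarity matrix
    labels = torch.arange(sims.size(0), device=sims.device)   # positives sit on the diagonal
    return F.cross_entropy(sims, labels)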
Triplet loss
Older but still common:
L = max(0, d(q, p+) − d(q, p−) + margin)
You provide (query, positive, negative) triplets.
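PyTorch ships this loss directly; a minimal sketch, assuming `q_emb`, `pos_emb`, and `neg_emb` are (batch, dim) embeddings produced by your encoder:

import torch

# q_emb, pos_emb, neg_emb are assumed encoder outputs for (query, positive, negative).
triplet_loss_fn = torch.nn.TripletMarginLoss(margin=0.5, p=2)  # p=2 means Euclidean distance d(·)
loss = triplet_loss_fn(q_emb, pos_emb, neg_emb)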
Cosine vs InfoNCE
- Cosine embedding loss: uses cosine similarity directly. Simple.
- InfoNCE: softmax over similarities. More principled; usually better with proper batch construction.
InfoNCE is the modern default.
Tools
sentence-transformers
The classic library:
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# `pairs` is your list of (query, positive_passage) tuples.
train_examples = [
    InputExample(texts=[query, positive_passage])
    for query, positive_passage in pairs
]
loader = DataLoader(train_examples, batch_size=16, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=100)
MultipleNegativesRankingLoss is the in-batch-negative InfoNCE variant.
LLM-based embedding models
For decoder-based embedders (NV-Embed, GritLM, E5-Mistral, Qwen3-Embedding):
- Same contrastive paradigm, but you usually use LoRA on top of the LLM (see the sketch after this list).
- Larger and more capable starting points.
- Tools: mteb for eval, custom scripts for training.
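A minimal sketch of attaching LoRA adapters to a decoder-style embedder with Hugging Face peft; the model name and target modules here are illustrative assumptions, and the contrastive objective on top is the same as before:

from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative base model; swap in whichever decoder embedder you actually use.
base = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct")
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")

lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; names are model-dependent
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # sanity check: only the LoRA weights should train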
Voyage / Cohere fine-tuning APIs
Some commercial embedders support fine-tuning via their APIs. Easy if you don’t want to manage GPUs.
Hyperparameters
- Batch size: as large as fits — 32 to 128 typically. Bigger = harder negatives.
- Learning rate: 2e-5 to 5e-5 for full FT; 1e-4 to 3e-4 for LoRA.
- Temperature τ (in InfoNCE): 0.05 to 0.1 typical.
- Epochs: 1–5.
- Warmup: 5–10%.
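Mapped onto the sentence-transformers call from above (reusing `model` and `train_examples`), a sketch with these defaults:

loader = DataLoader(train_examples, batch_size=64, shuffle=True)   # as large as memory allows
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)      # scale is 1/τ, so 20 ≈ τ of 0.05
total_steps = len(loader) * 3
model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=int(0.1 * total_steps),                           # ~10% warmup
    optimizer_params={"lr": 2e-5},                                 # full fine-tune; higher for LoRA
)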
Mining hard negatives
Iterate:
- Train v0 with random / in-batch negatives.
- Use v0 to retrieve top-100 for each query.
- Treat top-K (excluding positives) as hard negatives.
- Train v1 with those hard negatives.
- Repeat.
Each round typically yields a smaller boost. 1–2 rounds usually enough.
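A sketch of one mining round with sentence-transformers, assuming `model` is the current checkpoint, `queries` and `corpus` are lists of strings, and `positives[qid]` holds the known relevant corpus indices for each query:

import numpy as np

corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(queries, normalize_embeddings=True)

hard_negatives = {}
for qid, q_emb in enumerate(query_emb):
    scores = corpus_emb @ q_emb                    # cosine similarity (embeddings are normalized)
    top100 = np.argsort(-scores)[:100]             # retrieve top-100 with the current model
    # Highly ranked but not labeled relevant -> hard negative.
    hard_negatives[qid] = [int(i) for i in top100 if i not in positives[qid]][:10]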
Synthetic query generation
If you don’t have real query data, generate it:
pairs = []
for chunk in corpus:
    # `llm` is a placeholder for your LLM client; it is assumed to return a
    # list of generated questions for the prompt.
    questions = llm(f"""
Write 3 different questions a user might ask that would be answered by this passage:

{chunk}
""")
    for q in questions:
        pairs.append((q, chunk))
Use a strong LLM. Diversify question styles (short, long, multi-hop, paraphrase). Validate quality manually on a sample.
Multi-task fine-tuning
Train on multiple objectives at once:
- (query, passage) for retrieval.
- (passage, summary) for summary embedding.
- (passage_a, passage_b) where similar paraphrases pair together.
Often produces a more general-purpose model.
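sentence-transformers supports this directly: pass one (DataLoader, loss) pair per objective and model.fit alternates between them. A sketch, assuming retrieval_loader, summary_loader, and paraphrase_loader are built from InputExample pairs as in the earlier snippet:

loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
    train_objectives=[
        (retrieval_loader, loss),    # (query, passage)
        (summary_loader, loss),      # (passage, summary)
        (paraphrase_loader, loss),   # (passage_a, passage_b)
    ],
    epochs=1,
    warmup_steps=100,
)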
Eval matters more than training
Build a golden retrieval eval set before training. Run it on the base model. Train. Run again. Compare. Iterate.
If you can’t measure improvement, you can’t get it.
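sentence-transformers ships an evaluator that reports recall@k (plus MRR and NDCG) over such a golden set; a minimal sketch, where the ids and strings are placeholders for your own labeled data:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# query id -> text, doc id -> text, and the relevant doc ids per query.
queries = {"q1": "how do I rotate an API key?"}
corpus = {"d1": "To rotate an API key, open Settings > Security and ...", "d2": "Unrelated passage."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="golden_set")
print(evaluator(model))   # run on the base model first, then again after each fine-tune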
Common pitfalls
- Too few negatives → easy task → small gains.
- Negatives are too easy → model can’t learn what’s hard.
- Synthetic queries don’t match real distribution → improves on synthetic eval, fails in production.
- Forgetting to evaluate on out-of-domain queries → model overfits to your training distribution.
- Bigger model isn’t always better → pick the smallest that meets your quality bar.
Cost-benefit check
Before fine-tuning embeddings, try cheaper options:
- Try a stronger general embedder (e.g. Voyage-3-large vs MiniLM).
- Try a domain-specialized off-the-shelf model.
- Add hybrid search + reranking.
- Improve chunking and metadata.
If you’ve done these and recall is still inadequate, fine-tuning is justified.