Embedding Fine-Tuning

Generic embedding models are great. They’re also trained to be average across all tasks — not specialized to yours. Fine-tuning an embedding model on your data is one of the highest-ROI improvements available for retrieval and RAG.

When to fine-tune embeddings

  • Your retrieval recall@10 is below 80% with a generic model.
  • Your domain has specialized vocabulary (legal, medical, code, internal product).
  • You have user behavior signals (clicks, dwell time) you can use as supervision.
  • Your queries and documents differ significantly in style (queries are short questions, docs are long technical text).

A typical domain fine-tune yields a 5–15 percentage-point improvement in recall@10.

The training signal

You need positive pairs: (query, relevant_passage). Negatives are optional but helpful.

Sources:

  • Real user data: search logs where users clicked → click is a positive signal.
  • Synthetic queries: have an LLM generate a question for each document.
  • Existing labels: if you have labeled (q, doc) pairs, use them directly.
  • Knowledge base structure: titles ↔ summaries, sections ↔ subsections.

Contrastive training

The classic loss: pull positives close, push negatives apart.

L = −log( exp(sim(q, p+)/τ) / Σ_k exp(sim(q, p_k)/τ) )

Where p+ is the positive and {p_k} includes the positive plus negatives.

In-batch negatives: every other example in the batch becomes a negative. Cheap and effective. Bigger batch = more (and harder) negatives.
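
A minimal PyTorch sketch of the in-batch-negative version of the loss above. The names and shapes are assumptions: q_emb and p_emb are L2-normalized query and passage embeddings of shape (batch, dim), where row i of p_emb is the positive for row i of q_emb.

import torch
import torch.nn.functional as F

def in_batch_info_nce(q_emb, p_emb, tau=0.05):
    # Similarity of every query to every passage in the batch, scaled by temperature.
    sim = q_emb @ p_emb.T / tau                          # (batch, batch)
    # The positive for query i sits on the diagonal; all other passages act as negatives.
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(sim, labels)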

Hard negatives: explicitly mine confusable-but-irrelevant docs. Often the difference between “okay” and “great” fine-tuning.

Hard negatives can be mined with the base model: retrieve top-100 for each query, label any non-positive as a hard negative.
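
A sketch of that mining step using sentence-transformers' semantic_search utility. The variables corpus (list of passages), queries (list of query strings), and positive_ids (per-query sets of known-relevant corpus indices) are assumed to exist.

from sentence_transformers import SentenceTransformer, util

base = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = base.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = base.encode(queries, convert_to_tensor=True, normalize_embeddings=True)

hard_negatives = {}
for i, hits in enumerate(util.semantic_search(query_emb, corpus_emb, top_k=100)):
    # Anything retrieved in the top 100 that isn't a known positive becomes a hard negative.
    hard_negatives[queries[i]] = [
        corpus[h["corpus_id"]] for h in hits if h["corpus_id"] not in positive_ids[i]
    ]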

Triplet loss

Older but still common:

L = max(0, d(q, p+) − d(q, p−) + margin)

You provide (query, positive, negative) triplets.
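
A minimal sketch using PyTorch's built-in triplet margin loss; the toy tensors stand in for query, positive, and negative embeddings from your encoder, and the margin value is just an example.

import torch

anchor = torch.randn(8, 384)   # query embeddings, (batch, dim)
pos    = torch.randn(8, 384)   # positive passage embeddings
neg    = torch.randn(8, 384)   # negative passage embeddings

triplet = torch.nn.TripletMarginLoss(margin=0.5, p=2)  # p=2 -> Euclidean distance d(., .)
loss = triplet(anchor, pos, neg)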

Cosine vs InfoNCE

  • Cosine embedding loss: uses cosine similarity directly. Simple.
  • InfoNCE: softmax over similarities. More principled; usually better with proper batch construction.

InfoNCE is the modern default.

Tools

sentence-transformers

The classic library:

from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# pairs: list of (query, positive_passage) tuples built from your training signal
train_examples = [
    InputExample(texts=[query, positive_passage]) for query, positive_passage in pairs
]
loader = DataLoader(train_examples, batch_size=16, shuffle=True)

# In-batch negatives: every other passage in the batch acts as a negative for each query
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=100)

MultipleNegativesRankingLoss is the in-batch-negative InfoNCE variant.
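
After training, saving and encoding work like any other SentenceTransformer; the output path below is just an example.

model.save("output/my-finetuned-embedder")

emb = model.encode(["how do I rotate an API key?"], normalize_embeddings=True)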

LLM-based embedding models

For decoder-based embedders (NV-Embed, GritLM, E5-Mistral, Qwen3-Embedding):

  • Same contrastive paradigm, but you usually use LoRA on top of the LLM (see the sketch after this list).
  • Larger and more capable starting points.
  • Tools: mteb for eval, custom scripts for training.
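
A rough sketch of attaching LoRA adapters to a decoder-based embedder with transformers + peft. The model name, target modules, and mean pooling are all assumptions (several of these models actually use last-token pooling); the training loop itself would use the same InfoNCE loss as above, updating only the adapter weights.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

name = "intfloat/e5-mistral-7b-instruct"          # example checkpoint; substitute yours
tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
backbone = AutoModel.from_pretrained(name, torch_dtype=torch.bfloat16)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])  # assumed targets
model = get_peft_model(backbone, lora)

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)    # mean pooling over non-padding tokens
    return F.normalize(pooled, dim=-1)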

Voyage / Cohere fine-tuning APIs

Some commercial embedders support fine-tuning via their APIs. Easy if you don’t want to manage GPUs.

Hyperparameters

  • Batch size: as large as fits — 32 to 128 typically. Bigger = harder negatives.
  • Learning rate: 2e-5 to 5e-5 for full FT; 1e-4 to 3e-4 for LoRA.
  • Temperature τ (in InfoNCE): 0.05 to 0.1 typical.
  • Epochs: 1–5.
  • Warmup: 5–10%.
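
Mapped onto the sentence-transformers fit() call from above (the specific values are just one point in the typical ranges; the temperature lives in the loss, where MultipleNegativesRankingLoss exposes it as scale = 1/τ, default 20).

loader = DataLoader(train_examples, batch_size=64, shuffle=True)  # as large as memory allows
num_steps = len(loader) * 3                                       # 3 epochs
model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=int(0.1 * num_steps),                            # ~10% warmup
    optimizer_params={"lr": 2e-5},                                # full FT; ~1e-4 to 3e-4 for LoRA
)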

Mining hard negatives

Iterate:

  1. Train v0 with random / in-batch negatives.
  2. Use v0 to retrieve top-100 for each query.
  3. Treat top-K (excluding positives) as hard negatives.
  4. Train v1 with those hard negatives.
  5. Repeat.

Each round typically yields a smaller boost; 1–2 rounds are usually enough.

Synthetic query generation

If you don’t have real query data, generate it:

pairs = []
for chunk in corpus:
    # llm() is a placeholder for whatever LLM client you use; assume it returns raw text.
    response = llm(f"""
    Write 3 different questions a user might ask that would be answered by this passage:

    {chunk}
    """)
    # Assume one question per line in the response.
    questions = [line.strip() for line in response.splitlines() if line.strip()]
    for q in questions:
        pairs.append((q, chunk))

Use a strong LLM. Diversify question styles (short, long, multi-hop, paraphrase). Validate quality manually on a sample.

Multi-task fine-tuning

Train on multiple objectives at once:

  • (query, passage) for retrieval.
  • (passage, summary) for summary embedding.
  • (passage_a, passage_b) where similar paraphrases pair together.

Often produces a more general-purpose model.
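
In sentence-transformers, this can be expressed as multiple train_objectives, one (DataLoader, loss) pair per task; fit() round-robins across them. The loaders below are assumed to be built the same way as the retrieval loader above.

model.fit(
    train_objectives=[
        (retrieval_loader,  losses.MultipleNegativesRankingLoss(model)),
        (summary_loader,    losses.MultipleNegativesRankingLoss(model)),
        (paraphrase_loader, losses.MultipleNegativesRankingLoss(model)),
    ],
    epochs=1,
    warmup_steps=100,
)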

Eval matters more than training

Build a golden retrieval eval set before training. Run it on the base model. Train. Run again. Compare. Iterate.

If you can’t measure improvement, you can’t get it.
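
One way to wire that up is sentence-transformers' InformationRetrievalEvaluator; the queries, corpus, and relevant_docs dicts are assumed to come from your golden set, and base_model / finetuned_model are the two checkpoints being compared.

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# queries:       {query_id: query_text}
# corpus:        {doc_id: passage_text}
# relevant_docs: {query_id: set of relevant doc_ids}
ir_eval = InformationRetrievalEvaluator(queries, corpus, relevant_docs,
                                        precision_recall_at_k=[10], name="golden-set")

print(ir_eval(base_model))        # baseline numbers before training
print(ir_eval(finetuned_model))   # same eval after training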

Common pitfalls

  • Too few negatives → easy task → small gains.
  • Negatives are too easy → model can’t learn what’s hard.
  • Synthetic queries don’t match real distribution → improves on synthetic eval, fails in production.
  • Forgetting to evaluate on out-of-domain queries → model overfits to your training distribution.
  • Bigger model isn’t always better → pick the smallest that meets your quality bar.

Cost-benefit check

Before fine-tuning embeddings, try cheaper options:

  1. Try a stronger general embedder (e.g. Voyage-3-large vs MiniLM).
  2. Try a domain-specialized off-the-shelf model.
  3. Add hybrid search + reranking.
  4. Improve chunking and metadata.

If you’ve done these and recall is still inadequate, fine-tuning is justified.

See also