Choosing an embedding model for RAG
You picked an embedding model off a leaderboard and shipped. Now retrieval is bad.
You picked the top embedding model from the MTEB leaderboard, shipped it, and retrieval quality is worse than your old model. You swap to the second-highest and it is slightly different but still not what the benchmark implied. You are left wondering why the "best" model is not actually the best for your corpus.
The answer: benchmarks test on general-purpose text, your corpus is probably domain-specific, and the top model on MTEB might be trained on a distribution that looks nothing like your documentation, your code, or your product. Picking an embedding model by raw leaderboard score is the most common mistake in production RAG.
This post covers the 5 criteria that actually matter, the MTEB traps to avoid, a cheap way to compare models on your corpus, and the migration cost nobody warns you about until you are already committed.
Why don't MTEB scores predict production quality?
Because MTEB averages performance across dozens of tasks and domains. A model that is excellent on general web text might be mediocre on your legal documents or your codebase. The leaderboard score is the average; your workload is a single point in the distribution.
3 ways MTEB misleads:
- Average vs specific. The top model on MTEB might score 70 on scientific retrieval and 40 on code. If you are indexing code, the second-ranked model with 55 on scientific and 60 on code is a better choice.
- English-only. Most MTEB runs are on English. A multilingual corpus needs a different model, and the English leaderboard tells you nothing useful.
- Query distribution mismatch. Benchmarks test with query types (short keyword, long question, paragraph) that may not match your users' queries. Model A might beat model B on long-form but lose on short.
You have to test on your own data. Not because benchmarks are bad, but because "your own data" is rarely the distribution a general benchmark covers.
```mermaid
graph TD
    Choice[Pick embedding model] -->|wrong way| Leaderboard[Top MTEB score]
    Leaderboard --> BadRAG[Mediocre retrieval]
    Choice -->|right way| Your[Your eval set]
    Your --> Test[Test 4 models side by side]
    Test --> Pick[Pick best on YOUR data]
    Pick --> GoodRAG[Great retrieval]
    style BadRAG fill:#fee2e2,stroke:#b91c1c
    style GoodRAG fill:#dcfce7,stroke:#15803d
```
The benchmark narrows the candidate pool. Your eval set picks the winner.
What are the 5 criteria that actually matter?
- Retrieval accuracy on your corpus. The primary metric. Measure it with a small labeled eval set (50 to 100 questions with known relevant documents). Everything else is secondary.
- Embedding dimensionality. Higher dimensions (1024, 1536, 3072) capture more nuance but cost more to store and search. Lower dimensions (384, 768) are cheaper and often good enough. The right choice depends on your vector store size and query latency budget.
- Max sequence length. Short-context models (512 tokens) truncate long chunks silently. Long-context models (8192+ tokens) handle big chunks but cost more per call. Match the max sequence length to your chunk size.
- Cost per million tokens. Hosted models (OpenAI, Voyage, Cohere) charge per token at both index time and query time. Self-hosted models (BGE, mxbai, E5) are free at inference but need a GPU or a high-memory CPU. Do the math on your expected query volume before committing.
- License and data residency. Some models are open-weight (MIT, Apache), some are gated (custom license requiring agreement), and some are only available via API. In regulated industries, the license can rule out a model entirely, regardless of quality.
Skip any one of these and you pick the wrong model. Apply all 5 and the candidate pool shrinks to 2 or 3 models you can test seriously.
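To make the dimensionality criterion concrete, here is a rough storage calculation. This is a sketch: it counts raw float32 vectors only, and real indexes (HNSW graphs, metadata) add overhead on top.

```python
def index_size_gb(num_docs: int, dim: int, bytes_per_float: int = 4) -> float:
    """Raw vector storage for num_docs float32 embeddings of a given dimension.
    Ignores index overhead; HNSW graphs and metadata add more on top."""
    return num_docs * dim * bytes_per_float / 1e9

# 10 million chunks at common embedding dimensionalities
for dim in (384, 768, 1536, 3072):
    print(f'{dim:>5} dims: {index_size_gb(10_000_000, dim):.1f} GB of raw vectors')
```

At 10 million chunks, moving from 384 to 3072 dimensions multiplies raw storage from about 15 GB to about 123 GB, and query-time distance computations scale the same way.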
How do you test embedding models on your own corpus?
Build a small labeled set of 50 to 100 question-document pairs from your real corpus. For each model under consideration, embed the corpus, run the queries, and measure recall@5 (the fraction of queries where the correct document is in the top 5 results).
```python
# filename: embedding_bakeoff.py
# description: Compare embedding models by recall@5 on a labeled eval set.
# Swap model names to add candidates.
from sentence_transformers import SentenceTransformer
import numpy as np

EVAL_SET = [
    {'query': 'how does auth middleware validate tokens',
     'correct_doc_id': 'auth_overview.md'},
    # ... 50 more pairs
]

def recall_at_k(model_name: str, corpus: dict[str, str], eval_set: list, k: int = 5) -> float:
    model = SentenceTransformer(model_name)
    doc_ids = list(corpus.keys())
    # Normalized embeddings make the dot product below a cosine similarity.
    doc_vecs = model.encode([corpus[d] for d in doc_ids], normalize_embeddings=True)
    hits = 0
    for item in eval_set:
        qv = model.encode(item['query'], normalize_embeddings=True)
        sims = doc_vecs @ qv
        top_ids = [doc_ids[i] for i in np.argsort(sims)[-k:]]
        if item['correct_doc_id'] in top_ids:
            hits += 1
    return hits / len(eval_set)

MODELS = [
    'BAAI/bge-large-en-v1.5',
    'intfloat/e5-large-v2',  # E5 expects 'query: '/'passage: ' prefixes for best results
    'mixedbread-ai/mxbai-embed-large-v1',
    'sentence-transformers/all-mpnet-base-v2',
]

# corpus_dict: {doc_id: document text}, built from your own corpus.
for m in MODELS:
    score = recall_at_k(m, corpus_dict, EVAL_SET)
    print(f'{m}: {score:.3f}')
```
Run this once per candidate model. The best model on your data wins, regardless of its MTEB rank. Expect the spread to be 5 to 15 points between the best and worst candidates, which is much larger than most teams assume.
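The script above assumes a `corpus_dict` mapping doc IDs to document text. A minimal way to build one from a folder of markdown files (the `docs/` path and `*.md` pattern are illustrative assumptions; adapt to however your corpus is stored):

```python
from pathlib import Path

def load_corpus(root: str, pattern: str = '*.md') -> dict[str, str]:
    """Map each file's name (matching the eval set's correct_doc_id) to its text."""
    return {p.name: p.read_text(encoding='utf-8')
            for p in Path(root).rglob(pattern)}

corpus_dict = load_corpus('docs/')
```

File names double as doc IDs here, which matches the `auth_overview.md` style used in the eval set; if your IDs differ, key the dict accordingly.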
Which embedding model families are worth testing?
5 families cover 95 percent of production RAG use cases. Three of them are open-weight and self-hostable; two are API-only.
- BGE (BAAI General Embedding). Strong across most tasks, multiple sizes (small/base/large), open-weight. A safe default for English and multilingual workloads.
- E5 (Microsoft). Similar positioning to BGE. Strong on long-context queries and question-answer retrieval.
- mxbai-embed-large. A newer entry with strong MTEB performance and a permissive license. Worth trying alongside BGE.
- OpenAI text-embedding-3. API-only but good quality and 3 sizes. Reasonable if you are already a heavy OpenAI customer and do not want to self-host.
- Voyage AI. Specialized for code and domain-specific corpora. Worth testing if you index source code or legal text; often beats general-purpose models on those domains.
Skip old favorites unless you have a specific reason: all-MiniLM-L6-v2 is fast but outclassed, all-mpnet-base-v2 is a decent baseline but the current generation is 5 to 10 points better on most tasks.
For the chunking side of retrieval quality, see the RecursiveCharacterTextSplitter Deep Dive post. For the broader RAG stack, the free RAG Fundamentals primer is the right starting point.
What is the embedding migration cost?
Re-indexing the entire corpus. Every stored embedding has to be regenerated because vectors from different models are not interchangeable: you cannot compare a query embedded with the new model against documents embedded with the old one. For a 10-million-document corpus, that is a multi-hour batch job at best and a multi-day one at worst, depending on model throughput and available compute.
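Before committing, a back-of-envelope estimate of the re-embedding job helps. The price and throughput figures below are placeholders, not quotes from any provider; plug in your own numbers.

```python
def reembed_estimate(num_docs: int, avg_tokens_per_doc: int,
                     price_per_mtok: float, docs_per_sec: float) -> tuple[float, float]:
    """Return (dollar cost for a hosted model, wall-clock hours at the given throughput)."""
    total_tokens = num_docs * avg_tokens_per_doc
    cost = total_tokens / 1e6 * price_per_mtok
    hours = num_docs / docs_per_sec / 3600
    return cost, hours

# hypothetical: 10M docs, 400 tokens each, $0.02/Mtok, 500 docs/sec
cost, hours = reembed_estimate(10_000_000, 400, 0.02, 500)
print(f'~${cost:,.0f} in embedding fees, ~{hours:.1f} hours of throughput')
```

Throughput is usually the binding constraint for self-hosted models and cost for hosted ones; estimate both before picking a migration window.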
3 hidden costs people forget:
- Index downtime or dual-write. While you are building the new index, you are either running the old one (old quality) or maintaining both (double cost). Pick your strategy before you start.
- Token cost of re-embedding. Hosted models charge per million tokens on re-indexing. For a large corpus, this can be hundreds of dollars.
- Re-tuning of reranking and thresholds. New embeddings produce different similarity score distributions. Your cosine threshold of 0.75 that worked for the old model might be wrong for the new one. Re-tune empirically.
The rule I use: never migrate embeddings unless the new model gives at least a 5-point recall@5 improvement on your eval set. Below that, the engineering cost outweighs the quality gain.
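One empirical way to re-tune a cosine threshold after a migration is percentile matching: score the same query-document sample with both models, find where the old threshold sits in the old score distribution, and read off the same percentile in the new one. A sketch, assuming you have paired score samples from both models:

```python
import numpy as np

def recalibrate_threshold(old_scores, new_scores, old_threshold: float) -> float:
    """Map a similarity threshold from the old model's score scale to the new one
    by matching its percentile in scores collected on the same query sample."""
    old_sorted = np.sort(np.asarray(old_scores))
    pct = np.searchsorted(old_sorted, old_threshold) / len(old_sorted)
    return float(np.quantile(np.asarray(new_scores), pct))

# toy distributions: the new model scores systematically lower
old = [0.5, 0.6, 0.7, 0.8, 0.9]
new = [0.1, 0.2, 0.3, 0.4, 0.5]
print(recalibrate_threshold(old, new, 0.75))  # the old 0.75 maps to roughly 0.34
```

Treat the result as a starting point and validate against your eval set; percentile matching assumes the two models rank the sample similarly.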
When should you fine-tune an embedding model?
Rarely. Only when off-the-shelf models score below 50 on your eval set and you have thousands of labeled query-document pairs available. Without both conditions, fine-tuning is expensive and risks overfitting.
When fine-tuning is the right call, use sentence-transformers with contrastive learning on your labeled pairs. Train for a few epochs on a GPU, evaluate on a held-out set, and compare to the base model. Expect 5 to 10 points of lift if the corpus is truly domain-specific (legal, medical, code) and less than 5 points otherwise.
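To show what contrastive fine-tuning actually optimizes, here is the in-batch-negatives objective (the idea behind sentence-transformers' MultipleNegativesRankingLoss) sketched in plain numpy. This illustrates the loss only, not training code, and the 0.05 temperature is a common but arbitrary choice.

```python
import numpy as np

def in_batch_negatives_loss(queries: np.ndarray, docs: np.ndarray,
                            temperature: float = 0.05) -> float:
    """Each query's paired doc is its positive; every other doc in the batch
    is a negative. Training pushes paired cosine similarities above the rest."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature              # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-np.log(np.diag(probs)).mean())  # cross-entropy on the diagonal
```

Perfectly aligned pairs drive the loss toward zero; mismatched pairs drive it up, which is exactly the gradient signal that pulls query and document embeddings together during fine-tuning.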
For most teams, swapping to a better pretrained model is a bigger win than fine-tuning the old one. Fine-tuning is the last lever, not the first.
What to do Monday morning
- Pull 50 question-document pairs from your real users. Hand-label each with the correct source document. This is the eval set you will use for every future decision.
- Run the bakeoff script from this post against 4 candidate models: BGE, E5, mxbai, and whatever you are using now. Measure recall@5 on your eval set.
- If the best candidate beats your current model by more than 5 points, plan the migration. Anything less is not worth the engineering cost.
- Match your chunk size to the chosen model's max sequence length. Undersized chunks are wasteful; oversized chunks are silently truncated.
- After migration, re-tune your cosine thresholds and reranker weights on the new score distribution. Old thresholds will be wrong because different models produce different score scales.
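The chunk-size check can be automated with a rough screen. The 1.3 tokens-per-word ratio below is a heuristic assumption for English text; use the chosen model's actual tokenizer for exact counts.

```python
def flag_oversized_chunks(chunks: list[str], max_seq_len: int,
                          tokens_per_word: float = 1.3) -> list[tuple[int, int]]:
    """Return (chunk_index, estimated_tokens) for chunks a short-context
    model would silently truncate. Word count * ratio is a rough estimate."""
    flagged = []
    for i, chunk in enumerate(chunks):
        est_tokens = int(len(chunk.split()) * tokens_per_word)
        if est_tokens > max_seq_len:
            flagged.append((i, est_tokens))
    return flagged
```

Run it over your chunk store once per candidate model; a nonzero result with a 512-token model is exactly the silent truncation the checklist warns about.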
The headline: the best embedding model is the one that scores highest on your corpus, not on MTEB. A 30-minute bakeoff on 50 labeled pairs will tell you more than any leaderboard. Do it before you pick.
Frequently asked questions
Why don't MTEB leaderboard scores predict production RAG quality?
Because MTEB averages across many tasks and domains. A model that is best on average may be mediocre on your specific corpus. Production quality depends on the match between your document distribution and the model's training distribution, which the leaderboard cannot capture. Always test candidates on a labeled set from your real data before choosing.
What criteria should I use to pick an embedding model for RAG?
5 criteria: retrieval accuracy on your corpus (measured with a labeled eval set), embedding dimensionality (affects storage and query cost), max sequence length (should match your chunk size), cost per million tokens (or GPU requirements if self-hosted), and license compatibility (open vs API-only, data residency). Skip any of these and you pick the wrong model.
Which embedding models are worth testing in 2026?
BGE, E5, mxbai-embed-large for open-weight options, OpenAI text-embedding-3 if you prefer an API, and Voyage AI for code or domain-specific corpora. These 5 families cover 95 percent of production RAG workloads. Older models like all-MiniLM-L6-v2 are fast but significantly outperformed by the current generation.
How do you migrate to a new embedding model?
Re-embed the entire corpus with the new model, build a new index alongside the old one, validate quality on your eval set, then cut over. Do not migrate unless the new model beats the old one by at least 5 points on recall@5. Below that threshold the engineering cost (re-indexing, re-tuning thresholds, dual running) outweighs the quality gain.
When should I fine-tune an embedding model?
Rarely. Only when off-the-shelf models score below 50 on your eval set and you have thousands of labeled query-document pairs. Fine-tuning costs compute, risks overfitting, and usually beats only domain-specific baselines by 5 to 10 points. For most teams, picking a better pretrained model is a bigger win than fine-tuning.
Key takeaways
- MTEB leaderboard scores are averages across tasks. Your corpus is a single point in that distribution. Always test on your own data before picking.
- The 5 criteria that matter: accuracy on your eval set, dimensionality, max sequence length, cost, and license. Skip any one and you pick the wrong model.
- Build a 50-pair labeled eval set once. Reuse it for every future model decision. It is the single most valuable asset in your RAG stack.
- BGE, E5, mxbai, OpenAI text-embedding-3, and Voyage AI cover 95 percent of production use cases. Run a bakeoff between these 5 families to find your best match.
- Only migrate to a new model if it beats the old one by 5+ points on recall@5. Re-indexing, re-tuning thresholds, and dual running are real costs.
- To see embedding choice wired into a full production RAG stack with chunking, reranking, and grounding, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the current MTEB leaderboard and per-task breakdowns that help you narrow candidates by specific capability, see the MTEB leaderboard on Hugging Face. Use the per-task scores, not the average, when your workload is domain-specific.