Your retriever returns 5 chunks because the tutorial said 5

Every RAG tutorial uses k=5 for top-k retrieval. You shipped with k=5 because that is what the tutorial showed. Your answers are sometimes wrong because the retriever misses the 6th-ranked chunk that actually contains the answer. You try k=20 to be safe. Now answers are slower, cost more, and the LLM gets confused by irrelevant context. You go back to k=5 and accept the hit.

The fix is to tune k on your own data, not on a tutorial default. The right k depends on your chunk size, your embedding model quality, your question distribution, and your answer latency budget. In practice, the sweet spot is between 3 and 10 for most RAG systems, but the specific number requires measurement on your data.

This post walks through the 3-step k-tuning process: the eval set you need, the metrics that matter, and the sweet spot you will find between recall and cost.

Why does k matter more than tutorials admit?

Because k sets the trade-off between recall (did the retriever find the relevant chunks?) and precision (how many irrelevant chunks is the LLM wading through?). Too low and you miss relevant content. Too high and you dilute the signal. 3 specific failure modes:

  1. k too low (k=3): Recall@k drops below 80 percent on multi-hop questions. Relevant content is in position 4 or 5 and never reaches the LLM.
  2. k too high (k=20): Token cost triples, latency rises, LLM attention disperses across irrelevant chunks. Quality actually drops.
  3. Fixed k for all query types: Lookup questions need k=3, comparison questions need k=8. One number does not fit all.

Tuning k on your own eval set finds the number that maximizes answer quality within your cost budget.

graph LR
    K3[k=3: high precision, low recall] --> Miss[Misses relevant chunks]
    K20[k=20: high recall, low precision] --> Dilute[Dilutes LLM attention]
    K_tuned[k=tuned from eval] --> Sweet[Sweet spot: recall + cost]

    style Miss fill:#fee2e2,stroke:#b91c1c
    style Dilute fill:#fef3c7,stroke:#b45309
    style Sweet fill:#dcfce7,stroke:#15803d

What does the 3-step tuning process look like?

Step 1: build an eval set with ground truth

50-100 question+expected-document pairs. For each question, record which document(s) contain the answer. This is the tedious step, but you only do it once and it feeds every future tuning.

# filename: eval/retrieval_eval_set.py
# description: Labeled eval set for retriever tuning.
EVAL_SET = [
    {
        'question': 'What is our refund policy?',
        'relevant_doc_ids': ['policy/refunds.md'],
    },
    {
        'question': 'How do I configure the API rate limiter?',
        'relevant_doc_ids': ['docs/rate-limiting.md', 'examples/rate-limiter.py'],
    },
    # ... 50+ more
]

Step 2: measure recall at multiple k values

Run the retriever for each question at k=3, k=5, k=10, k=20. Compute recall@k = fraction of questions where at least one of the relevant_doc_ids is in the top k.

# filename: eval/tune_k.py
# description: Measure recall at multiple k values for retriever tuning.
async def measure_recall_at_k(eval_set, retriever, ks=(3, 5, 10, 20)):
    results = {k: 0 for k in ks}
    for item in eval_set:
        retrieved = await retriever.search(item['question'], k=max(ks))
        for k in ks:
            top_k_ids = {r.doc_id for r in retrieved[:k]}
            if any(rid in top_k_ids for rid in item['relevant_doc_ids']):
                results[k] += 1
    return {k: v / len(eval_set) for k, v in results.items()}

Typical output for a well-tuned retriever:

  • recall@3 = 0.72
  • recall@5 = 0.85
  • recall@10 = 0.93
  • recall@20 = 0.97

Step 3: find the knee of the curve

Plot recall vs k. The knee is where the curve flattens out. In the example above, going from 5 to 10 gains 8 points of recall. Going from 10 to 20 gains only 4. The knee is around 10.

Pick the lowest k that captures 95 percent of the recall you get at k=20. That is your sweet spot. In most RAG systems I have tuned, the sweet spot is between 5 and 10.
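The knee selection above can be sketched as a small helper. It assumes the recall dict shape produced by measure_recall_at_k; the 95 percent capture threshold is the rule of thumb from this section, not a universal constant:

```python
# description: Pick the lowest k that captures most of the max-k recall.
def pick_k(recall_at_k: dict[int, float], capture: float = 0.95) -> int:
    """Return the smallest k whose recall reaches `capture` of the recall at the largest k measured."""
    max_k = max(recall_at_k)
    target = capture * recall_at_k[max_k]
    # Smallest k that clears the target.
    return min(k for k, r in recall_at_k.items() if r >= target)

recall = {3: 0.72, 5: 0.85, 10: 0.93, 20: 0.97}
print(pick_k(recall))  # 10: recall 0.93 clears 0.95 * 0.97 ≈ 0.92
```

With the example numbers from Step 2, this lands on k=10, matching the knee read off the curve.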

How does k interact with cost and latency?

Linearly on both. Doubling k roughly doubles the context tokens sent to the final LLM, which roughly doubles token cost and increases latency. A k=10 system costs about 2x a k=5 system, so the recall gain has to justify the cost.

Rule of thumb: if going from k=5 to k=10 gains less than 5 points of recall, stay at k=5. If it gains 10+ points, move to k=10.
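The rule of thumb can be encoded directly. This is a sketch of the decision logic only; the 5-point and 10-point thresholds come from the rule above, and the middle band is left as a judgment call against your latency budget:

```python
# description: Encode the "is the recall gain worth 2x the cost?" rule of thumb.
def k_upgrade_decision(recall_low_k: float, recall_high_k: float) -> str:
    """Decide whether a higher k justifies its roughly doubled token cost."""
    gain = recall_high_k - recall_low_k
    if gain < 0.05:
        return "stay"          # gain does not justify ~2x token cost
    if gain >= 0.10:
        return "upgrade"       # gain clearly worth the cost
    return "judgment call"     # weigh against your latency and cost budget

print(k_upgrade_decision(0.85, 0.97))  # upgrade
print(k_upgrade_decision(0.85, 0.87))  # stay
```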

For the cost optimization that complements k tuning, see the Agent cost optimization from trace data post.

Should k be different for different query types?

Yes, if you can classify queries. 3 buckets:

  1. Lookup questions ("What is the refund policy?"): k=3 is enough. Single chunk usually contains the answer.
  2. Comparison questions ("How does X differ from Y?"): k=8-12. Need content from both sides.
  3. Multi-hop questions ("What led to X and who approved it?"): k=10-15. Content spans multiple documents.

Run a cheap classifier on the question first, then pick k dynamically. A classifier adds ~100 ms and saves 40-60 percent of token cost on lookup questions.
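A minimal sketch of dynamic k selection. The keyword heuristic here is a hypothetical stand-in for the cheap classifier (in production you would use a small model); the bucket-to-k map mirrors the numbers above:

```python
# description: Pick k dynamically based on a query-type classification.
K_BY_TYPE = {"lookup": 3, "comparison": 8, "multi_hop": 12}

def classify_query(question: str) -> str:
    """Toy keyword heuristic standing in for a cheap model-based classifier."""
    q = question.lower()
    if any(w in q for w in ("differ", "versus", " vs ", "compare")):
        return "comparison"
    if any(w in q for w in ("led to", "caused", "who approved")):
        return "multi_hop"
    return "lookup"

def k_for(question: str) -> int:
    return K_BY_TYPE[classify_query(question)]

print(k_for("What is the refund policy?"))   # 3
print(k_for("How does X differ from Y?"))    # 8
```

The win is on the common case: lookup-heavy traffic retrieves 3 chunks instead of 8-12, which is where the 40-60 percent token savings comes from.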

How do you avoid overfitting k to your eval set?

Split the eval set 70/30 into tuning and holdout. Tune k on the 70 percent, validate on the 30 percent. If recall at the chosen k drops significantly on the holdout, the eval set was not representative and you need more diverse questions.
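The split can be a deterministic shuffle so that tuning runs are reproducible. A minimal sketch, assuming the eval-set item format from Step 1; the seed and split fraction are illustrative defaults:

```python
# description: Deterministic 70/30 split of the eval set into tuning and holdout.
import random

def split_eval_set(eval_set: list, tune_frac: float = 0.7, seed: int = 42) -> tuple[list, list]:
    """Shuffle deterministically, then split into tuning and holdout slices."""
    items = list(eval_set)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(items)     # seeded shuffle -> reproducible split
    cut = int(len(items) * tune_frac)
    return items[:cut], items[cut:]

tune, holdout = split_eval_set(list(range(100)))
print(len(tune), len(holdout))  # 70 30
```

Tune k with measure_recall_at_k on the tuning slice, then re-measure at the chosen k on the holdout slice; a large recall drop on holdout is the overfitting signal described above.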

Also rotate the eval set every 3 months. Production queries drift, and a stale eval set hides drift. Add new questions from real user logs and retire old ones that no longer represent the workload.

For the broader RAG evaluation workflow, see the RAGAS evaluation for RAG pipelines post.

What to do Monday morning

  1. Build a labeled eval set of 50-100 question+expected-document pairs from real production queries. This is the one-time investment.
  2. Measure recall@k for k=3, 5, 10, 20. Plot the curve. Find the knee.
  3. Pick the lowest k that captures 95 percent of the k=20 recall. Ship it.
  4. If you have query classification, split k by query type. Save 40-60 percent on lookup questions.
  5. Rotate the eval set every 3 months so drift does not hide in your tuning.
  6. Alert on recall@k dropping below your baseline. A drop means the corpus or model changed and k might need re-tuning.
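Item 6 can be a one-line check in whatever monitoring job re-runs the eval set. A sketch; the 3-point tolerance is an assumption to tune to your own recall variance:

```python
# description: Flag when recall@k drifts below the tuned baseline.
def recall_alert(current_recall: float, baseline_recall: float, tolerance: float = 0.03) -> bool:
    """True when recall@k has dropped more than `tolerance` below baseline."""
    return (baseline_recall - current_recall) > tolerance

print(recall_alert(0.78, 0.85))  # True: 7-point drop, re-tune k
print(recall_alert(0.84, 0.85))  # False: within tolerance
```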

The headline: k-value tuning is a 3-step process with measurable outcomes. Eval set, recall curve, knee selection. Default k=5 is the tutorial answer. Your answer comes from your data.

Frequently asked questions

Why is the default k=5 usually wrong?

Because tutorials pick k=5 without any tuning on the actual corpus. The right k depends on chunk size, embedding quality, question distribution, and cost budget. For lookup-heavy systems k=3 is often sufficient. For complex multi-hop systems k=10-15 is needed. The only way to know is to measure recall@k on a labeled eval set from your own data.

How do I build a retriever eval set?

Sample 50-100 real questions from production logs. For each, manually identify which documents contain the answer. This labeled ground-truth is the tedious step, but you do it once and use it for every future tuning. Rotate it every 3 months so it reflects current production queries.

What is recall@k and why does it matter?

Recall@k is the fraction of questions for which at least one relevant document appears in the top k retrieved results. It directly measures whether your retriever is finding the right content. A recall@5 of 0.85 means 85 percent of your questions have their answer in the top 5 results. Higher is better.

Should I use different k values for different query types?

Yes. Lookup questions ("what is X") need k=3. Comparison questions ("X vs Y") need k=8-12. Multi-hop questions ("what led to X") need k=10-15. Classify the query first with a cheap model and pick k dynamically. This saves 40-60 percent of token cost on lookup-heavy traffic.

How often should I re-tune k?

Every 3 months, or whenever you change the embedding model, chunk size, or significantly rebuild the corpus. Production question distributions drift over time, and a k value tuned 6 months ago may no longer be optimal. Also rotate the eval set at the same cadence to reflect current queries.

Key takeaways

  1. The default k=5 from tutorials is a guess. The right k for your RAG system depends on your data and needs measurement.
  2. Build a labeled eval set with 50-100 question+expected-document pairs. One-time investment, pays off on every future retrieval change.
  3. Measure recall@k for k=3, 5, 10, 20. Find the knee of the curve. Pick the lowest k that captures 95 percent of k=20 recall.
  4. Split k by query type when possible. Lookup k=3, comparison k=8, multi-hop k=12. Saves cost on the common case.
  5. Rotate the eval set every 3 months. Production queries drift; stale eval sets hide drift.
  6. To see k tuning wired into a full production RAG evaluation pipeline with RAGAS, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the information retrieval textbook treatment of recall and precision trade-offs, see Manning's Introduction to Information Retrieval, Chapter 8.
