Your RAG scores 95 percent on relevancy and your users still complain

You ran RAGAs. Relevancy came back at 0.95. You shipped it, declared victory, and opened Slack. 30 minutes later the support channel is on fire because answers are confidently wrong. Relevant chunks were retrieved, but the final answer does not match reality.

The bug is that relevancy measures whether the retrieved chunks are topically related to the query. It does not measure whether the answer is actually correct. You need a second metric, one that compares the generated answer to a known-correct reference. That metric requires ground truth.

This post is the ground-truth-versus-relevancy pattern: what each metric actually measures, when to use each, how to build a ground-truth dataset without burning a week, and the 2 numbers every RAG team should track.

Why is relevancy alone not enough?

Because relevancy is a retrieval-quality metric, not an answer-quality metric. It answers "did we retrieve the right chunks," which is a necessary but not sufficient condition for a correct answer. 3 specific failure modes of relevancy-only evaluation:

  1. Correct retrieval, wrong synthesis. The LLM got the right chunks but combined them incorrectly. Relevancy says 0.95, the answer says something untrue.
  2. Retrieval that looks relevant but is not. Cosine similarity is high but the chunk does not actually answer the question. Relevancy says 0.85, the chunks are useless.
  3. No ground-truth comparison. Without a reference answer, you cannot measure answer correctness at all.

Ground truth closes the loop. Relevancy checks the retriever; ground truth checks the whole pipeline.

graph TD
    Query[Query] --> Retrieve[Retriever]
    Retrieve --> Chunks[Retrieved chunks]
    Chunks --> LLM[Generator LLM]
    LLM --> Answer[Generated answer]

    Chunks --> Relev[Relevancy metric]
    Answer --> GT[Ground truth comparison]

    Relev --> R1[Is retrieval good?]
    GT --> R2[Is final answer correct?]

    style Relev fill:#dbeafe,stroke:#1e40af
    style GT fill:#dcfce7,stroke:#15803d

What is relevancy actually measuring?

Relevancy compares the retrieved chunks to the query. Common implementations use an LLM judge to score "how relevant is this chunk to the question" on a 0-1 scale, or cosine similarity between the query embedding and the chunk embedding.

# filename: app/eval/relevancy.py
# description: Simple relevancy score using an LLM judge.
from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """On a scale of 0.0 to 1.0, how relevant is this chunk to the question?
Return ONLY a number.

Question: {question}
Chunk: {chunk}
"""


def score_relevancy(question: str, chunks: list[str]) -> float:
    """Average LLM-judge relevancy of each chunk to the question (0.0-1.0)."""
    if not chunks:
        return 0.0  # nothing retrieved means nothing relevant
    scores = []
    for chunk in chunks:
        reply = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=question, chunk=chunk,
            )}],
        )
        # Clamp in case the judge returns a value slightly outside the range.
        scores.append(min(1.0, max(0.0, float(reply.content[0].text.strip()))))
    return sum(scores) / len(scores)

Relevancy is fast and cheap. You do not need a reference answer. It runs on any query in production without prior labeling. That is why every RAG team starts here.
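The cosine-similarity variant mentioned above is cheaper still: no LLM call at all, just arithmetic on embeddings you already compute at retrieval time. A minimal sketch, assuming you supply the embeddings from whatever model your stack uses (the filename and function name here are hypothetical):

```python
# filename: app/eval/relevancy_cosine.py  (hypothetical path)
# description: Embedding-based relevancy, a cheaper alternative to the LLM judge.
import numpy as np


def cosine_relevancy(query_emb: np.ndarray, chunk_embs: list[np.ndarray]) -> float:
    """Mean cosine similarity between the query embedding and each chunk embedding.

    How you obtain the embeddings (your hosted or local embedding model) is up
    to your stack; this function only does the arithmetic.
    """
    if not chunk_embs:
        return 0.0  # nothing retrieved means nothing relevant
    q = query_emb / np.linalg.norm(query_emb)
    sims = [float(q @ (c / np.linalg.norm(c))) for c in chunk_embs]
    return sum(sims) / len(sims)
```

The trade-off is the one described in failure mode 2: high cosine similarity does not guarantee the chunk answers the question, so treat this as a cheap first-pass signal, not a replacement for the judge.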

What is ground truth and why do you need it?

Ground truth is a labeled reference answer for a query. "What is our refund policy?" has a ground-truth answer: "We refund within 30 days of purchase for unused items." When the RAG pipeline generates an answer, you compare it against the reference.

# filename: app/eval/ground_truth.py
# description: Score generated answer against a ground-truth reference.
from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """Compare the generated answer to the reference answer. Score on a scale of 0.0 to 1.0.
1.0 means the generated answer captures every fact in the reference.
0.0 means the generated answer is completely wrong or missing.
Return ONLY a number.

Question: {question}
Reference: {reference}
Generated: {generated}
"""


def score_against_ground_truth(question: str, reference: str, generated: str) -> float:
    """LLM-judge correctness of the generated answer versus the reference (0.0-1.0)."""
    reply = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=10,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, generated=generated,
        )}],
    )
    # Clamp in case the judge returns a value slightly outside the range.
    return min(1.0, max(0.0, float(reply.content[0].text.strip())))

Ground truth is slower to build and more expensive to run. You need labeled pairs. But it catches bugs relevancy cannot: incorrect synthesis, hallucinations, and partial answers that use the right chunks but miss key facts.

For the LLM-as-judge pattern these scorers build on, see the LLM-as-a-judge production evaluation framework post.

How do you build a ground-truth dataset without burning a week?

Start with 50 queries, not 500. Here is the sequence:

  1. Pull real queries from production logs. 50 recent user questions sampled across your top intent categories.
  2. Write reference answers with a subject-matter expert. 1 hour per 10 answers if the expert knows the domain. Total: 5 hours for 50.
  3. Store as a simple JSONL file. One line per query-reference pair.
  4. Version it in git. Ground truth drifts as your product changes. A git-tracked dataset is auditable.

# filename: data/eval/ground_truth.jsonl
# description: First 3 lines of a 50-row ground-truth dataset.
{"query": "how do I reset my password?", "reference": "Click Forgot Password on the login screen and check your email for the reset link."}
{"query": "what is the refund policy?", "reference": "Full refund within 30 days of purchase for unused items."}
{"query": "how do I export my data?", "reference": "Settings > Account > Export Data. The export is emailed as a CSV within 10 minutes."}

50 queries is enough to catch regressions. Add more as specific bugs appear. Do not try to label every possible query up front; that way lies never-shipping.
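Once the JSONL file exists, the nightly eval is a short loop over it. A minimal sketch (the filename is hypothetical; `generate` stands in for your RAG pipeline and `score` for a scorer like the ground-truth judge above):

```python
# filename: app/eval/run_nightly.py  (hypothetical path)
# description: Run the labeled set through the pipeline and average the scores.
import json
from pathlib import Path
from typing import Callable


def run_ground_truth_eval(
    dataset_path: str,
    generate: Callable[[str], str],           # your RAG pipeline: query -> answer
    score: Callable[[str, str, str], float],  # (question, reference, generated) -> 0.0-1.0
) -> float:
    """Average ground-truth correctness over every row of the JSONL dataset."""
    lines = Path(dataset_path).read_text().splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    scores = [score(r["query"], r["reference"], generate(r["query"])) for r in rows]
    return sum(scores) / len(scores)
```

Passing the pipeline and scorer in as callables keeps the runner testable: you can exercise it in CI with stubs before wiring in the real LLM calls.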

For the broader RAG evaluation framework, see the RAGAs evaluation for RAG pipelines post.

When should you use which metric?

Both, always, but at different cadences.

  • Relevancy: run continuously in production on every query. It is cheap and catches retriever regressions instantly.
  • Ground truth: run nightly on your 50-query labeled set. It catches generator regressions and end-to-end bugs.

3 specific decisions:

  1. If relevancy drops but ground truth is stable: your retriever changed (new index, chunk size, embedding model). Investigate retrieval.
  2. If ground truth drops but relevancy is stable: your generator changed (new prompt, new model). Investigate synthesis.
  3. If both drop: something upstream changed (data ingestion, document format). Investigate the pipeline.

Having both metrics turns a vague "RAG is broken" into a 3-way diagnostic.
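That 3-way diagnostic is mechanical enough to encode directly. A sketch, assuming you track the delta (current minus previous) for each metric; the filename, function name, and threshold default are illustrative, not from any library:

```python
# filename: app/eval/diagnose.py  (hypothetical path)
# description: Turn the two metric deltas into a first-pass diagnosis.

def diagnose(relevancy_delta: float, ground_truth_delta: float, threshold: float = -0.05) -> str:
    """Map score changes (current minus previous) to the component to investigate.

    The -0.05 default mirrors the 5 percent alerting rule used in this post;
    tune it to your own noise floor.
    """
    relevancy_dropped = relevancy_delta <= threshold
    ground_truth_dropped = ground_truth_delta <= threshold
    if relevancy_dropped and ground_truth_dropped:
        return "pipeline"   # upstream change: ingestion, document format
    if relevancy_dropped:
        return "retriever"  # new index, chunk size, embedding model
    if ground_truth_dropped:
        return "generator"  # new prompt, new model
    return "ok"
```

Wiring this into your alerting means the page that wakes someone up already names the component to look at.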

Why is answer correctness not enough on its own?

Because if you only measure answer correctness, you cannot tell whether a failure is a retriever bug or a generator bug. Relevancy localizes the problem. A production RAG team needs both signals to debug quickly, which is why RAGAs ships both metrics as first-class citizens.

For the complementary faithfulness metric that checks if the answer is grounded in the retrieved chunks, see the Hallucination testing for RAG pipelines post.

What to do Monday morning

  1. Add relevancy scoring to your RAG pipeline if you do not already have it. Haiku plus the 10-line judge prompt is enough.
  2. Sit down with a domain expert for 5 hours and label 50 ground-truth pairs from production logs.
  3. Store the pairs in a JSONL file in git. Version it.
  4. Add a nightly job that runs the full RAG pipeline on the 50 queries and scores against ground truth. Alert on any drop of more than 5 percent.
  5. Run both metrics on your last 2 production deploys. If either dropped, roll back the change that caused it.
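The 5 percent alert in step 4 reduces to a one-line gate your nightly job can call before deciding to fail the build. A minimal sketch (hypothetical filename and function name), treating the drop as relative to the previous run:

```python
# filename: app/eval/gate.py  (hypothetical path)
# description: Fail the nightly job when correctness regresses past the threshold.

def passes_gate(current: float, previous: float, max_drop: float = 0.05) -> bool:
    """True if the current average has not dropped more than max_drop
    (relative) below the previous run's average."""
    if previous <= 0:
        return True  # no meaningful baseline yet; let the first run establish one
    return current >= previous * (1 - max_drop)
```

Whether "5 percent" means relative or absolute is a team choice; pick one, document it, and keep it consistent between the alert and the gate.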

The headline: relevancy and ground truth measure different things. Relevancy catches retriever bugs in real time. Ground truth catches generator bugs nightly. You need both, you do not need 500 labeled pairs, and 50 is enough to start.

Frequently asked questions

What is the difference between relevancy and ground truth in RAG?

Relevancy scores whether the retrieved chunks are topically related to the query, usually on a 0-1 scale from an LLM judge or cosine similarity. Ground truth compares the generated final answer to a known-correct reference answer. Relevancy checks the retriever; ground truth checks the full pipeline including synthesis. You need both to debug production RAG systems.

How many ground-truth examples do I need?

50 is the minimum to catch regressions. 200 gives you stable numbers across query categories. Do not start with 500; you will spend a week labeling and then find out half your labels are wrong. Start at 50, run evals, add more ground truth for the specific bug classes you find.

Can I auto-generate ground truth with an LLM?

Partially. An LLM can draft reference answers from your source documents, but a human expert must review every one before it becomes ground truth. Auto-generated references contain the same hallucinations your RAG pipeline produces, which makes them useless for catching generator bugs.

Which RAG evaluation metric should I use first?

Relevancy. It is cheap, needs no labeled data, and runs in production on every query. Ship relevancy first, watch for a week, then layer in ground truth on a 50-row labeled set for nightly evals.

How often should I run ground-truth evaluation?

Nightly in CI, and on every significant change to the retriever, generator, or source documents. Run the full 50-query suite and fail the build if average correctness drops by more than 5 percent compared to the previous run. This catches regressions before they reach users.

Key takeaways

  1. Relevancy measures retriever quality; ground truth measures end-to-end answer correctness. Both are necessary; neither is sufficient alone.
  2. Relevancy is cheap and runs on any query without labels. Use it continuously in production to catch retriever regressions.
  3. Ground truth needs human-labeled reference answers. Use it nightly on a small labeled set (start at 50) to catch generator regressions.
  4. The combination localizes bugs: relevancy drops mean retriever issues, ground truth drops mean generator issues, both dropping means pipeline-level issues.
  5. Store ground truth as git-versioned JSONL. It is auditable and easy to extend as you find new failure modes.
  6. To see ground truth and relevancy wired into a full production RAG stack, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the RAGAs metric definitions and reference implementations, see the RAGAs documentation.
