RAG evaluation: proving your system actually works
We've spent a lot of time building complex RAG systems. We've optimized chunking, added web search, and even built self-correcting agents. Each step felt like an improvement.
But how do we know for sure?
If you change your chunking strategy, how can you prove to your boss that it resulted in a 10% increase in answer quality? Relying on a few example queries isn't enough. A change that improves one answer might make ten others worse.
To build reliable AI, we must move from anecdotal evidence ("it feels better") to quantitative measurement. RAG evaluation is the science of defining what "good" means and then systematically scoring our system against those criteria. It's the difference between being a hobbyist and an engineer.
The RAG triad: what do we measure?
Before we test, we need to know what to measure. A "good" RAG system balances three core components, often called the RAG Triad:
- Context Relevance (Retrieval Quality): Did our retriever find the right information? Did it find all the info needed (Context Recall) without including irrelevant junk (Context Precision)?
- Faithfulness (Generation Quality): Did the LLM's answer stick to the facts from the retrieved context? Or did it "hallucinate" and make things up?
- Answer Correctness (Overall Quality): Was the final answer actually correct and relevant to the user's question?
```mermaid
graph TD
    A[User Query] --> B(1. Retrieval)
    B -- "Did we find the right info?" --> C(Context Relevance)
    B --> D(2. Generation)
    D -- "Did the LLM stick to the context?" --> E(Faithfulness)
    D -- "Was the final answer correct?" --> F(Answer Correctness)
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#f9f,stroke:#333,stroke-width:2px
```
We need to score our system on all of these metrics to get a complete picture.
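To build intuition for the retrieval-side metrics, here is a minimal sketch in plain Python. Note the assumption: Ragas actually computes these with an LLM judge over sentences and claims, while this toy version simplifies to overlap between retrieved chunk IDs and a hand-labeled set of relevant IDs (`retrieved`, `relevant`, and the doc IDs are illustrative names).

```python
# Toy versions of the retrieval metrics, assuming we know which
# chunk IDs are actually relevant for a query (hand-labeled ground truth).

def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant (no junk)."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant chunks that we managed to retrieve."""
    if not relevant_ids:
        return 1.0  # nothing was needed, so nothing was missed
    hits = sum(1 for cid in relevant_ids if cid in retrieved_ids)
    return hits / len(relevant_ids)

retrieved = ["doc1", "doc7"]   # what the retriever returned
relevant = {"doc1", "doc2"}    # what it should have returned

print(context_precision(retrieved, relevant))  # 0.5 (one of two results is junk)
print(context_recall(retrieved, relevant))     # 0.5 (one needed chunk was missed)
```

The point of separating the two: low precision tells you to filter or re-rank, low recall tells you to retrieve more (or chunk differently).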
How do you build a test pipeline and test questions?
To evaluate a system, we first need a system to test. We'll build a simple RAG pipeline based on a tiny knowledge base.
```python
# filename: example.py
# description: A minimal RAG pipeline to evaluate.
import chromadb

# 1. Our "knowledge base" (just a few facts)
documents = [
    "The first wizarding war ended when Lord Voldemort's Killing Curse rebounded...",
    "Harry Potter was left with a lightning-bolt scar...",
    "The three Unforgivable Curses are the Imperius Curse, the Cruciatus Curse, and the Killing Curse."
]

# Add the docs to an in-memory ChromaDB collection
client = chromadb.Client()
collection = client.create_collection("lore")
collection.add(documents=documents, ids=[f"doc{i}" for i in range(len(documents))])

# 2. Our simple RAG pipeline function
def simple_rag_pipeline(question):
    # Retrieve the two most similar documents
    retrieved_docs = collection.query(query_texts=[question], n_results=2)['documents'][0]
    context = "\n".join(retrieved_docs)
    # Generate an answer grounded in that context
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    answer = call_llm(prompt)  # call_llm is our function to talk to OpenAI
    return {"answer": answer, "contexts": retrieved_docs}
```
Next, and most importantly, we need an evaluation dataset. This includes questions and, for the best results, the "perfect" answers (called ground_truth).
Crucially, we must include questions that cannot be answered by our documents. This tests if the system knows what it doesn't know.
# Our "test sheet"
eval_questions = [
"What caused the first wizarding war to end?",
"What are the Unforgivable Curses?",
"Who was the Minister for Magic during the first war?" # Not in our documents!
]
ground_truths = [
"The war ended when Voldemort's Killing Curse rebounded on him.",
"The Imperius, Cruciatus, and Killing Curses.",
"The provided context does not mention the Minister for Magic."
]
How do you run the evaluation with Ragas?
We could grade these answers by hand, but that's slow and subjective. Instead, we'll use a framework called Ragas.
Ragas acts as an automated "judge". It uses powerful LLMs (such as GPT-4) to read the question, the retrieved contexts, the generated answer, and the ground_truth, and then scores our system on the metrics we defined.
First, we run our pipeline to get the outputs we want to score:
```python
# 1. Run our pipeline on all questions
generated_data = []
for q in eval_questions:
    result = simple_rag_pipeline(q)
    generated_data.append(result)

# 2. Format the data for Ragas
from datasets import Dataset

ragas_dataset_dict = {
    'question': eval_questions,
    'answer': [d['answer'] for d in generated_data],
    'contexts': [d['contexts'] for d in generated_data],
    'ground_truth': ground_truths
}
ragas_dataset = Dataset.from_dict(ragas_dataset_dict)
```
Now, we just hand this dataset to Ragas and ask it to evaluate.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, context_recall, context_precision

# 3. Run the evaluation!
result = evaluate(
    dataset=ragas_dataset,
    metrics=[
        context_precision,   # Is the retrieved context free of irrelevant junk?
        context_recall,      # Did we retrieve all the needed info?
        faithfulness,        # Did the answer stick to the context?
        answer_correctness,  # Was the answer correct?
    ]
)

# 4. Display the per-question scores as a table
print(result.to_pandas())
```
The output is a clean table of scores for each question, showing us exactly where our system is weak. For the "Minister for Magic" question, the context_recall score would be very low (close to 0.0) because our retriever failed to find the necessary information, and Ragas correctly flags this failure.
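Once you have that table, triaging weak spots is a one-liner. A hypothetical follow-up (the scores below are illustrative, mirroring the failure pattern just described, not real Ragas output):

```python
# Flag questions where retrieval missed required information.
# In practice, df would come from result.to_pandas().
import pandas as pd

df = pd.DataFrame({
    "question": ["war", "curses", "minister"],
    "context_recall": [1.0, 1.0, 0.0],
    "answer_correctness": [0.95, 0.51, 0.88],
})

# Any question scoring below the threshold needs a retrieval fix,
# not a prompt fix -- the metric tells you which knob to turn.
weak_retrieval = df[df["context_recall"] < 0.5]
print(weak_retrieval["question"].tolist())  # ['minister']
```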
How do you prove an optimization works with A/B testing?
This is where evaluation becomes a superpower. Let's make a change and prove if it's better.
Hypothesis: Retrieving more documents (n_results=3 instead of 2) will improve our answers.
We create an advanced_rag_pipeline that's identical, but retrieves 3 documents. We run it, evaluate it, and compare the results.
| Question | RAG (k=2) answer_correctness | RAG (k=3) answer_correctness |
|---|---|---|
| "What caused the war to end?" | 0.95 | 0.96 |
| "What are the Curses?" | 0.51 | 0.92 |
| "Who was the Minister?" | 0.88 | 0.88 |
Conclusion: Look at that! For the "Curses" question, retrieving only 2 documents wasn't enough, so the answer was incomplete (score: 0.51). Our "advanced" pipeline, retrieving 3 documents, got all the needed context and scored a 0.92.
We have just quantitatively shown that our change (k=3) is a measurable improvement on this test set.
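The same comparison can be reduced to a single aggregate number. A minimal sketch, reusing the answer_correctness scores from the table above (in practice both columns come from separate `ragas.evaluate` runs over the same eval_questions):

```python
# Compare two evaluation runs: per-question deltas plus the aggregate.
scores_k2 = {"war": 0.95, "curses": 0.51, "minister": 0.88}
scores_k3 = {"war": 0.96, "curses": 0.92, "minister": 0.88}

def mean(scores):
    return sum(scores.values()) / len(scores)

# Per-question deltas catch regressions that an average would hide
delta = {q: round(scores_k3[q] - scores_k2[q], 2) for q in scores_k2}
print(delta)  # {'war': 0.01, 'curses': 0.41, 'minister': 0.0}

# Aggregate improvement across the whole test sheet
print(round(mean(scores_k3) - mean(scores_k2), 2))  # 0.14
```

Checking per-question deltas before the mean matters: a change that lifts the average while tanking one question is a regression in disguise.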
Frequently asked questions
Why can't I just manually test my RAG system?
Manual testing with a few examples misses regressions. A change that fixes one answer might break ten others. You need an evaluation dataset (hundreds of questions) and automated scoring to catch failures across context relevance, faithfulness, and answer correctness. Ragas provides that systematic approach. This is the difference between hobbyist systems relying on anecdotal evidence and reliable, engineer-grade architecture.
What's the difference between context relevance and answer correctness?
Context relevance measures retrieval quality: did the retriever find needed information? Answer correctness measures final output: is the answer actually right? These are independent failures. You can retrieve perfectly but have the LLM hallucinate, or answer correctly by luck despite poor retrieval. The RAG Triad separates these so you know which part of your pipeline to optimize.
How do I prove my RAG optimization actually worked?
Use A/B testing with Ragas evaluation. Create two pipeline versions (e.g., k=2 vs k=3 retrieval depth), run both on your evaluation dataset, and compare metric scores. The example above shows retrieving 3 documents improved one question from 0.51 to 0.92. Ragas acts as the automated judge, eliminating manual scoring. This is quantitative proof, not an anecdotal improvement claim.
For the full reference, see the Ragas documentation.
Key takeaways
- If you can't measure it, you can't improve it: Evaluation is the core discipline of building production AI. Stop "feeling" and start measuring.
- The RAG triad is your guide: Focus on your Retrieval Quality (Context Precision/Recall) and your Generation Quality (Faithfulness/Correctness) to get a complete picture.
- Frameworks automate judging: Tools like Ragas automate the complex, expensive task of LLM-based evaluation, letting you test and iterate rapidly.
- Evaluation is comparative: The true power of evaluation is in A/B testing, proving that a change to your system leads to a measurable improvement.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
Take the next step
- RAG Fundamentals Workshop: Build and evaluate a production RAG pipeline hands-on
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.