RAGAS evaluation for RAG pipelines: a practical guide
You changed the retriever and you have no idea if it got better
You swapped your embedding model, tuned your chunk size, added a reranker, and shipped. A week later, user feedback says the answers got worse. You open the eval notebook you wrote 3 months ago and realize it has 10 questions, all handpicked, and no ground truth. You cannot prove the new pipeline is better or worse. You cannot prove the old one was good either. You are flying blind.
This is the state of most RAG eval in production. Teams ship changes without numbers and hope for the best. RAGAS is the cheapest way to stop doing that. It gives you 4 metrics that score a pipeline automatically, using an LLM as judge, without needing a hand-graded ground truth for every question.
This post is the pattern, the metrics, the trap, and the 40 lines of code that turn a week of retriever changes into a number you can defend.
Why do most RAG evals lie to you?
Because they measure the wrong thing. A typical early-stage RAG eval looks like "did the final answer mention the right word." That is an exact-match test. It fails in 2 directions at once: it penalizes a correct answer that used different wording, and it rewards a wrong answer that happened to contain the keyword.
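Both failure directions are easy to reproduce with a toy keyword check (the strings and the `keyword_eval` helper here are hypothetical, for illustration only):

```python
# Hypothetical exact-match eval: pass if the answer contains the keyword.
def keyword_eval(answer: str, keyword: str) -> bool:
    return keyword.lower() in answer.lower()

# Correct answer, different wording: penalized.
print(keyword_eval("Sessions are verified against a Redis TTL lookup", "token"))  # False

# Wrong answer that happens to mention the keyword: rewarded.
print(keyword_eval("The token is never validated at all", "token"))  # True
```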
The deeper problem is that a RAG answer has 4 places where quality can break, not one:
- The retriever fetched the wrong chunks.
- The retriever fetched the right chunks but missed a relevant one.
- The LLM ignored the retrieved chunks and hallucinated.
- The LLM used the chunks but produced an answer that did not address the question.
A single accuracy number cannot tell you which of those 4 broke. You need a metric per failure mode, and RAGAS gives you exactly that.
```mermaid
graph TD
    Q[Question + golden answer] --> R[Retriever]
    R -->|chunks| G[Generator LLM]
    G --> A[Predicted answer]
    A --> M1[faithfulness]
    R --> M1
    A --> M2[answer relevancy]
    Q --> M2
    R --> M3[context precision]
    Q --> M3
    R --> M4[context recall]
    Q --> M4
    style M1 fill:#dbeafe,stroke:#1e40af
    style M2 fill:#dcfce7,stroke:#15803d
    style M3 fill:#fef3c7,stroke:#b45309
    style M4 fill:#fce7f3,stroke:#9f1239
```
4 metrics, 4 failure modes. You can debug which layer of the pipeline regressed instead of guessing.
What are the 4 RAGAS metrics that actually matter?
Faithfulness measures whether the generated answer is supported by the retrieved context. An LLM judge splits the answer into claims and checks each one against the context. A claim not supported by any retrieved chunk drags the score down. This is your hallucination detector.
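The claim extraction and the per-claim verdicts come from the judge LLM; the final score is just the supported fraction. A minimal sketch of that aggregation step (the verdict list stands in for hypothetical judge outputs):

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of claims in the answer that the judge found supported
    by at least one retrieved chunk."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# The judge split the answer into 4 claims and found 1 unsupported:
print(faithfulness_score([True, True, True, False]))  # 0.75
```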
Answer relevancy measures whether the answer actually addresses the question. Again an LLM judge, but this time it rewrites the answer into questions it could plausibly be answering and measures how close those are to the original. A tangential answer scores low even if it is factually correct.
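Concretely, the score is the mean embedding similarity between the judge's reverse-engineered questions and the original question. A sketch with toy 2-d vectors (a real pipeline would use your embedding model, not hand-written vectors):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def answer_relevancy(question_emb: list[float],
                     generated_q_embs: list[list[float]]) -> float:
    """Mean similarity between the original question and the questions
    the judge generated from the answer."""
    sims = [cosine(question_emb, g) for g in generated_q_embs]
    return sum(sims) / len(sims)

# One generated question on-topic, one tangential:
print(answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```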
Context precision measures whether the top-ranked chunks are relevant. For each chunk, an LLM judge decides if it contains information that helps answer the question. A retriever that pulls a lot of noise in the top 5 has low precision. This is your reranker metric.
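The aggregation is roughly a rank-weighted precision: the mean of precision@k over the positions holding a relevant chunk, so noise near the top of the ranking hurts more than noise at the bottom. A sketch, with the per-chunk relevance verdicts standing in for judge output:

```python
def context_precision(relevance: list[bool]) -> float:
    """Mean precision@k over the ranks that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

# Relevant chunks at ranks 1 and 3 out of 4 retrieved:
print(context_precision([True, False, True, False]))  # (1/1 + 2/3) / 2 = 0.833...
```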
Context recall measures whether all the information needed for the golden answer is present somewhere in the retrieved chunks. This one needs a ground-truth answer to compare against. A retriever that misses a necessary chunk has low recall. This is your "did we retrieve enough" metric.
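Structurally this is the mirror image of faithfulness: the same supported-fraction aggregation, but over the golden answer's claims instead of the generated answer's. A sketch with hypothetical attribution verdicts:

```python
def context_recall(gt_claim_found: list[bool]) -> float:
    """Fraction of ground-truth claims attributable to some retrieved
    chunk. This is the metric that needs a golden answer."""
    if not gt_claim_found:
        return 0.0
    return sum(gt_claim_found) / len(gt_claim_found)

# The golden answer had 3 claims; the retriever missed the chunk for one:
print(context_recall([True, True, False]))  # 2/3
```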
Together these 4 isolate where your pipeline broke. Faithfulness dropped? The generator is hallucinating. Context recall dropped? The retriever is missing chunks. Context precision dropped? The retriever is pulling noise. Answer relevancy dropped? The prompt or the model is drifting off-topic.
How do you run RAGAS on your pipeline in 40 lines?
Build a small eval dataset, run your pipeline on each question to collect the predicted answer and the retrieved contexts, then hand the whole thing to RAGAS.
```python
# filename: eval.py
# description: Run your RAG pipeline against an eval set and score
# each output with the 4 RAGAS metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

from app.rag import answer  # your pipeline: (question) -> (answer, contexts)

EVAL_SET = [
    {
        'question': 'How does the auth middleware validate a session token?',
        'ground_truth': 'It looks up the token in Redis and checks the TTL.',
    },
    {
        'question': 'What is the default rate limit per user?',
        'ground_truth': '100 requests per minute.',
    },
    # ... 50 more
]

def build_dataset() -> Dataset:
    rows = []
    for item in EVAL_SET:
        pred, contexts = answer(item['question'])
        rows.append({
            'question': item['question'],
            'answer': pred,
            'contexts': contexts,
            'ground_truth': item['ground_truth'],
        })
    return Dataset.from_list(rows)

def run():
    ds = build_dataset()
    result = evaluate(ds, metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ])
    print(result)

if __name__ == '__main__':
    run()
```
The output is a table with one row per metric and an average score between 0 and 1. Run it before a change, run it after, compare. That comparison is the number you were missing.
Notice that answer returns both the answer string and the retrieved contexts. RAGAS needs the contexts because 3 of the 4 metrics judge the retriever, not the generator. If your pipeline does not expose contexts, this is the refactor that unlocks real eval.
This eval pattern is the backbone of the Agentic RAG Masterclass, which walks through building an eval set, running RAGAS in CI, and catching regressions before they ship. If you are still building your first retrieval pipeline, the free RAG Fundamentals primer is the right starting point.
How do you build an eval set that is not a lie?
An eval set is a lie when it was written by the person who built the pipeline. You unconsciously write questions your pipeline happens to answer well and skip the ones you know it fails on. Fix this with 3 rules.
First, pull questions from real user logs, not your imagination. Anonymize them if you have to. Real questions are messy, vague, and underspecified. That is exactly what you need to evaluate.
Second, write ground-truth answers by hand from the source documents, not from what your current pipeline generates. If you copy your pipeline's output into the ground truth, you are measuring how well the pipeline matches itself. You need an independent reference.
Third, cover all 4 failure modes. Include easy questions that test faithfulness, vague questions that test rewriting, multi-hop questions that test recall, and adversarial questions that test precision. A flat distribution of easy lookups will give you a flat, useless eval.
A good starting size is 50 questions. Below that, variance swamps signal. Above 200, the LLM-judge cost gets noticeable. Start at 50, add 10 more whenever you find a new failure mode in production, and trim outdated questions every quarter.
How does RAGAS handle ground truth and reference-free metrics?
This is the subtle part. 2 of the 4 metrics need ground truth (context recall, and optionally answer correctness). 2 do not (faithfulness, answer relevancy, context precision). That split matters because ground-truth writing is the expensive part of eval, so you want to use reference-free metrics wherever you can.
Reference-free means the LLM judge only sees the question, the retrieved context, and the predicted answer. No golden reference. It still produces a score by reasoning about whether the answer is grounded and relevant.
Use reference-free metrics on the large half of your eval set (say, 100 to 500 questions pulled from logs) and reference-based metrics on a smaller curated set where you have taken the time to write golden answers (say, 50 questions). This hybrid gets you broad coverage on cheap metrics and deep evaluation on the curated core.
For the broader grounding pattern that plays well with RAGAS eval, see the JSON Output Parsing for RAG: Grounding with Pydantic post. A grounded-output pipeline is easier to evaluate because faithfulness becomes a substring check.
What is the trap that makes RAGAS scores misleading?
Using the same LLM to generate answers and to judge them. If your pipeline uses Claude Sonnet and you run RAGAS with Claude Sonnet as the judge, the judge shares the same biases, training data, and failure modes as the generator. It will mark its own mistakes as correct more often than a neutral judge would.
Fix this by using a different model family for judging. If the pipeline runs on Claude, judge with GPT-4. If the pipeline runs on GPT, judge with Claude. For cheaper eval, use a mid-tier model from the other family (GPT-4o-mini, Claude Haiku) instead of the flagship.
The second trap is running RAGAS on too few questions. With 10 questions, a single flaky LLM judgment moves the score by 10 percentage points. You cannot compare runs at that noise level. 50 is the minimum for stable numbers, 100 is comfortable, 200 is diminishing returns.
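Those thresholds can be sanity-checked with the binomial standard error, treating each judgment as a coin flip with the pipeline's true pass rate (a simplification, but the right order of magnitude):

```python
import math

def judge_std_error(p: float, n: int) -> float:
    """Standard error of the mean of n Bernoulli(p) judgments."""
    return math.sqrt(p * (1 - p) / n)

# Assume a pipeline whose true pass rate is 0.8:
for n in (10, 50, 200):
    print(n, round(judge_std_error(0.8, n), 3))
# 10  -> 0.126 (one standard error is a ~13-point swing)
# 50  -> 0.057
# 200 -> 0.028
```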
What to do Monday morning
- Pull 50 anonymized questions from your production logs. Write ground truth for them by hand from the source documents. This is the painful step. Do it once.
- Refactor your pipeline so `answer(question)` returns both the predicted string and the list of retrieved context chunks. Without that refactor, RAGAS cannot score retrieval.
- Add the `eval.py` script from this post. Run it against your current pipeline to establish a baseline.
- Change one thing (new embedding model, new chunk size, new reranker). Run eval again. Compare the 4 metrics. Ship the change only if no metric regressed by more than 2 points.
- Wire the eval script into CI. Every pull request that touches the RAG pipeline should print the 4 metrics in the PR checks. This is what turns "we think it's better" into "we can prove it."
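The CI gate from the last two steps can be sketched in a few lines. The metric names and the 2-point threshold follow the post; storing the baseline as a checked-in dict (e.g. loaded from a committed `baseline.json`) is an assumption about your repo layout:

```python
# Sketch of a CI regression gate over the 4 RAGAS scores.
METRICS = ('faithfulness', 'answer_relevancy', 'context_precision', 'context_recall')

def regressions(baseline: dict, current: dict, threshold: float = 0.02) -> list[str]:
    """Return the metrics that dropped by more than `threshold` (2 points)."""
    return [m for m in METRICS if baseline[m] - current[m] > threshold]

baseline = {'faithfulness': 0.91, 'answer_relevancy': 0.88,
            'context_precision': 0.75, 'context_recall': 0.82}
current  = {'faithfulness': 0.92, 'answer_relevancy': 0.87,
            'context_precision': 0.71, 'context_recall': 0.83}

failed = regressions(baseline, current)
print(failed)  # ['context_precision'] -> fail the PR check
```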
The headline: RAGAS gives you 4 numbers that isolate which layer of your RAG pipeline broke. Without those numbers, every change is a guess. With them, every change is a measurement.
Frequently asked questions
What is RAGAS?
RAGAS is an open-source evaluation framework for RAG pipelines. It uses LLMs as judges to score predicted answers and retrieved contexts against ground truth and against the question itself. The 4 core metrics are faithfulness, answer relevancy, context precision, and context recall. Together they isolate which layer of a RAG pipeline regressed when a change ships.
What are the most important RAGAS metrics?
Faithfulness (does the answer hallucinate), answer relevancy (does the answer address the question), context precision (are the top chunks relevant), and context recall (did the retriever find all needed chunks). Faithfulness and relevancy measure the generator; precision and recall measure the retriever. Together they cover all 4 places where a RAG pipeline can break, which is why you need all 4, not just one.
Do I need ground truth for RAGAS?
Only for context recall and answer correctness. Faithfulness, answer relevancy, and context precision are reference-free and need only the question, retrieved context, and predicted answer. Use reference-free metrics on a large pool of anonymized production questions, and reserve ground-truth metrics for a smaller curated set of 50 or so questions you have hand-graded.
How large should a RAGAS eval set be?
50 questions is the minimum for stable numbers. Below that, noise from individual LLM judgments swamps the signal and run-to-run variance is larger than the effect you are trying to measure. 100 is comfortable, 200 is diminishing returns. Grow the set whenever you find a new production failure mode that is not represented yet.
Can I use the same LLM for my pipeline and for RAGAS judging?
No. A judge from the same model family shares biases with the generator and will mark its own mistakes as correct more often than a neutral judge. Use a different model family for judging. If your pipeline runs on Claude, judge with GPT-4 or a mid-tier OpenAI model. If it runs on GPT, judge with Claude. Cross-family judging gives you honest scores.
Key takeaways
- A single accuracy number cannot tell you which layer of a RAG pipeline regressed. You need a metric per failure mode.
- RAGAS gives you 4 metrics that isolate retrieval precision, retrieval recall, hallucination, and answer relevancy. Together they cover the whole pipeline.
- Build an eval set from real user logs with hand-written ground truth. 50 questions minimum, pulled from production not imagination.
- Use reference-free metrics on a large log-derived set and reference-based metrics on a smaller curated core. Hybrid eval keeps the cost low.
- Never judge with the same model family as the generator. Cross-family judging is the difference between honest and self-congratulatory scores.
- To see this eval loop wired into a full production agentic RAG stack alongside reranking and self-correction, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the full RAGAS documentation, API reference, and metric definitions, see the official RAGAS docs. The repo includes reference implementations of every metric in this post plus several advanced ones.