Your RAG pipeline never says "I don't know" and you wonder if it is hallucinating

Your pipeline handles every query confidently. Users ask in-scope questions and get good answers. Users ask out-of-scope questions and also get confident-sounding answers that are entirely made up. You cannot tell the difference from the outside because the model's tone is identical either way.

The fix is systematic hallucination testing. Build 3 test sets: in-scope questions with known answers, out-of-scope questions that have no answer in your corpus, and adversarial questions designed to trip grounding. Measure how often the pipeline hallucinates on each set. The metric that matters is "unsupported claim rate," and it should be near zero on out-of-scope.

This post is the hallucination testing pattern for RAG: how to build each test set, the metric, the common hallucination modes, and the 5-minute CI integration that catches regressions before they ship.

Why is hallucination so hard to detect without testing?

Because hallucinations are semantic, not syntactic. The output looks correct; only a human or a second LLM can tell it is made up. 3 specific failure modes of ad-hoc testing:

  1. Confirmation bias. You test with queries you think have good answers. The model sounds confident. You miss the hallucinations on queries outside your test set.

  2. No baseline for "I don't know." If you never test out-of-scope queries, you never see whether the model appropriately refuses.

  3. Silent regression. A prompt change or model swap can increase hallucination rate without any visible failure. Your eval set does not catch it because it focused on recall and relevance.

Systematic hallucination testing fixes all 3 by explicitly including queries that should NOT have answers and measuring how the pipeline handles them.

graph TD
    Test[Hallucination test set] --> InScope[In-scope: 50 queries with known answers]
    Test --> OutScope[Out-of-scope: 50 queries with no answer in corpus]
    Test --> Adversarial[Adversarial: 20 queries designed to trip grounding]

    InScope --> Pipeline[RAG pipeline]
    OutScope --> Pipeline
    Adversarial --> Pipeline

    Pipeline --> Check[Fact-check each answer]
    Check --> Metric[Unsupported claim rate per set]

    style OutScope fill:#fef3c7,stroke:#b45309
    style Metric fill:#dcfce7,stroke:#15803d

What are the 3 test sets?

In-scope set

50 questions whose answers are present in your corpus. For each, record the expected answer. Metrics: correctness and unsupported claim rate. Targets: correctness above 0.85, unsupported claim rate under 5 percent.

Out-of-scope set

50 questions that look plausible but have NO answer in your corpus. Example: if your corpus is internal engineering docs, out-of-scope queries are about sports, recent news, or the user's personal life. The pipeline should refuse or say "I don't know." Metric: refusal rate. Target: refusal rate above 90 percent.
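Scoring refusals by hand does not scale past the first run. A crude lexical heuristic gets you a first number; a production setup should replace it with an LLM judge. Everything below (the marker phrases, the function names) is an illustrative sketch, not part of the pipeline described above:

```python
# Heuristic refusal detector -- an illustrative stand-in for an LLM judge.
# The marker list is an assumption; calibrate it against real refusals.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "not covered in the",
    "no information in the",
    "cannot answer",
    "outside the scope",
)


def looks_like_refusal(answer: str) -> bool:
    """True if the answer contains a known refusal phrase (case-insensitive)."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(answers: list[str]) -> float:
    """Fraction of answers classified as refusals."""
    return sum(looks_like_refusal(a) for a in answers) / len(answers)
```

Before trusting the number, review 20-30 real refusals by hand and extend the marker list until the heuristic agrees with your judgment.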

Adversarial set

20 questions designed to trip grounding. Examples:

  • "According to the docs, the rate limit is 10000 rps, right?" (false premise embedded in query)
  • "Explain the X feature mentioned on page 47 of the API guide" (specific but fabricated reference)
  • "What did the CEO say about the layoffs last week?" (leading + out of scope)

Metric: hallucination rate on false premises. Target: under 20 percent (adversarial is harder than simple out-of-scope).

How do you measure unsupported claim rate?

Run the fact-checker from the Fact-checking RAG answers post on every test answer. For each answer, extract claims and verify each against the retrieved context. Unsupported claim rate is the fraction of answers with at least one unsupported claim.
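fact_check takes the answer and its retrieved context and returns one verdict per extracted claim. The real implementation from that post uses an LLM judge; the lexical-overlap stand-in below only sketches the assumed interface (split the answer into sentence-level claims, mark each supported or not), and the 0.7 threshold is an arbitrary illustration:

```python
# Sketch of the fact_check interface assumed by the test suite below.
# Real version: an LLM extracts claims and judges support against the context.
# This lexical-overlap stand-in exists only to pin down the input/output shape.
import re


def fact_check(answer: str, context: str) -> list[dict]:
    """Return one verdict per claim: {'claim': str, 'supported': bool}."""
    claims = [s.strip() for s in re.split(r'(?<=[.!?])\s+', answer) if s.strip()]
    context_words = set(re.findall(r'\w+', context.lower()))
    verdicts = []
    for claim in claims:
        claim_words = set(re.findall(r'\w+', claim.lower()))
        overlap = len(claim_words & context_words) / max(len(claim_words), 1)
        verdicts.append({'claim': claim, 'supported': overlap >= 0.7})
    return verdicts
```
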

# filename: tests/hallucination/test_rag.py
# description: Hallucination test suite for a RAG pipeline.
import json
import pytest
from app.rag.pipeline import rag_answer
from app.rag.fact_check import fact_check


def load_set(name: str) -> list[dict]:
    with open(f'tests/hallucination/fixtures/{name}.json', encoding='utf-8') as f:
        return json.load(f)


@pytest.mark.nightly
@pytest.mark.asyncio
@pytest.mark.parametrize('set_name,max_unsupported', [
    ('in_scope', 0.05),       # under 5 percent
    ('out_of_scope', 0.10),   # under 10 percent (most should refuse)
    ('adversarial', 0.20),    # under 20 percent
])
async def test_unsupported_claim_rate(set_name, max_unsupported):
    test_set = load_set(set_name)
    unsupported_count = 0
    for item in test_set:
        result = await rag_answer(item['question'])
        verdicts = fact_check(result['answer'], result['context'])
        if any(not v['supported'] for v in verdicts):
            unsupported_count += 1
    rate = unsupported_count / len(test_set)
    assert rate <= max_unsupported, f'{set_name} unsupported rate: {rate:.2%}'

Run this nightly. A regression on any set fails the test and triggers an alert.
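load_set above assumes each fixture file is a flat JSON array of objects with at least a question key (plus an expected answer for the in-scope set). The exact schema is yours to choose; the entries below are hypothetical examples of a minimal shape:

```python
import json

# Hypothetical fixture entries -- match the schema to your own load_set.
in_scope_fixture = [
    {"question": "What is the default API rate limit?",
     "expected_answer": "100 requests per minute"},
]
out_of_scope_fixture = [
    {"question": "Who won the World Cup last year?"},  # no answer in corpus
]

# Fixtures round-trip through JSON exactly as load_set reads them back.
serialized = json.dumps(in_scope_fixture, indent=2)
restored = json.loads(serialized)
```
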

What are the 3 common hallucination modes?

  1. Filling confidently on out-of-scope. The model does not recognize that the context does not contain the answer and fabricates something plausible. Fix: stricter grounding prompt + fact-check post-pass.

  2. Accepting false premises. The user says "According to the docs, X is 10000 rps" when the docs say no such thing. The model parrots back the false claim as if confirmed. Fix: instruct the model to verify claims in the query against the context before answering.

  3. Over-interpretation. The context says "the API supports rate limiting." The model answers "the default rate limit is 100 requests per minute" because it made up the specific number. Fix: fact-check with strict "directly stated" criterion, reject inference beyond what is written.
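The "stricter grounding prompt" fix for modes 1 and 2 can be as simple as a system prompt that names the refusal behavior and the false-premise check explicitly. The wording below is a starting-point sketch, not a benchmark-tested prompt; tune it against your own test sets:

```python
# Starting-point grounding prompt -- the wording is an assumption to tune,
# not a tested recommendation.
GROUNDING_PROMPT = """\
Answer using ONLY the provided context.
- If the context does not contain the answer, reply exactly:
  "I don't know based on the provided documents."
- Do not use outside knowledge or guess specific numbers, names, or dates.
- If the question asserts a fact (e.g. "the docs say X"), check it against
  the context first; if the context does not state it, say so instead of
  confirming it.

Context:
{context}

Question:
{question}
"""

prompt = GROUNDING_PROMPT.format(context="<retrieved chunks>", question="<user query>")
```

The exact refusal string matters: a fixed phrase makes refusals trivially detectable in your refusal-rate metric.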

For the prompt engineering that reduces hallucination, see the JSON output parsing for RAG: grounding with Pydantic post.

How do you build the adversarial set?

Write the questions by hand, each with a specific failure mode in mind. 4 templates to start:

  1. False premise: "According to the docs, [made-up fact]. Can you explain more?"
  2. Fabricated reference: "On page X, it mentions [fabricated specific detail]. What does this mean?"
  3. Leading question: "Why is the rate limit so low? Don't the docs say [made-up number]?"
  4. Out-of-scope but plausible: A question about a feature that sounds like your product but is not.

5 questions per template, 20 total. Review the answers manually for the first few runs to calibrate what counts as hallucination vs reasonable interpretation.
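Template filling can be mechanized so the set stays easy to extend. The seed facts below are deliberately fabricated (that is the point of the set), and every specific value is hypothetical:

```python
# Generate adversarial questions from templates x fabricated seed facts.
# All seeds are intentionally false or made up -- that is what the set tests.
TEMPLATES = {
    "false_premise": "According to the docs, {fact}. Can you explain more?",
    "fabricated_ref": "On page {page}, the guide mentions {fact}. What does this mean?",
    "leading": "Why is this so limited? Don't the docs say {fact}?",
}

SEEDS = [
    {"fact": "the rate limit is 10000 rps", "page": 47},
    {"fact": "the SLA guarantees 100% uptime", "page": 12},
]


def build_adversarial_set(templates: dict, seeds: list[dict]) -> list[dict]:
    """Cross every template with every seed, tagging each question with its mode."""
    return [
        {"question": tmpl.format(**seed), "mode": mode}
        for mode, tmpl in templates.items()
        for seed in seeds
    ]


questions = build_adversarial_set(TEMPLATES, SEEDS)
```

Keeping the mode tag on each question lets you break down the hallucination rate per failure mode, not just per set.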

What to do Monday morning

  1. Build the 3 test sets: in-scope (50), out-of-scope (50), adversarial (20). The in-scope and out-of-scope sets can come from real logs; the adversarial set is hand-written.
  2. Add the fact-check post-pass to your RAG pipeline if you do not have one already.
  3. Write the test_unsupported_claim_rate test with the 3 parametrized sets. Mark nightly.
  4. Run it. Record baseline rates. If in-scope is over 5 percent or out-of-scope is over 10 percent, the fact-checker is too loose or the pipeline is hallucinating heavily.
  5. Wire the test into a nightly CI job. Alert on any set's rate rising above its threshold.
  6. Add new adversarial questions every time you discover a new hallucination in production. The set grows over time and becomes the regression defense.

The headline: hallucination testing is 3 test sets and one metric. In-scope, out-of-scope, adversarial. Run nightly. Alert on drift. Catches the regressions that correctness-only eval misses entirely.

Frequently asked questions

Why does my RAG pipeline hallucinate on out-of-scope queries?

Because LLMs are trained to be helpful and confident, and a strict "answer only from context" instruction competes with that training. When the retrieved context does not contain the answer, the model often fills the gap with a plausible-sounding response instead of refusing. Systematic out-of-scope testing surfaces this and makes it measurable.

What is an adversarial hallucination test set?

A set of hand-written questions designed to trip grounding. Examples include false premises embedded in the query, fabricated specific references, leading questions, and plausible-sounding out-of-scope topics. The set is smaller than the in-scope set (15-25 questions) but each one targets a specific failure mode.

What metric should I track for hallucination?

Unsupported claim rate per test set. For each answer, extract claims with an LLM and verify each against the retrieved context. The fraction of answers with at least one unsupported claim is the unsupported claim rate. Target: under 5 percent on in-scope, under 10 percent on out-of-scope, under 20 percent on adversarial.

How often should hallucination tests run?

Nightly, alongside your other eval tests. Each run costs 100+ fact-check LLM calls, which adds up but catches regressions within 24 hours of a bad merge. If your budget is tight, run the in-scope set on every PR and the full 3-set suite nightly.

How do I build the out-of-scope set?

Sample real production queries that your pipeline currently refuses or fails on. Also add synthetic questions about topics clearly outside your corpus (sports, weather, the user's personal life if your corpus is engineering docs). 50 queries total, half from real logs and half synthetic.

Key takeaways

  1. Hallucinations are semantic, not syntactic. Ad-hoc testing misses them because the output looks correct. Systematic testing surfaces them.
  2. Build 3 test sets: in-scope (50 with known answers), out-of-scope (50 with no answer), adversarial (20 designed to trip grounding).
  3. Metric: unsupported claim rate per set. Use a fact-checker to extract claims and verify each against the retrieved context.
  4. Target rates: under 5 percent in-scope, under 10 percent out-of-scope, under 20 percent adversarial. Tighter is better.
  5. Run nightly in CI. Alert on any rate rising above its threshold. Add new adversarial questions after every production hallucination incident.
  6. To see hallucination testing wired into a full production RAG pipeline with evaluation and observability, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the Ragas framework that operationalizes hallucination and faithfulness metrics in production, see the Ragas documentation on faithfulness.
