Fact-checking RAG answers: grounding with verification
Your RAG answer sounds confident and is partly made up
Your RAG pipeline retrieved the right chunks. The LLM wrote a fluent, confident answer. A user points out that one specific fact in the answer is nowhere in the retrieved chunks. You check: the LLM made it up. The retrieval was fine, the prompt said "answer only from the context," and the model still hallucinated because it mixed context with prior training data.
The fix is a second LLM pass that fact-checks the answer against the retrieved context. The checker reads the draft answer claim by claim and verifies each one is supported. Unsupported claims get flagged. The agent either rejects the answer, retries with stricter instructions, or returns the answer with the unsupported claim removed.
This post is the fact-checking pattern for RAG: the checker prompt, the claim-extraction step, the rejection rule that stops bad answers before they ship, and the 40-60 percent hallucination reduction you can measure.
Why does a grounded prompt not prevent hallucination?
Because the model is trained to be helpful and confident, and a strict "answer only from context" instruction competes with that training. There are 3 specific failure modes even with a grounded prompt:
- Blending. The model takes 80 percent from context and 20 percent from prior knowledge. The 20 percent sounds like it came from the context but did not.
- Inference overreach. The context mentions A and B. The model infers C, which is not stated but feels implied. Sometimes C is correct, sometimes not.
- Confident filling. When the context is missing a detail, the model fills the gap with a plausible-sounding guess instead of saying "I don't know."
A second LLM pass that checks each claim against the context catches all 3 failure modes because the verifier has one job: is this specific claim supported by the retrieved text?
graph LR
Query[Query] --> Retrieve[Retrieve chunks]
Retrieve --> Generate[Generate draft answer]
Generate --> Extract[Extract claims]
Extract --> Verify[Verify each claim vs context]
Verify -->|all supported| Return[Return answer]
Verify -->|some unsupported| Retry[Retry with stricter prompt]
Verify -->|still unsupported| Refuse["Refuse and say I don't know"]
style Verify fill:#dbeafe,stroke:#1e40af
style Return fill:#dcfce7,stroke:#15803d
style Refuse fill:#fef3c7,stroke:#b45309
What does the fact-checker prompt look like?
Two-stage. First, extract discrete factual claims from the draft. Second, verify each claim against the retrieved context.
# filename: app/rag/fact_check.py
# description: Fact-checker for RAG answers. Extracts claims and verifies each.
from anthropic import Anthropic
import json
client = Anthropic()
EXTRACT_PROMPT = """Extract every discrete factual claim from the following answer. A claim is a statement of fact that could be true or false. Ignore phrasing, tone, and opinions.
Answer: {answer}
Output ONLY JSON:
{{"claims": ["claim 1", "claim 2"]}}
"""
VERIFY_PROMPT = """For each claim, decide if it is supported by the context below. A claim is "supported" only if the context explicitly states it or directly implies it. Inference based on prior knowledge does NOT count.
Context:
{context}
Claims:
{claims}
Output ONLY JSON:
{{"verdicts": [{{"claim": "...", "supported": true | false, "reason": "..."}}]}}
"""
def fact_check(answer: str, context: str) -> list[dict]:
    extract = client.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=500,
        messages=[{'role': 'user', 'content': EXTRACT_PROMPT.format(answer=answer)}],
    )
    claims = json.loads(extract.content[0].text.strip())['claims']
    verify = client.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=800,
        messages=[{'role': 'user', 'content': VERIFY_PROMPT.format(
            context=context,
            claims='\n'.join(f'- {c}' for c in claims),
        )}],
    )
    return json.loads(verify.content[0].text.strip())['verdicts']
The 2-stage design matters. Extracting claims first forces the checker to think about discrete facts, not the overall impression of the answer. A single-stage checker tends to rubber-stamp plausible-sounding answers.
For the structured-output grounding pattern that pairs with this, see the JSON output parsing for RAG: grounding with Pydantic post.
What is the rejection rule?
If any claim is unsupported, the pipeline has 3 response options.
- Strict: refuse the answer and tell the user "I don't know based on the available information."
- Retry: send the query back through the LLM with a stricter prompt: "The previous answer contained unsupported claims. Answer again using ONLY facts directly stated in the context."
- Filter: return the answer with unsupported claims removed.
Pick based on your product tolerance for both hallucination and "I don't know" responses. High-stakes domains (legal, medical, financial) should go strict. Casual conversational agents can afford to retry or filter.
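The filter option can be sketched as a small post-processing step. This is an illustrative sketch, not code from the pipeline above: the `filter_answer` helper and its word-overlap heuristic for matching sentences to claims are assumptions, and a production system would use a proper sentence splitter.

```python
# Hypothetical sketch of the "filter" option: drop sentences that carry
# an unsupported claim, keep the rest of the answer.
def filter_answer(answer: str, verdicts: list[dict]) -> str:
    # Naive sentence split; a production system would use a real tokenizer.
    sentences = [s.strip().rstrip('.') for s in answer.split('. ') if s.strip()]
    bad_claims = [v['claim'] for v in verdicts if not v['supported']]

    def carries_bad_claim(sentence: str) -> bool:
        # Crude overlap heuristic: drop a sentence if it shares most of
        # its words with an unsupported claim.
        words = set(sentence.lower().split())
        for claim in bad_claims:
            claim_words = set(claim.lower().split())
            if claim_words and len(words & claim_words) / len(claim_words) > 0.6:
                return True
        return False

    kept = [s for s in sentences if not carries_bad_claim(s)]
    return '. '.join(kept) + ('.' if kept else '')

verdicts = [
    {'claim': 'The plan costs $10/month', 'supported': True},
    {'claim': 'The plan includes SSO', 'supported': False},
]
print(filter_answer('The plan costs $10/month. The plan includes SSO.', verdicts))
# → The plan costs $10/month.
```

Filtering can leave an answer that reads oddly, with abrupt gaps where claims were removed, which is another reason retry-then-refuse is the safer default.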
How do you integrate fact-checking into the pipeline?
Add it as a post-generation step. The LLM generates a draft, the checker verifies, the pipeline either returns the answer or triggers the chosen response.
# filename: app/rag/pipeline.py
# description: RAG pipeline with post-generation fact-checking.
async def rag_with_fact_check(query: str, retriever, llm) -> str:
    chunks = await retriever.search(query, k=5)
    context = '\n\n'.join(c.content for c in chunks)
    draft = await llm.generate(query, context)

    verdicts = fact_check(draft, context)
    unsupported = [v for v in verdicts if not v['supported']]
    if not unsupported:
        return draft

    # Retry once with a stricter prompt
    retry = await llm.generate(
        query,
        context,
        extra_instructions='ONLY use facts directly stated in the context. If the answer is not in the context, say "I cannot answer from the available information."',
    )
    retry_verdicts = fact_check(retry, context)
    if all(v['supported'] for v in retry_verdicts):
        return retry
    return 'I cannot confidently answer this from the available information.'
One retry is usually enough. If the retry still contains unsupported claims, the context genuinely does not contain the answer and the system should refuse.
What does the cost and quality trade-off look like?
Fact-checking adds 2 extra Haiku calls per query (extract + verify), about 800 ms of latency, and $0.001-0.003 in cost. The quality improvement on hallucination-prone questions:
- Without fact-checking: 18 percent of answers contained at least one unsupported claim (on an internal eval set)
- With fact-checking: 7 percent contained unsupported claims (and most were flagged with "I don't know")
That is a 60 percent reduction in hallucinated claims. Users notice.
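The before/after number is simple to compute from the verifier's own output: the fraction of answers with at least one unsupported claim. A minimal sketch, with illustrative verdict data standing in for a real eval run:

```python
# Fraction of answers that contain at least one unsupported claim.
# Takes one verdict list per answer, as returned by the fact_check step.
def unsupported_rate(all_verdicts: list[list[dict]]) -> float:
    bad = sum(
        1 for verdicts in all_verdicts
        if any(not v['supported'] for v in verdicts)
    )
    return bad / len(all_verdicts)

# Illustrative eval run: 5 answers, verdicts per answer.
runs = [
    [{'supported': True}, {'supported': True}],
    [{'supported': True}],
    [{'supported': False}],
    [{'supported': True}, {'supported': False}],
    [{'supported': True}],
]
print(f'{unsupported_rate(runs):.0%}')  # → 40%
```

Run the same computation on the pipeline with and without the checker to get the before/after comparison.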
For the broader evaluation pipeline that measures this quality difference, see the Automated evaluation pipelines for agentic systems post.
What to do Monday morning
- Identify 20 queries from production logs where users reported factual errors. This is your eval set for the fact-checker.
- Add the 2-stage fact-checker (extract + verify) after your generation step. Use Haiku for both stages.
- Decide your rejection rule: strict (refuse), retry (one more attempt), or filter (remove bad claims). Start with retry-then-refuse for most production systems.
- Measure before and after on the 20 test queries. Expect a 40-60 percent reduction in unsupported claims.
- Add a metric for "refused answers" and alert if it exceeds 10 percent of traffic. A high refusal rate means your retriever is missing context.
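The refusal-rate alert in the last step can be sketched as a rolling-window counter. The 10 percent threshold comes from the post; the `RefusalMonitor` class, window size, and warm-up minimum are illustrative assumptions:

```python
from collections import deque

# Sketch of a refusal-rate guardrail over a rolling window of recent
# queries. Alert when refusals exceed the threshold, which usually
# means the retriever is missing context rather than the checker
# being too strict.
class RefusalMonitor:
    def __init__(self, window: int = 500, threshold: float = 0.10):
        self.outcomes = deque(maxlen=window)  # True = refused
        self.threshold = threshold

    def record(self, refused: bool) -> None:
        self.outcomes.append(refused)

    def should_alert(self) -> bool:
        if len(self.outcomes) < 50:  # wait for enough traffic
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold

monitor = RefusalMonitor()
for _ in range(100):
    monitor.record(False)
for _ in range(20):
    monitor.record(True)
print(monitor.should_alert())  # 20/120 ≈ 17% > 10% → True
```

In production you would wire `should_alert` to whatever paging or dashboard system you already run; the window keeps the metric responsive to recent traffic rather than lifetime averages.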
The headline: fact-checking RAG answers is 60 lines of Haiku calls after the generation step. Claim extraction, per-claim verification, retry-or-refuse. Cuts hallucinations by 60 percent for $0.002 per query.
Frequently asked questions
Why do RAG answers still hallucinate even with an "answer only from context" prompt?
Because LLMs are trained to be helpful and confident, and strict grounding instructions compete with that training. The model blends context with prior knowledge, overreaches on inference, or confidently fills missing details. A second LLM pass that checks each claim against the context catches all three failure modes.
What is the 2-stage fact-checking pattern?
First stage extracts discrete factual claims from the draft answer. Second stage verifies each claim against the retrieved context. Splitting this into two stages forces the checker to think about individual facts, not the overall impression of the answer. Single-stage checkers tend to rubber-stamp plausible-sounding answers.
How much latency does fact-checking add?
About 800 ms for both stages on Haiku. Extract takes 300-400 ms, verify takes 400-500 ms. The pipeline total goes from ~6 seconds to ~7 seconds, which is usually acceptable. The quality improvement on hallucination-prone questions (40-60 percent fewer unsupported claims) is worth the extra second.
What do I do when a claim is unsupported?
Three options. Strict: refuse with "I don't know from the available information." Retry: send the query back with a stricter prompt. Filter: return the answer with the unsupported claim removed. Retry-then-refuse is the default for most production systems; strict is right for high-stakes domains.
How do I measure fact-checking quality?
Label 50-100 queries where you know whether the answer should be derivable from the retrieved context. Run the pipeline with and without fact-checking. Compare the fraction of answers that contain unsupported claims. A well-tuned fact-checker cuts this by 40-60 percent compared to prompt-only grounding.
Key takeaways
- Grounded prompts alone do not prevent hallucination because LLMs blend context with prior knowledge. A second LLM pass catches the blending.
- Use a 2-stage fact-checker: extract discrete claims, then verify each claim against the context. Single-stage checkers rubber-stamp plausible answers.
- Use Haiku for both stages. The task is simple and does not need the flagship model. Total overhead is ~800 ms per query.
- Pick a rejection rule based on stakes: strict refusal for high-stakes domains, retry-then-refuse for most production systems.
- Typical improvement: 40-60 percent reduction in hallucinated claims on an eval set of known-hallucination-prone queries.
- To see fact-checking wired into a full production RAG pipeline with evaluation and observability, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the research background on LLM hallucination and verification, see Lewis et al. on Retrieval-Augmented Generation and the follow-up work on grounded generation.