Your RAG answers look confident and are half wrong

You ask your RAG system "which 2 services were affected by the outage and why?" It returns a confident paragraph naming 2 services and a plausible cause. Later you find out one of the services was not affected and the cause was something completely different. The answer looked right, the retrieval looked right, and the model synthesized them into a confident lie.

This is what happens when you ask an LLM to answer a multi-step question in one shot. It cannot show its work, so it cannot check its own reasoning, so it confidently combines partial evidence into a wrong answer. Chain-of-thought (CoT) reasoning fixes this by forcing the model to write out its intermediate steps before committing to an answer. The steps are the checkable audit trail that catches errors before they compound into a wrong final answer.

This post is the CoT prompt pattern for RAG, the parser that extracts the reasoning from the final answer, the cases where CoT is worth the extra tokens, and the guardrails that prevent the reasoning from becoming its own source of hallucination.

Why does single-shot answering fail on multi-step questions?

Because the model has to do too many things at once. For a question like "which 2 services were affected and why," a single-shot prompt asks the model to simultaneously:

  1. Identify which services are mentioned in the retrieved context.
  2. Decide which of those were affected.
  3. Find the cause for each.
  4. Format the final 2-part answer.

Every one of those steps is a chance to be wrong. A model that writes the answer in one pass cannot correct itself once it starts, because the output is linear and commits to a framing before the evidence is fully weighed.

CoT inserts explicit intermediate steps. The model writes "let me think through this: the context mentions services A, B, C, and D. A and B both had spike events in the logs. C had a config change but no observable impact. D was not mentioned at all. So the affected services are A and B, and the cause is..." The intermediate reasoning is the correction point. The final answer is produced after the reasoning, not instead of it.

graph LR
    Q[Multi-step question] --> Single[Single-shot prompt]
    Single -->|confident paragraph| Wrong[Often wrong, looks right]

    Q --> CoT[Chain-of-thought prompt]
    CoT --> Think[Step-by-step reasoning]
    Think --> Final[Final answer]
    Final --> Right[More often correct]

    style Wrong fill:#fee2e2,stroke:#b91c1c
    style Right fill:#dcfce7,stroke:#15803d

The reasoning steps are expensive (more tokens) but they are cheap compared to the cost of a confidently wrong answer in a production system.

What does a CoT RAG prompt look like?

A prompt that explicitly asks for intermediate steps before the final answer, with a format that a parser can reliably extract. The key is making the reasoning and the answer distinguishable so a downstream process can use the answer without parsing the reasoning.

# filename: cot_rag.py
# description: A chain-of-thought RAG prompt that separates reasoning
# and final answer for reliable parsing.
import json
from anthropic import Anthropic

client = Anthropic()

COT_PROMPT = '''Answer the question using only the retrieved context.
Before writing the final answer, write out your reasoning step by step.
Output JSON only, with exactly these keys:
{{"reasoning": "...", "answer": "..."}}

The reasoning field should contain your intermediate thinking.
The answer field should contain only the final answer, with no reasoning.

Context:
{context}

Question: {question}'''


def answer_with_cot(question: str, context: str) -> dict:
    reply = client.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=1024,
        messages=[{'role': 'user', 'content': COT_PROMPT.format(
            context=context, question=question,
        )}],
    )
    text = reply.content[0].text.strip()
    # Models occasionally wrap JSON in markdown fences despite the instruction.
    if text.startswith('```'):
        text = text.split('\n', 1)[1].rsplit('```', 1)[0]
    return json.loads(text)

The JSON format is not cosmetic. It lets downstream processes use result['answer'] for the user-facing response and result['reasoning'] for audit logs, without fragile regex parsing. The model treats the 2 fields as separate outputs, which reduces the chance of bleed-through between reasoning and final answer.

How do you use the reasoning without showing it to the user?

Log it, do not display it. The reasoning is your audit trail, not your UI. Users generally do not want to read "let me think through this step by step" paragraphs; they want the answer. Save the reasoning for debugging, eval, and incident response.

The exception: when the user explicitly asks for explanations ("walk me through how you got this") or when the question is complex enough that citing the evidence is part of the value. In those cases, show the reasoning as a collapsible detail under the answer.

3 uses for logged reasoning beyond the immediate response:

  1. Debugging wrong answers. When a user flags an answer as incorrect, the reasoning tells you whether the model misread the context, missed a step, or had the right idea and formatted the final answer poorly. Different failure modes have different fixes.
  2. Retrieval evaluation. If the reasoning says "the context does not mention service C," you know the retrieval missed a relevant document even if the final answer happens to be right. This is signal the final answer alone cannot give you.
  3. Training data. Correct reasoning traces are high-quality examples for fine-tuning or prompt-improvement iteration.
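The log-it-do-not-display-it split can be a one-function seam. This is a hedged sketch (the `respond` helper, logger name, and trace-id scheme are all illustrative choices, not a prescribed API):

```python
import json
import logging
import uuid

logger = logging.getLogger("rag.audit")  # illustrative logger name


def respond(result: dict, question: str) -> str:
    """Log the full reasoning trace for audit; hand the user only the answer."""
    logger.info(json.dumps({
        "trace_id": str(uuid.uuid4()),
        "question": question,
        "reasoning": result["reasoning"],  # audit trail, never shown to the user
        "answer": result["answer"],
    }))
    return result["answer"]  # user-facing response only
```

The point of the seam is that the reasoning never reaches the response path by accident: everything after this function only ever sees the answer string.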

When is chain-of-thought worth the extra tokens?

3 question types where CoT pays for itself, and 3 where it does not.

CoT wins on:

  1. Multi-step questions. "Which 2 services and why" requires separate identification and cause steps. CoT catches the cases where the model would otherwise conflate them.
  2. Comparison questions. "How does A differ from B" needs parallel analysis of 2 sources. CoT forces the model to look at both sides before synthesizing.
  3. Questions with distractors. When the retrieved context contains 3 relevant and 2 irrelevant chunks, CoT forces the model to explicitly exclude the distractors instead of averaging them into a muddled answer.

CoT does not help much on:

  1. Single-fact lookup. "What is the default rate limit" is a direct read. The model can pull the number straight from context. Extra reasoning just costs tokens.
  2. Yes/no questions with unambiguous evidence. "Is logging enabled in prod" either has a clear answer in the context or it does not. CoT adds no value.
  3. High-volume FAQ traffic. If you are answering millions of simple lookup questions per day, the cost of CoT on every call is real. Route complex questions through CoT and easy ones through a straight prompt.

A classifier up front picks which path each question takes. This is how most production RAG systems use CoT: selectively, not universally.
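As a sketch of that routing, here is a deliberately cheap heuristic classifier; in production this would more likely be a small model call or a trained classifier, and the marker list here is a made-up illustration:

```python
# Illustrative markers for multi-step, comparison, and causal questions.
COMPLEX_MARKERS = ("why", "compare", "difference", "which", "how does")


def route(question: str) -> str:
    """Return 'cot' for questions that benefit from reasoning, else 'direct'."""
    q = question.lower()
    if any(marker in q for marker in COMPLEX_MARKERS):
        return "cot"     # multi-step or comparison: pay for the reasoning
    return "direct"      # single-fact lookup: cheap single-shot prompt


route("Which 2 services were affected and why?")  # -> "cot"
route("What is the default rate limit?")          # -> "direct"
```

A keyword heuristic misroutes some edge cases, but even a crude split keeps CoT spend off the bulk of simple lookup traffic.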

How does CoT interact with grounding and citation?

CoT without grounding is a risk. The reasoning step can introduce its own hallucinations if the model extrapolates beyond the retrieved context. The fix is to require that every claim in the reasoning be attributable to a specific chunk.

The stronger pattern pairs CoT with a grounded output schema: the model writes reasoning, then writes an answer, then cites the chunks each part of the reasoning relied on. The citations become a substring check against the retrieved context, which catches the cases where the reasoning drifted.

# filename: grounded_cot.py
# description: CoT with explicit citations that can be validated
# against the retrieved chunks after generation.
from pydantic import BaseModel


class GroundedCoTAnswer(BaseModel):
    reasoning: str
    answer: str
    citations: list[str]  # verbatim substrings from the context


def validate(result: GroundedCoTAnswer, context: str) -> bool:
    return all(c.strip() in context for c in result.citations)

If validate returns False, the model hallucinated a citation. Retry with a stronger prompt or fall back to an explicit "I cannot answer from the retrieved context" response. This validation is the single biggest safety lever when combining CoT with RAG.
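The retry-then-fallback loop around that check can look like the following sketch. The `GroundedAnswer` dataclass mirrors the Pydantic model above, and the injected `generate` callable stands in for the actual model call, which is outside this snippet:

```python
from dataclasses import dataclass


@dataclass
class GroundedAnswer:
    reasoning: str
    answer: str
    citations: list  # verbatim substrings from the context


FALLBACK = "I cannot answer from the retrieved context."


def validate_citations(result: GroundedAnswer, context: str) -> bool:
    # Same substring check as validate() above.
    return all(c.strip() in context for c in result.citations)


def answer_or_fallback(generate, question: str, context: str, retries: int = 2) -> str:
    """generate(question, context) is the model call, injected for testability."""
    for _ in range(retries + 1):
        result = generate(question, context)
        if validate_citations(result, context):
            return result.answer
        # A failed check means a hallucinated citation; loop retries the call.
    return FALLBACK
```

Bounding the retries matters: a model that keeps hallucinating citations should degrade to the explicit fallback, not burn tokens forever.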

For the full grounding pattern with Pydantic, see the JSON Output Parsing for RAG: Grounding with Pydantic post. For the overall production RAG picture, the Agentic RAG Masterclass covers CoT alongside reranking and self-correction.

How do you avoid runaway reasoning?

Cap the reasoning length. A model that is told to "think step by step" with no budget will sometimes write 500 tokens of reasoning for a 20-token question. Cap with explicit instructions in the prompt ("use at most 3 reasoning steps") or by truncating the model's output at a fixed token count before returning.

The rule: reasoning should be 1 to 3 times the length of the final answer, not 10 times. A 50-word answer with 500 words of reasoning is either a sign that the model is rambling or a sign that the question is genuinely complex. Either way, cap the budget and force the model to be terse.
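That ratio rule is easy to enforce as a post-hoc check on each trace. A minimal sketch, using word counts as a rough token proxy (the function name and 3x default are illustrative, tune the ratio for your traffic):

```python
def reasoning_within_budget(result: dict, max_ratio: float = 3.0) -> bool:
    """Flag traces where the reasoning dwarfs the answer it justifies."""
    reasoning_len = len(result["reasoning"].split())
    answer_len = max(len(result["answer"].split()), 1)  # avoid divide-by-zero
    return reasoning_len <= max_ratio * answer_len
```

Traces that fail the check are exactly the ones worth sampling in eval: either the model is rambling or the question genuinely needed more steps than the prompt budget allows.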

What to do Monday morning

  1. Classify your RAG eval set by question complexity. Mark each as single-fact, comparison, multi-step, or distractor-heavy.
  2. Run the CoT prompt on the 3 complex categories. Compare accuracy to your current single-shot prompt. Expect a 10 to 20 point lift on those questions and roughly 0 on simple ones.
  3. Add a question classifier in front of your RAG pipeline to route complex questions to CoT and simple ones to a direct prompt. Keep CoT on the 20 percent of traffic that benefits from it.
  4. Pair CoT with grounded citations. Validate citations as substrings of the retrieved context. Reject hallucinated citations before they reach the user.
  5. Cap the reasoning length in the prompt. "Use at most 3 reasoning steps" prevents the model from rambling and keeps token costs bounded.

The headline: chain-of-thought reasoning in RAG is a prompt change plus a JSON parser. It lifts accuracy on multi-step questions by 10 to 20 points and does nothing for simple lookups. Route wisely and the extra cost is worth it on every call that needs it.

Frequently asked questions

What is chain-of-thought reasoning in RAG?

It is a prompting technique where the model writes out intermediate reasoning steps before producing a final answer. In a RAG context, the reasoning explains how the model used the retrieved chunks to arrive at the answer. This forces the model to process multi-step questions deliberately instead of guessing a direct answer, which improves accuracy on complex queries.

When should I use CoT in a RAG pipeline?

For multi-step questions, comparison questions, and questions with distracting retrieved chunks. Skip it for single-fact lookups, simple yes/no questions, and high-volume FAQ traffic where the extra tokens are wasted. A classifier up front can route complex questions through CoT and keep easy ones on a direct prompt, which is how most production systems use it.

How do I parse CoT output reliably?

Ask the model for JSON with separate reasoning and answer fields. JSON parsing is more reliable than regex on free-form text, and the separation lets downstream code use the answer for the user-facing response and the reasoning for audit logs. Validate the JSON shape with Pydantic to catch malformed responses.

Should I show the reasoning to the user?

Usually no. Log it for debugging and eval but show only the final answer in the UI. Exceptions: when the user explicitly asks for explanations, when the question is complex enough that citing evidence is part of the value, or when you want to build trust by showing the work. A collapsible detail under the answer is a good compromise.

How do I prevent CoT from hallucinating its own reasoning?

Pair CoT with grounded citations. Require the model to cite verbatim substrings from the retrieved context for each step of its reasoning. Validate the citations as substring checks against the context after generation. If any citation is not a literal substring, the model hallucinated and you should reject the answer or retry.

Key takeaways

  1. Single-shot RAG answers fail on multi-step questions because the model cannot correct itself once it commits to a framing. CoT adds explicit intermediate steps that catch the errors.
  2. A CoT prompt asks for reasoning and final answer as 2 separate JSON fields. The separation enables reliable parsing and clean separation of audit trail from user response.
  3. CoT wins on multi-step, comparison, and distractor-heavy questions. It adds no value on single-fact lookups, so route wisely with a question classifier.
  4. Log the reasoning, show only the answer. Use the reasoning for debugging, eval, and training data, not for the user-facing response.
  5. Pair CoT with grounded citations and substring validation. Unvalidated reasoning can introduce its own hallucinations.
  6. To see CoT wired into a full production RAG stack with reranking, grounding, and self-correction, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the original chain-of-thought paper and its experimental results, see Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. The accuracy lift on multi-step tasks in that paper matches what you see in production RAG when you apply CoT selectively.
