Single-pass RAG is the reason half your hard questions fail

You ask your RAG system "what changed in the auth flow between v3 and v4 and why?" The retriever fetches 5 chunks, all about v4 auth. There is nothing about v3 in the result set, because the query was dominated by "auth flow." The model writes a confident, half-correct answer based on what it got, never realizing it answered only half the question.

This is the failure mode of single-pass RAG. One query, one retrieval, one answer. There is no opportunity to notice that the retrieval came back wrong, no opportunity to ask a different question, no opportunity to recover. For simple lookups, single-pass is fine. For multi-hop questions, comparison questions, or questions that require evidence from several different topics, single-pass fails roughly half the time.

Dynamic RAG is the fix. The pipeline becomes a loop: plan a retrieval, execute it, look at what came back, decide if it is enough, and re-plan a new retrieval if not. This post is the pattern, the prompt, the code, and the rule for when it earns its added cost.

Why does one-shot RAG fail on multi-hop questions?

Because retrieval is dominated by the most prominent terms in the query, and multi-hop questions have more than one prominent topic. The retriever cannot tell which topic to weight. It usually picks one and forgets the other.

graph TD
    Q[User: What changed in auth between v3 and v4 and why?] --> R1[Retriever - one query]
    R1 -->|top 5 chunks| Result1[All v4 auth chunks. Zero v3.]
    Result1 --> Naive[Naive answer about v4 only]

    Q --> Plan[Planner: split into sub-queries]
    Plan -->|q1: v3 auth| R2a[Retrieve v3 chunks]
    Plan -->|q2: v4 auth| R2b[Retrieve v4 chunks]
    Plan -->|q3: changelog v3 to v4| R2c[Retrieve changelog]
    R2a --> Combine[Combined evidence]
    R2b --> Combine
    R2c --> Combine
    Combine --> Good[Complete answer]

    style Naive fill:#fee2e2,stroke:#b91c1c
    style Good fill:#dcfce7,stroke:#15803d

3 categories of question that single-pass cannot handle:

  1. Comparison questions ("X versus Y") need evidence from both sides. A single embedding-based query collapses both into the same vector and the retriever picks whichever side is more represented in your corpus.
  2. Multi-hop questions ("who built X and what is their previous project?") need a first retrieval to find the answer to part one, then a second retrieval informed by that answer.
  3. Vague questions ("how does this work?") need the planner to look at what came back and refine the query into something specific.

Dynamic RAG handles all 3 by adding a planning step at the front and a re-planning check after each retrieval. The first plan might be wrong. The loop fixes it.
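To make the 3 categories concrete, here is what a planner's decomposition might look like for each one. The questions and sub-queries are illustrative, not model output:

```python
# Hypothetical planner decompositions for each question category.
# Sub-queries are illustrative examples, not real model output.
decompositions = {
    "comparison": {
        "question": "Postgres full-text search vs Elasticsearch?",
        "sub_queries": [
            "Postgres full-text search capabilities",
            "Elasticsearch full-text search capabilities",
        ],  # one sub-query per side of the comparison
    },
    "multi_hop": {
        "question": "Who built the ingest service and what did they build before?",
        "sub_queries": [
            "who built the ingest service",
        ],  # the second hop depends on the first answer, so it comes from re-planning
    },
    "vague": {
        "question": "How does this work?",
        "sub_queries": [
            "system architecture overview",
        ],  # refined into something specific after observing what comes back
    },
}
```

Note that the multi-hop and vague cases cannot be fully decomposed up front; they depend on the re-planning check described next.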

What are the 4 states of a dynamic RAG loop?

The same 4 states as an agent loop, applied to retrieval. Plan, Retrieve, Observe, Decide. The decide step is the new piece.

graph LR
    Q[Question] --> Plan[State 1: Plan]
    Plan --> Retrieve[State 2: Retrieve]
    Retrieve --> Observe[State 3: Observe]
    Observe --> Decide{State 4: Decide}
    Decide -->|enough evidence| Answer[Generate answer]
    Decide -->|gap remains| Plan

    style Plan fill:#dbeafe,stroke:#1e40af
    style Decide fill:#fef3c7,stroke:#b45309
    style Answer fill:#dcfce7,stroke:#15803d

State 1, Plan. An LLM looks at the user question and produces a list of sub-queries to run against the retriever. For "what changed in auth between v3 and v4 and why," this might be [v3 auth flow, v4 auth flow, changelog v3 to v4 auth].

State 2, Retrieve. Each sub-query runs through the normal retriever and returns chunks.

State 3, Observe. The runtime aggregates all chunks across all sub-queries into a single evidence pool.

State 4, Decide. A second LLM call looks at the evidence pool and the original question, and decides whether the evidence is enough to answer. If yes, generate the answer. If no, propose new sub-queries that target the gap and loop back to State 2.

The loop terminates when the decide step says "enough evidence" or when the step counter hits its ceiling. Most questions converge in 1 or 2 iterations. The few that need more are the ones single-pass would have failed silently.

How do you build a dynamic RAG loop in 80 lines?

3 small functions: a planner, a re-planner, and a controller. The retriever is whatever you already have.

# filename: dynamic_rag.py
# description: A dynamic RAG loop. Plans sub-queries, retrieves, decides
# whether to re-plan, and stops when evidence is sufficient.
import json
from anthropic import Anthropic
from app.retriever import retrieve  # your existing retriever

client = Anthropic()
MAX_ITERATIONS = 3

PLAN_PROMPT = '''You will receive a user question.
Decompose it into 1 to 4 retrieval sub-queries that, together, would let you
answer the question with high confidence.
Output JSON only: {{"queries": ["sub-query 1", "sub-query 2"]}}.

Question: {question}'''

REPLAN_PROMPT = '''You are deciding whether the evidence below is enough to
answer the question. If it is, set "sufficient" to true and leave "queries" empty.
If a gap remains, set "sufficient" to false and propose 1 to 3 NEW sub-queries
that target the gap. Do not repeat queries you have already run.

Question: {question}
Already-run queries: {previous_queries}
Evidence:
{evidence}

Output JSON only: {{"sufficient": bool, "queries": ["..."]}}.'''


def plan(question: str) -> list[str]:
    reply = client.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=400,
        messages=[{'role': 'user', 'content': PLAN_PROMPT.format(question=question)}],
    )
    return json.loads(reply.content[0].text)['queries']


def replan(question: str, previous: list[str], evidence: str) -> tuple[bool, list[str]]:
    reply = client.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=400,
        messages=[{'role': 'user', 'content': REPLAN_PROMPT.format(
            question=question, previous_queries=previous, evidence=evidence,
        )}],
    )
    parsed = json.loads(reply.content[0].text)
    return parsed['sufficient'], parsed.get('queries', [])

The 2 prompts are doing all the work. The plan prompt teaches decomposition. The replan prompt teaches the model to look at what came back and notice gaps. Both are short. Both are JSON-forced so the runtime can act on them deterministically.
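One caveat: the bare json.loads calls above assume the model returns raw JSON. Models sometimes wrap JSON in markdown fences despite the instruction, so a tolerant parser is a cheap safeguard. A sketch (parse_json_reply is my addition, not part of the pipeline above):

```python
import json


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as JSON, stripping optional markdown code fences."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (with its optional language tag)
        # and everything from the closing fence onward.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)
```

Swap it in for the json.loads calls in plan and replan and the loop stops failing on cosmetic formatting.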

The controller ties the loop together:

# filename: dynamic_rag_controller.py
# description: The loop that calls plan, retrieve, replan until evidence
# is sufficient or the iteration ceiling is hit.
def dynamic_rag(question: str) -> list[str]:
    queries = plan(question)
    evidence_pool = []
    all_queries = list(queries)

    for _ in range(MAX_ITERATIONS):
        for q in queries:
            evidence_pool.extend(retrieve(q))

        evidence_text = '\n---\n'.join(evidence_pool[-20:])  # cap context
        sufficient, new_queries = replan(question, all_queries, evidence_text)

        if sufficient or not new_queries:
            return evidence_pool

        queries = new_queries
        all_queries.extend(new_queries)

    return evidence_pool

That is the entire loop. The controller is 15 lines and most agentic RAG papers boil down to that plus a smarter replanner prompt.
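You can sanity-check the control flow without an API key by stubbing the planner, replanner, and retriever. This sketch restates the controller verbatim and wires in stubs that force one re-plan before declaring the evidence sufficient:

```python
# Exercise the controller's loop logic with stubbed plan/replan/retrieve.
MAX_ITERATIONS = 3
calls = {"replan": 0}


def retrieve(q: str) -> list[str]:
    return [f"chunk about {q}"]


def plan(question: str) -> list[str]:
    return ["v3 auth flow", "v4 auth flow"]


def replan(question, previous, evidence):
    calls["replan"] += 1
    if calls["replan"] == 1:
        return False, ["changelog v3 to v4 auth"]  # first pass: gap remains
    return True, []  # second pass: evidence is sufficient


def dynamic_rag(question: str) -> list[str]:
    queries = plan(question)
    evidence_pool = []
    all_queries = list(queries)

    for _ in range(MAX_ITERATIONS):
        for q in queries:
            evidence_pool.extend(retrieve(q))

        evidence_text = "\n---\n".join(evidence_pool[-20:])  # cap context
        sufficient, new_queries = replan(question, all_queries, evidence_text)

        if sufficient or not new_queries:
            return evidence_pool

        queries = new_queries
        all_queries.extend(new_queries)

    return evidence_pool


evidence = dynamic_rag("what changed in auth between v3 and v4?")
# 2 chunks from the first pass plus 1 from the re-planned query
```

The stubs make the loop's two exit paths visible: it runs a second iteration only because the first replan reported a gap, and it stops the moment the replanner says "sufficient."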

For a richer walkthrough of dynamic RAG with reranking, self-correction, and a real LangGraph implementation, see the Agentic RAG Masterclass. The free RAG Fundamentals primer is the right starting point if you are still building your first single-pass pipeline.

How do you avoid infinite re-planning loops?

3 guardrails. They are the same ones we use in coding agents and they are mandatory here.

First, a hard iteration ceiling. I use 3. Above that, the model is usually chasing context that does not exist in the corpus and you would rather give a partial answer than burn budget. Set the ceiling and trust it.

Second, a check that new sub-queries are actually new. The replanner can get into a loop where it proposes the same sub-query in slightly different words. Compare new queries against the list of already-run queries (case-folded, stripped) and bail out if the planner is repeating itself.

Third, an "out of scope" exit. If the replanner has run twice and the evidence is still insufficient, return what you have plus a note that the answer may be incomplete. This is far better than spinning forever or fabricating.

# filename: guardrails.py
# description: Repeat detection, one of the 3 checks that keep dynamic RAG bounded.
def is_repeat(new_queries: list[str], previous: list[str]) -> bool:
    seen = {q.lower().strip() for q in previous}
    return all(q.lower().strip() in seen for q in new_queries)
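The other 2 guardrails, the iteration ceiling and the partial-answer fallback, fold into a wrapper around the controller. A sketch, with the dependencies passed in so it stands alone (guarded_loop and MAX_INSUFFICIENT are my names, not from the code above):

```python
# Sketch of all 3 guardrails in one place: iteration ceiling, repeat
# detection, and the out-of-scope exit that returns partial evidence.
MAX_INSUFFICIENT = 2  # after 2 insufficient replans, stop and flag the answer


def is_repeat(new_queries: list[str], previous: list[str]) -> bool:
    seen = {q.lower().strip() for q in previous}
    return all(q.lower().strip() in seen for q in new_queries)


def guarded_loop(question, plan, retrieve, replan, max_iterations=3):
    """Returns (evidence_pool, complete). complete=False means the caller
    should append a note that the answer may be incomplete."""
    queries = plan(question)
    evidence_pool = []
    all_queries = list(queries)
    insufficient = 0

    for _ in range(max_iterations):  # guardrail 1: hard iteration ceiling
        for q in queries:
            evidence_pool.extend(retrieve(q))

        evidence_text = "\n---\n".join(evidence_pool[-20:])
        sufficient, new_queries = replan(question, all_queries, evidence_text)

        if sufficient:
            return evidence_pool, True

        insufficient += 1
        # Guardrail 3: out-of-scope exit after repeated insufficiency.
        # Guardrail 2: bail out if the replanner is repeating itself.
        if insufficient >= MAX_INSUFFICIENT or not new_queries or is_repeat(new_queries, all_queries):
            return evidence_pool, False

        queries = new_queries
        all_queries.extend(new_queries)

    return evidence_pool, False
```

The boolean flag is the important part: downstream, a False means the answer template gets the "may be incomplete" note instead of a confident fabrication.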

Without these, dynamic RAG is the fastest way to waste tokens in the entire RAG stack. With them, it is one of the highest-leverage improvements you can make.

When does dynamic RAG beat one-shot RAG?

Use dynamic RAG when at least one of these is true for your workload:

| Question type | One-shot | Dynamic RAG |
| --- | --- | --- |
| Single fact lookup | Wins | Overkill |
| Comparison ("X vs Y") | Often fails | Wins |
| Multi-hop ("who built X and their last project") | Fails | Wins |
| Vague exploratory | Mediocre | Wins |
| High-volume FAQ | Wins | Too expensive |

The decision rule: if your eval set has more than 20 percent comparison or multi-hop questions, dynamic RAG pays for itself in quality. Below that, the extra LLM calls cost more than they earn.

You can also gate it: run a fast classifier on the incoming question to decide whether it needs the full dynamic loop or can be answered by single-pass. Most production systems with both modes run dynamic RAG on 10-20 percent of traffic and single-pass on the rest. That keeps the average cost low and the hard-question quality high.
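The gate does not have to be an LLM call. A keyword heuristic catches most comparison and multi-hop questions for free, and you can fall back to a cheap classifier model only for the ambiguous middle. A sketch; the marker lists are assumptions to tune against your own traffic:

```python
import re

# Hypothetical surface-level markers for hard questions; tune these
# against your own eval set before trusting them in production.
COMPARISON_MARKERS = re.compile(r"\b(vs\.?|versus|compared?|difference between)\b", re.I)
MULTI_HOP_MARKERS = re.compile(r"\b(and (why|how|what|their|its)|between .+ and )\b", re.I)


def needs_dynamic_rag(question: str) -> bool:
    """Route comparison and multi-hop questions to the dynamic loop;
    everything else stays on the cheap single-pass path."""
    return bool(
        COMPARISON_MARKERS.search(question) or MULTI_HOP_MARKERS.search(question)
    )
```

A few misroutes are fine in either direction: a hard question sent to single-pass degrades to today's baseline, and an easy question sent to the dynamic loop just converges in one iteration.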

For a related grounding pattern that pairs well with dynamic RAG, see the JSON Output Parsing for RAG: Grounding with Pydantic post. The combination (dynamic retrieval plus structured grounded output) handles the 2 biggest failure modes in production RAG.

What to do Monday morning

  1. Pull 50 questions from your eval set. Mark each one as single-fact, comparison, multi-hop, or vague. Count the comparison-and-multi-hop bucket.
  2. If that bucket is over 20 percent of your traffic, build the dynamic loop. Start with the plan and replan prompts from this post.
  3. Add the 3 guardrails before you ship: iteration ceiling, repeat detection, partial-answer fallback. They are the difference between a useful loop and a runaway.
  4. Run the same eval set through both pipelines. Compare quality on the comparison and multi-hop bucket specifically. Expect a 15-30 point lift on those questions.
  5. Add a simple classifier in front to route only complex questions through the dynamic loop. Single-pass for the easy 80 percent, dynamic for the hard 20 percent. This is the cheapest way to ship the technique.
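Step 1 reduces to a few lines of Python once the manual labeling pass is done. A sketch, assuming a list of (question, label) pairs from your eval set:

```python
from collections import Counter

# Hypothetical labeled eval set; the labels come from step 1's manual pass.
labeled = [
    ("How do I rotate an API key?", "single_fact"),
    ("Postgres vs Elasticsearch for search?", "comparison"),
    ("Who built the ingest service and what did they build before?", "multi_hop"),
    ("How does this work?", "vague"),
]

counts = Counter(label for _, label in labeled)
hard = counts["comparison"] + counts["multi_hop"]
hard_share = hard / len(labeled)

# Decision rule from above: build the dynamic loop if the hard bucket
# exceeds 20 percent of traffic.
use_dynamic = hard_share > 0.20
```

With a real 50-question set this gives you the number the decision rule needs before you write any pipeline code.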

The headline: dynamic RAG is single-pass RAG plus a while loop and 2 prompts. The complexity is in deciding when to use it, not in building it.

Frequently asked questions

What is dynamic RAG?

Dynamic RAG is a retrieval pattern where the pipeline plans sub-queries, retrieves, evaluates the result, and re-plans new sub-queries if the evidence is insufficient. It replaces the single retrieval pass of naive RAG with a loop that can recover from a bad first retrieval. The loop terminates when the evidence is enough to answer or when an iteration ceiling fires.

How is dynamic RAG different from agentic RAG?

Agentic RAG is the broader term for any RAG system where an LLM-driven planner makes decisions about retrieval. Dynamic RAG is one specific pattern inside agentic RAG: the re-planning loop. Other agentic RAG patterns include query rewriting, reranking, and self-correction. Dynamic RAG composes well with all of them, and most production systems combine several.

When should I use dynamic RAG instead of single-pass RAG?

Use it when your traffic includes comparison questions ("X versus Y"), multi-hop questions, or vague exploratory questions. Single-pass RAG fails quietly on those because one retrieval cannot cover multiple topics. For simple lookups and FAQ-style questions, single-pass is faster and cheaper, so the right answer is usually a hybrid: classify each question and route only the hard ones through the dynamic loop.

How do you prevent dynamic RAG from looping forever?

3 guardrails. A hard iteration ceiling (3 is a good default). A repeat-detection check that bails out if the replanner proposes the same sub-queries twice. A partial-answer fallback that returns whatever evidence is available with a note that the answer may be incomplete. Without all 3, a misconfigured replanner can chew through token budget on questions the corpus cannot answer.

What does the replanner prompt actually do?

It takes the original question, the list of sub-queries already run, and the evidence collected so far, and returns a JSON object with 2 fields: sufficient (whether the evidence is enough to answer) and queries (new sub-queries to run if it is not). The prompt explicitly tells the model to avoid repeating previous queries and to target the gap in the evidence. Both rules are load-bearing.

Key takeaways

  1. Single-pass RAG fails quietly on comparison, multi-hop, and vague questions because one retrieval cannot cover multiple topics.
  2. Dynamic RAG is single-pass plus a while loop. Plan sub-queries, retrieve, observe, decide, re-plan if needed, stop when sufficient.
  3. The whole loop is roughly 80 lines of Python. The intelligence is in the plan and replan prompts, not the controller.
  4. 3 guardrails are mandatory: iteration ceiling, repeat detection, partial-answer fallback. Without them dynamic RAG is the fastest way to burn tokens in production.
  5. Use a classifier to route easy questions to single-pass and hard questions to dynamic. Hybrid mode keeps average cost low and hard-question quality high.
  6. To see this pattern wired into a production agentic RAG pipeline alongside reranking and grounding, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For a survey of the broader agentic RAG design space (including dynamic re-planning, self-correction, and query rewriting), see the Agentic RAG survey paper on arXiv. The patterns there generalize the loop in this post.
