In our last post, we learned about the standard RAG pipeline (if you're choosing tooling, see our RAG framework comparison). It's great, but it's like a diligent intern following a rigid checklist:

  1. Retrieve documents
  2. Stuff them into a prompt
  3. Generate an answer

This works perfectly for simple questions. But what happens when the checklist isn't enough?

You ask the intern, "How does our new product compare to our main competitor's?" The intern checks our internal documents, finds nothing about the competitor, and replies, "I don't have that information."

A senior employee wouldn't stop there. They would recognize the gap, decide to look elsewhere (like a public website), find the missing information, and then synthesize a complete answer.

This is the key insight of Advanced RAG. We must move from a fixed checklist to a dynamic, decision-making process. We'll build a system that can route tasks, grade its own work, and correct its mistakes, just like an expert.

Why does the rigid checklist fail?

First, let's prove the problem. We'll create a "vector store" (our company's internal library) with information about our fictional product, "Model-V", but nothing about its competitors.

# 1. Create our internal-only document collection
documents = [
    "The Model-V is our latest innovation in AI, featuring a 5-trillion parameter architecture.",
    "Model-V's training data includes a proprietary dataset of scientific research papers.",
    "Built on a unique 'Quantum Entanglement' processing core, the Model-V achieves new speeds."
]

# 2. Add the docs to a ChromaDB collection named "product_docs"
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="product_docs")
collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

Now, let's ask our "simple RAG" intern a comparative question:

query = "How does the Model-V compare to the new Model-Z from our competitor?"

# 1. RETRIEVE: Search our collection
retrieved_docs = collection.query(query_texts=[query], n_results=2)

# This finds docs about Model-V, but nothing about Model-Z.

# 2. AUGMENT & GENERATE: Stuff the retrieved docs into a prompt
context = "\n".join(retrieved_docs["documents"][0])
basic_rag_prompt = f"Context: {context}\n\nQuestion: {query}"

# The LLM's (failed) response:
# "The Model-V has a 5-trillion parameter architecture and a
# 'Quantum Entanglement' core. I do not have any information
# on the Model-Z from a competitor."

The simple RAG system fails, if gracefully: it correctly states that it doesn't have the answer, but it never tries to find it. This is where an "agentic" approach becomes necessary.

What is an agentic graph?

Instead of a simple, linear checklist, we'll build a graph. A graph is a set of "nodes" (steps) and "edges" (decisions) that connect them. This allows our system to make choices, loop back, and correct itself.

Our agent's logic will look like this:

graph TD
    A[Start] --> B(Route Query)
    B -- "Internal Question" --> C[Retrieve from Vector Store]
    B -- "External Question" --> D[Search the Web]
    C --> E(Grade Documents)
    E -- "Good Docs" --> F[Generate Answer]
    E -- "Bad Docs" --> D
    D --> F
    F --> G[End]
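The diagram above can be sketched in plain Python. This is a minimal sketch, not a real graph framework (a production build would typically wire these nodes up with something like LangGraph); the node functions are passed in as plain callables so the control flow is easy to see and test.

```python
def run_agent(question, route_query, retrieve, grade_documents,
              web_search, generate):
    """Minimal sketch of the agent graph: route -> (retrieve -> grade) -> generate,
    with a fallback edge from a failed grade to web search."""
    state = {"question": question, "documents": []}

    # Edge: Route Query -> vector store or web search
    if route_query(question) == "vectorstore":
        state["documents"] = retrieve(question)
        # Edge: Grade Documents -> generate, or fall back to web search
        if grade_documents(question, state["documents"]) != "yes":
            state["documents"] = web_search(question)
    else:
        state["documents"] = web_search(question)

    # Final node: synthesize the answer from whatever context survived
    return generate(question, state["documents"])
```

Note that the "Bad Docs" edge is just an `if` on the grader's verdict: self-correction falls out of ordinary control flow once each node is a function.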

Let's look at the key "nodes" or "brain cells" of our new agent.

1. The router (the triage specialist)

The first step is a "Router" node: a small, cheap LLM call whose only job is to decide where to look first.

# The Router's logic
def route_query(question):
    prompt = f"""
    You are an expert at routing a user question.
    Use 'vectorstore' for specific questions about Model-V's features.
    Use 'web_search' for all other questions, especially comparisons.
    
    Question: {question}
    
    Where should I look?
    """
    # LLM call will return "vectorstore" or "web_search"
    return call_llm(prompt)
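One practical caveat: LLMs don't always answer with exactly one word, so it's worth normalizing the router's raw output before using it as an edge in the graph. A defensive parser might look like this (a sketch; `parse_route` is a hypothetical helper, not part of any framework):

```python
def parse_route(raw: str) -> str:
    """Normalize a raw LLM routing answer to one of our two edge labels.
    Defaults to 'vectorstore' so a confused router still produces a
    gradeable path (the grader can then fall back to web search)."""
    text = raw.strip().lower()
    return "web_search" if "web_search" in text else "vectorstore"
```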

2. The tools (the "hands")

The agent needs tools to interact with the world. We'll give it two:

  • Vector Store Retriever: The tool we built in previous lessons to search our internal product_docs.
  • Web Search: A tool that can search the live internet (e.g., using DuckDuckGoSearchRun()).
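In code, the two tools can sit behind a single lookup that the router's decision indexes into. The backends below are stubs standing in for the real thing; in an actual build, `vector_search` would wrap our ChromaDB `product_docs` collection and `web_search` could wrap a live search tool such as LangChain's `DuckDuckGoSearchRun()`.

```python
def vector_search(query):
    # Stub for our ChromaDB retriever over product_docs
    return ["The Model-V is our latest innovation in AI..."]

def web_search(query):
    # Stub for a live web search tool
    return ["Review: the Model-Z ships a 3-trillion parameter 'Photonic' core..."]

# The router's output ("vectorstore" or "web_search") selects the tool
TOOLS = {"vectorstore": vector_search, "web_search": web_search}

def run_tool(route, query):
    return TOOLS[route](query)
```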

3. The grader (the quality control)

This is the most important node. After retrieving documents, the "Grader" node checks if they are actually good enough to answer the question. This is our self-correction loop.

# The Grader's logic
def grade_documents(question, documents):
    prompt = f"""
    You are a grader. Your task is to determine if the
    retrieved documents are relevant and sufficient to
    answer the user's question.
    
    Respond with a single word: 'yes' or 'no'.
    Documents: {documents}
    Question: {question}
    """
    # LLM call will return "yes" or "no"
    return call_llm(prompt)

4. The generator (the voice)

This is the final LLM call we're familiar with. It takes the high-quality, graded context and synthesizes the final answer.
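A sketch of this "Generate" node is below. `call_llm` is the same hypothetical helper the router and grader use; passing it in as an argument keeps the node easy to test in isolation.

```python
def generate_answer(question, documents, call_llm):
    """Final node: synthesize an answer from graded context."""
    context = "\n\n".join(documents)
    prompt = f"""
    Answer the question using only the context below.
    If the context is insufficient, say so plainly.

    Context: {context}
    Question: {question}
    """
    return call_llm(prompt)
```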

How does the self-correcting agent work?

Now, let's run our two queries through this new agentic graph.

Query 1: the failing comparative question

Query: "How does the Model-V compare to the new Model-Z?"

  1. Router: Sees "compare" and "Model-Z". Decides: web_search.

  2. Web Search: Runs a DuckDuckGo search for "Model-V vs Model-Z." Finds snippets about both.

  3. Generate: The LLM gets the web search results (context about both models) and synthesizes a complete answer.

  4. Final Answer: "Model-V features a 5-trillion parameter architecture, while web sources indicate Model-Z has a 3-trillion parameter architecture but a faster 'Photonic' core..."

Success! The agent dynamically chose the right tool.

What if the router had made a mistake?

  1. Router: (Mistakenly) decides: vectorstore.

  2. Retrieve: Finds docs about "Model-V" only.

  3. Grade Documents: The grader looks at the docs and the question. It sees info for Model-V but nothing for Model-Z. Decides: no.

  4. The Loop: The "no" decision routes the agent back to the web_search node.

  5. Web Search: Runs the search, finds the missing info.

  6. Generate: Synthesizes the complete answer.

This self-correction loop makes the system far more reliable: even when one component (here, the router) makes a mistake, the agent detects the failure and recovers.

Query 2: the simple internal question

Query: "Tell me about the processing core of Model-V."

  1. Router: Sees "processing core" and "Model-V". Decides: vectorstore.

  2. Retrieve: Finds the internal doc: "Built on a unique 'Quantum Entanglement' processing core..."

  3. Grade Documents: The grader sees the doc is perfectly relevant. Decides: yes.

  4. Generate: The LLM synthesizes the answer from the retrieved doc.

  5. Final Answer: "The Model-V uses a unique 'Quantum Entanglement' processing core..."

The agent correctly and efficiently answered the question using only internal data, avoiding an unnecessary and slower web search.

Frequently asked questions

When do you need agentic RAG instead of simple RAG pipelines?

Agentic RAG is necessary when you need multiple retrieval sources (internal docs, web search, APIs) and your queries require tool selection. Simple RAG assumes one fixed retrieval strategy and fails on comparative questions. The post shows this with a Model-V vs Model-Z query. Agentic systems route to the right tool first, then self-correct if they chose wrong. This routing plus grading loop is what makes them fundamentally different from simple RAG.

How does document grading improve RAG reliability?

Document grading evaluates whether retrieved documents actually answer the question before the LLM generates a response. Without grading, irrelevant context gets stuffed into the prompt and the LLM generates plausible-sounding wrong answers. The post demonstrates this: when the router picks only internal docs for a comparative query, the grader detects the gap and loops back to web search. This self-correction prevents hallucinations from bad context. One extra LLM call for grading eliminates an entire class of failures.

Should you use web search in your RAG system?

Web search is valuable when your queries span information outside your private documents. The post shows this clearly: a comparative question about Model-V vs a competitor requires web search because internal docs don't cover competitors. However, web search adds latency, API costs, and potential hallucination from unreliable sources. Use it selectively via the router node, not on every query. The key insight is routing logic: let the LLM decide whether to search internally first, and only call web search when grading indicates the internal docs are insufficient.

For the full reference, see the Anthropic agents guide.

Key takeaways

  • Graphs > chains for complexity: Linear chains are fine for simple, fixed tasks. Agentic graphs (using tools like LangGraph) are required for systems that must make decisions, handle branching logic, and recover from errors.
  • Routing adds efficiency: An intelligent router stops the agent from wasting time and money (API calls) searching in the wrong places.
  • Self-correction adds reliability: By grading its own work, the agent can identify its own failures (bad retrieval) and take corrective action (like falling back to a web search).
  • Agents are systems, not prompts: This approach shifts our thinking from just "prompt engineering" to "systems engineering". We are building a logical, stateful system where the LLM is just one (very smart) component.

For more on building production AI systems, check out our AI Bootcamp for Software Engineers.

