Advanced RAG with tools: using LlamaIndex
In our last lessons, we built a RAG system from the ground up. We had to manage every single step: loading, chunking (see our chunking guide), embedding, and storing (see our vector databases guide).
This was a great way to learn, but it's like building a car by mining the iron ore yourself. For real-world projects, it's smarter to start with a pre-built engine and chassis.
This is the key insight: Frameworks like LlamaIndex act as a powerful toolkit for RAG. They provide pre-built, optimized components for everything, letting us build sophisticated systems much faster.
How do you ingest PDFs easily?
Remember all the steps we took to manually chunk, embed, and store our text? LlamaIndex handles complex files like PDFs in just two lines.
It bundles a PDF loader, a smart text splitter, an embedding model, and a vector store into one simple command.
# LlamaIndex abstracts away all the complexity
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
# 1. Load all documents from a folder (it handles PDFs, .txt, etc.)
documents = SimpleDirectoryReader("./data").load_data()
# 2. Build the index. This automatically handles:
# - Chunking
# - Embedding
# - Storing in a vector index
book_index = VectorStoreIndex.from_documents(documents)
# 3. Get a query engine and ask a question
query_engine = book_index.as_query_engine()
response = query_engine.query("What is the main character's goal?")
print(response)
What took us dozens of lines before is now done in two. This abstraction is the primary power of using a framework.
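Those defaults are also configurable. If you later need control over chunk size or the embedding model, LlamaIndex exposes them through its global Settings object. A minimal sketch (the specific values are illustrative, and the OpenAI embedding class assumes the llama-index-embeddings-openai package is installed):
# Override the bundled defaults before building an index
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.chunk_size = 512                 # size used by the default text splitter
Settings.chunk_overlap = 64               # overlap between consecutive chunks
Settings.embed_model = OpenAIEmbedding()  # embedding model used at index time
# Any index built after this point picks up these settings.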
How do you reason over multiple books?
What if the answer isn't in one document? What if it's spread across two different data sources?
For example, you have:
- Book Index: The full text of a novel.
- Quotes Index: A separate document with just famous quotes.
A user asks: "What is a famous quote by Dumbledore about dreams, and what is the context of dreams in the book?"
A simple RAG system would fail. We need an engine that can:
- Break the question into sub-questions.
- Route each sub-question to the correct tool (the right "book").
- Combine the answers.
LlamaIndex calls this a Sub-Question Query Engine.
graph TD
A["Complex Query: 'Plot? and Quote?'"] --> B(Sub-Question Engine)
B --> C["Sub-Question 1: 'Plot?'"]
B --> D["Sub-Question 2: 'Quote?'"]
C --> E["Tool 1: Book Index"]
D --> F["Tool 2: Quotes Index"]
E & F --> G(Synthesize Final Answer)
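The code below assumes we already have one query engine per data source. A minimal sketch of how those might be built (the ./quotes folder is a hypothetical second source):
# Build a separate index and query engine per data source
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
book_docs = SimpleDirectoryReader("./data").load_data()
quotes_docs = SimpleDirectoryReader("./quotes").load_data()
book_query_engine = VectorStoreIndex.from_documents(book_docs).as_query_engine()
quotes_query_engine = VectorStoreIndex.from_documents(quotes_docs).as_query_engine()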
We give each of our indexes a "tool" wrapper with a clear description:
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# 1. Create a tool for each of our "books"
book_tool = QueryEngineTool.from_defaults(
query_engine=book_query_engine,
name="Book_Content",
description="Useful for answering questions about the plot, characters, and events in the book."
)
quotes_tool = QueryEngineTool.from_defaults(
query_engine=quotes_query_engine,
name="Famous_Quotes",
description="Useful for finding specific quotes from Albus Dumbledore."
)
# 2. Build the Sub-Question Engine
# This engine uses the tool descriptions to route questions
sub_question_engine = SubQuestionQueryEngine.from_defaults(
query_engine_tools=[book_tool, quotes_tool]
)
# 3. Ask the complex question
response = sub_question_engine.query(
"What is a Dumbledore quote about dreams, and what is the context of dreams in the book?"
)
The engine uses the tool descriptions to figure out where to find each piece of the answer. It asks the Famous_Quotes tool for the quote and the Book_Content tool for the context, then combines them.
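You can verify this routing yourself. The response object carries the retrieved chunks that fed the answer, so a quick inspection (a sketch; attribute names follow llama_index.core's standard Response object) shows which source each piece came from:
# Print the synthesized answer, then the chunks it was built from
print(response)
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:100])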
How does the ReAct agent think and act?
This is the most advanced step. What if the answer isn't in any of our local documents?
We need to give our system a new tool: Web Search.
But now the system has to choose which tool to use. For this, we build a ReAct Agent. "ReAct" is short for Reason + Act: the agent runs a Reason → Act → Observe loop that lets the LLM "think" through a plan before answering.
graph TD
A[User Query] --> B("Reason: 'I need to find X'")
B --> C("Act: 'I will use Tool Y'")
C -- "Tool (e.g., Web Search)" --> D("Observe: 'Here's the result'")
D --> E{Is question answered?}
E -- No --> B
E -- Yes --> F[Final Answer]
The agent's "thinking" process looks like this (a code sketch of the loop follows the list):
- Reason: The user is asking for a real-world fact, not in my book.
- Act: I will choose the Web_Search tool.
- Observe: The web search gives me the fact.
- Reason: I now have the answer.
- Act: I will generate the final response.
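Conceptually, the agent is just running a loop. Here is a simplified sketch in plain Python, not LlamaIndex's actual internals; llm_choose_action and tools are hypothetical stand-ins for the LLM call and the tool registry:
# Schematic ReAct loop (illustrative, not framework code)
def react_loop(query, tools, llm_choose_action, max_steps=5):
    scratchpad = []  # running transcript of actions and observations
    for _ in range(max_steps):
        # Reason: the LLM reads the query plus everything observed so far
        action = llm_choose_action(query, scratchpad)
        if action.name == "final_answer":
            return action.input  # the model decided it has enough information
        # Act: call the chosen tool with the model's chosen input
        observation = tools[action.name](action.input)
        # Observe: record the result so the next Reason step can see it
        scratchpad.append((action, observation))
    return "No answer found within the step limit."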
We build this by giving the agent a list of all its available tools:
from langchain_community.tools import DuckDuckGoSearchRun  # search helper borrowed from LangChain
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

# 1. Create a new tool for web search
web_search_tool = FunctionTool.from_defaults(
fn=DuckDuckGoSearchRun().run,
name="Web_Search",
description="Useful for searching the web for information not in the local documents."
)
# 2. Create a *new* list of ALL tools
all_tools = [book_tool, quotes_tool, web_search_tool]
# 3. Build the ReAct Agent
agent = ReActAgent.from_tools(all_tools, verbose=True)
# 4. Ask a question that REQUIRES web search
response = agent.chat(
"What was the real-world historical inspiration for the Philosopher's Stone?"
)
The agent (with verbose=True) will show its "thoughts". It will first check the Book_Content tool, fail to find the answer, and then decide to use the Web_Search tool to find the real-world history of alchemy, all in one "run".
Frequently asked questions
How do I query answers across multiple data sources in RAG?
Use a sub-question query engine. It breaks your query into targeted sub-questions, routes each to the correct tool (your different indexes), and synthesizes the answers together. LlamaIndex does this with tool descriptions: give each data source a tool wrapper explaining what it contains, and the engine routes intelligently. This beats treating all sources the same, which fails when an answer requires combining information from both a quotes index and narrative context.
How do I add web search to a RAG system?
Build a ReAct agent and give it multiple tools: local search and web search. The agent reasons about which tool fits the question, acts by calling it, observes the result, and decides whether it needs more. This planning loop lets production systems handle questions beyond your documents: the agent checks local search first, fails, then pivots to web search without manual routing logic.
Should I use a framework like LlamaIndex or build RAG from scratch?
Use a framework. Frameworks compress weeks of manual work (chunking strategies, embedding pipelines, vector store setup) into pre-built, tested components. LlamaIndex gets you to production in days rather than weeks of building from scratch. You get sensible defaults and only customize where real workloads demand it. The abstraction pays for itself the moment you hit your second data source or need a tool decision.
For the full reference, see the LlamaIndex documentation.
Key takeaways
- Frameworks accelerate development: Tools like LlamaIndex handle the "plumbing" of RAG (loading, chunking, embedding), letting you focus on building high-level logic
- Sub-questioning solves multi-doc RAG: The SubQuestionQueryEngine is a powerful way to break down complex questions and get answers from multiple different data sources
- Agents need tools to be powerful: A RAG system limited to its own documents is a "closed book". By giving an agent tools (like web search), you create an "open-book" system that can answer a much wider range of questions
- ReAct is a core thinking loop: The Reason → Act → Observe cycle is how you build agents that can reason, make decisions, and use different tools to solve a problem
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
Take the next step
- RAG Fundamentals Workshop: Build a production RAG pipeline hands-on
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.