In our last post, we built a powerful, multi-hop RAG agent. It's smart, it's complex, and it's... slow. And expensive.

Our agent's "brain" (the planner, the executor, the generator) runs entirely on one big, powerful LLM. Every single step is a costly, high-latency API call.

This post is for you if you've ever built a powerful agent and then faced the hard questions from your team:

  • "Why is this so slow?" (Latency)
  • "Why is our OpenAI bill so high?" (Cost)

Today, we'll learn how to optimize our RAG agent by balancing the eternal triangle: Speed, Quality, and Cost.

What's the problem with the one-size-fits-all engine?

We're building a race car, and we're using a massive, 800-horsepower V8 engine (like GPT-4o) to power everything: the wheels, the windshield wipers, and the radio.

graph TD
    A[Query] --> B(Router GPT-4o: $5.00/M)
    B --> C(Retrieve)
    C --> D(Grade Docs GPT-4o: $5.00/M)
    D --> E(Generate Answer GPT-4o: $5.00/M)
    E --> F[Answer]
    
    style B fill:#ffebee,stroke:#b71c1c
    style D fill:#ffebee,stroke:#b71c1c
    style E fill:#ffebee,stroke:#b71c1c

Why this is bad:

  • Cost: The Router and Grader nodes are simple classification tasks ("web" vs. "vector", "yes" vs. "no"). Using our most expensive model for these is an incredible waste of money.
  • Speed (Latency): The Router has to finish before anything else can even start. Using a slow, powerful model here adds 2-3 seconds of "dead air" for the user.

What is asymmetric agent design?

We need to stop using one engine. A production-grade agent uses an asymmetric model stack.

  • Nano Models (e.g., gpt-4o-mini, Phi-3-mini, Llama-3-8B):
    • Jobs: Simple, structured tasks.
    • Use for: Routing, Grading, Data Extraction.
    • Gives us: ⚡️ Speed and 💰 Low Cost.
  • Power Models (e.g., GPT-4o, Claude 3.5 Sonnet):
    • Jobs: Complex, creative, nuanced tasks.
    • Use for: Final Answer Generation.
    • Gives us: ✨ Quality.

Let's redesign our agent's brain:

graph TD
    A[Query] --> B(Router GPT-4o-mini: $0.15/M)
    B --> C(Retrieve)
    C --> D(Grade Docs GPT-4o-mini: $0.15/M)
    D --> E(Generate Answer GPT-4o: $5.00/M)
    E --> F[Answer]
    
    style B fill:#e8f5e9,stroke:#388e3c
    style D fill:#e8f5e9,stroke:#388e3c
    style E fill:#e3f2fd,stroke:#0d47a1

Observation: The Router and Grader calls just got roughly 30x cheaper per token ($5.00/M down to $0.15/M), and the agent got noticeably faster. The user perceives the app as "fast" because the slow Generate step only happens after all the fast routing and retrieval are done.
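To put rough numbers on the savings, here is a back-of-the-envelope comparison using the per-million-token prices from the diagrams. The 1,000-input-tokens-per-node figure is an illustrative assumption (output-token pricing is ignored for simplicity); the per-call cost of the cheap nodes drops about 97%, while the overall savings depend on how your tokens are split across nodes:

```python
# Back-of-the-envelope cost comparison (input tokens only, prices per 1M tokens).
# The 1,000-tokens-per-node figure is an illustrative assumption.
PRICE_PER_M = {"gpt-4o": 5.00, "gpt-4o-mini": 0.15}
TOKENS_PER_NODE = 1_000

def pipeline_cost(models):
    """Cost of one query, given the model used at each node."""
    return sum(PRICE_PER_M[m] * TOKENS_PER_NODE / 1_000_000 for m in models)

# One engine: GPT-4o for routing, grading, AND generation
all_gpt4o = pipeline_cost(["gpt-4o", "gpt-4o", "gpt-4o"])

# Asymmetric stack: mini for routing and grading, GPT-4o only for generation
asymmetric = pipeline_cost(["gpt-4o-mini", "gpt-4o-mini", "gpt-4o"])

print(f"all GPT-4o:  ${all_gpt4o:.6f}/query")   # $0.015000
print(f"asymmetric:  ${asymmetric:.6f}/query")  # $0.005300
print(f"savings:     {1 - asymmetric / all_gpt4o:.0%}")  # 65%
```

With equal token counts per node, the overall saving is about 65%; in practice it climbs much higher, because the cheap nodes (especially the grader, which reads every retrieved document) usually process far more tokens than the generator.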

How do you optimize your RAG pipeline?

Let's look at three practical ways to optimize for Speed, Cost, and Quality.

1. Speed: parallel retrieval (async)

Problem: In our multi-hop agent, we wait for the internal search to finish, then we (maybe) do a web search. This is sequential and slow.

Solution: Do both at the same time.

graph TD
    subgraph SEQUENTIAL["Sequential (Slow)"]
        A[Query] --> B(Internal RAG)
        B --> C(Web Search)
        C --> D[Answer]
    end
    
    subgraph PARALLEL["Parallel (Fast)"]
        E[Query] --> F(Run in Parallel)
        F --> G[Internal RAG]
        F --> H[Web Search]
        G & H --> I(Combine & Rank)
        I --> J[Answer]
    end
    
    style A fill:#ffebee,stroke:#b71c1c
    style E fill:#e8f5e9,stroke:#388e3c

The "How" (Python asyncio):

Instead of calling functions one by one, we use asyncio.gather to run them concurrently.

# filename: example.py
# description: Run internal and web retrieval concurrently.
import asyncio

async def retrieve_internal(query):
    # ... (code for internal RAG)
    internal_docs = []  # placeholder: your vector-store results
    return internal_docs

async def retrieve_web(query):
    # ... (code for web search)
    web_docs = []  # placeholder: your web-search results
    return web_docs

async def run_parallel_retrieval(query):
    print("--- Starting parallel retrieval ---")
    
    # asyncio.gather runs BOTH coroutines concurrently
    results = await asyncio.gather(
        retrieve_internal(query),
        retrieve_web(query)
    )
    
    # 'results' is [internal_docs, web_docs], in the order passed to gather
    all_docs = results[0] + results[1]
    return all_docs
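To see the latency win concretely, here is a self-contained version of the same pattern, with asyncio.sleep standing in for the real retrieval calls (the 0.5-second delays are made-up stand-ins for API latency):

```python
import asyncio
import time

async def retrieve_internal(query):
    await asyncio.sleep(0.5)  # stand-in for vector-store latency
    return [f"internal doc for {query!r}"]

async def retrieve_web(query):
    await asyncio.sleep(0.5)  # stand-in for web-search latency
    return [f"web doc for {query!r}"]

async def run_parallel_retrieval(query):
    results = await asyncio.gather(
        retrieve_internal(query),
        retrieve_web(query),
    )
    return results[0] + results[1]

start = time.perf_counter()
docs = asyncio.run(run_parallel_retrieval("What is Model-V?"))
elapsed = time.perf_counter() - start

print(docs)
print(f"elapsed: {elapsed:.2f}s")  # roughly 0.5s, not 1.0s, since both run concurrently
```

Run sequentially, the two 0.5s calls would take about a second; with asyncio.gather the total stays close to the slowest single call.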

2. Cost: asymmetric models

Problem: Our Router and Grader nodes are using our most expensive GPT-4o model.

Solution: Explicitly tell those nodes to use a cheaper, faster model.

The "How":

When we define our nodes, we just pass a different model name to each call. (Note: with the official openai Python SDK, the model is chosen per request, not on the client.)

# One client; we pick the model per call
from openai import OpenAI

client = OpenAI()

# Our "cheap and fast" model for simple tasks
CHEAP_MODEL = "gpt-4o-mini"

# Our "smart and slow" model for the final answer
SMART_MODEL = "gpt-4o"

def route_query(state):
    # ... (prompt) ...
    # This call is FAST and CHEAP
    response = client.chat.completions.create(model=CHEAP_MODEL, ...)
    return ...

def grade_documents(state):
    # ... (prompt) ...
    # This call is ALSO FAST and CHEAP
    response = client.chat.completions.create(model=CHEAP_MODEL, ...)
    return ...

def generate(state):
    # ... (prompt) ...
    # This is our ONLY EXPENSIVE call
    response = client.chat.completions.create(model=SMART_MODEL, ...)
    return {"generation": ...}

3. Quality: re-ranking

Problem: We want to give the LLM the "best" context.

  • If we Retrieve(k=3), we might miss a key fact. (Low Recall)
  • If we Retrieve(k=20), we flood the LLM with 19 irrelevant docs. (Low Precision)

Solution: A 2-step process. First, retrieve many docs (for high recall), then use a second, lightweight "Re-Ranker" model to find the best ones (for high precision).

graph TD
    A[Query] --> B(1. Retrieve k=20)
    B --> C[20 Documents]
    C --> D(2. Re-Ranker Find best 3)
    D --> E[Top 3 Documents]
    E --> F(3. Generate)
    F --> G[Answer]
    
    style B fill:#e3f2fd,stroke:#0d47a1
    style D fill:#e0f2f1,stroke:#00695c

Observation: This "Retrieve-then-Rank" pattern is a standard for high-quality RAG. It gives the Generator the "best of both worlds": a wide search (high recall) and focused context (high precision).
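Here is a minimal sketch of the retrieve-then-rank pattern. The keyword-overlap score is just a placeholder for a real re-ranker (in production you would use a cross-encoder model, e.g. a sentence-transformers cross-encoder or a hosted rerank API); the documents are illustrative:

```python
import string

def _terms(text):
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(string.punctuation) for w in text.lower().split()}

def rerank(query, docs, top_k=3):
    """Score each doc against the query and keep the best top_k.

    A real re-ranker would replace `score` with a cross-encoder model;
    keyword overlap here just illustrates the pattern.
    """
    query_terms = _terms(query)

    def score(doc):
        return len(query_terms & _terms(doc))

    return sorted(docs, key=score, reverse=True)[:top_k]

# Step 1: retrieve broadly (k=20 in the diagram); a few toy docs here
candidates = [
    "Model-V is our internal forecasting system.",
    "The cafeteria menu changes weekly.",
    "Model-V was deployed to production in 2023.",
    "Quarterly planning happens in January.",
    "Model-V uses a transformer architecture.",
]

# Step 2: re-rank down to the best 3 before generation
top_docs = rerank("What is Model-V?", candidates, top_k=3)
print(top_docs)
```

The generator now sees only the three most relevant documents instead of the full, noisy candidate set.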

Challenge for you

  1. Use Case: Your agent's generate node (using GPT-4o) is still 90% of your total cost.
  2. The Problem: Many user questions are simple, like "What is Model-V?" They don't need GPT-4o's power.
  3. Your Task: How would you implement a "Cost-Saving" step before the generate node? (Hint: Think about a new "Grader" node. What would it grade? How could it route to different generator models?)

Frequently asked questions

How do I reduce RAG latency without sacrificing quality?

Use parallel retrieval combined with asymmetric model design. Routing and grading decisions are simple classification tasks, not complex generation, so use cheap fast models for those nodes. Only the final answer generation needs your expensive model. This works because latency is dominated by your slowest sequential step, and separating concerns lets you parallelize retrieval before the slow generation step runs.

Why does my RAG agent have such a high API bill?

You're using expensive models for tasks that don't need them. A single LLM for routing, grading, and generation wastes money on simple classification jobs. Use nano models like Llama 3.1 8B or gpt-4o-mini for routing decisions and grading, which are fast, structured tasks, and reserve GPT-4o only for final generation. Those cheap calls cost roughly 30x less per token, and in a multi-hop agent the routing and grading nodes often run more than once per query, so the savings compound quickly.

Should I use different models for each step of my RAG pipeline?

Yes. Use asymmetric model stacks where model cost and power match task complexity. Routing and grading are binary or categorical decisions, not creative generation, so cheap models suffice. Only final generation needs your power model. The user perceives your agent as fast because latency bottlenecks disappear and API costs plummet when you stop using one overspecced engine for everything.

For the full reference, see the original RAG paper by Lewis et al.

Key takeaways

  • Asymmetric model design saves money: Use cheap, fast models for simple tasks (routing, grading) and expensive models only for complex generation
  • Parallel retrieval reduces latency: Running multiple retrieval steps concurrently cuts total wait time significantly
  • Re-ranking improves quality: Retrieve many documents for recall, then re-rank to select the best few for precision
  • Optimize the critical path: The user perceives speed based on the longest sequential path; optimize that first
  • Cost and quality are trade-offs: Simple questions don't need expensive models; complex questions do, so route accordingly

For more on system optimization, see our streaming at scale guide and our concurrency guide.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.

