Query rewriting in RAG with LLMs: the rewrite loop
Your users ask "how does this work" and your retriever returns random chunks
The user types "how does this work". Your retriever embeds the string and returns 5 chunks about random topics because "how does this work" is semantically uninformative. The agent produces a generic answer and the user is unhappy. The problem is not the retriever. The problem is that the query is too vague to retrieve on.
The fix is query rewriting: a cheap LLM call between the user's input and the retriever. The rewriter takes the raw query plus recent conversation context and produces 1-3 rephrased queries that are specific enough to retrieve on. The retriever then runs on the rewrites, not the original.
This post is the query rewriting pattern for RAG: the prompt that produces good rewrites, the multi-query fan-out that improves recall, when rewriting actually hurts, and the 30-50 percent recall improvement you can measure on vague queries.
Why does a vague query break RAG?
Because the retriever only sees the text of the query, and embedding-based retrieval depends on that text having enough signal. 3 specific failure modes:
- Pronoun resolution. After the user asked about LangGraph, "how does it work?" gives a stateless retriever nothing: "it" has no referent in the query text.
- Over-generic queries. "How do I fix this?" has no topic. The retriever returns random popular chunks.
- Multi-step intent. "Show me examples and explain the pattern" has 2 intents. One query cannot retrieve for both.
Query rewriting fixes all 3 by expanding the query into one or more specific, retrieval-friendly strings that reflect the user's actual intent.
graph LR
User["User query: how does it work?"] --> Rewrite[LLM rewriter]
History[Conversation history] --> Rewrite
Rewrite --> Q1["Rewrite 1: LangGraph state flow"]
Rewrite --> Q2["Rewrite 2: LangGraph node execution order"]
Q1 & Q2 --> Retrieve[Retriever]
Retrieve --> Merge[Merged chunks]
Merge --> LLM[Final LLM answer]
style Rewrite fill:#dbeafe,stroke:#1e40af
style Merge fill:#dcfce7,stroke:#15803d
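To make the three failure modes concrete, here are hypothetical before-and-after pairs. The rewrites are illustrative examples of what a good rewriter should produce, not real model output:

```python
# Illustrative only: vague query -> the kind of rewrites a good rewriter
# should produce for each failure mode. Not actual model output.
EXAMPLES = {
    # Pronoun resolution (user previously asked about LangGraph)
    "How does it work?": [
        "LangGraph state flow between nodes",
        "LangGraph node execution order",
    ],
    # Over-generic query (user previously hit a pip error)
    "How do I fix this?": [
        "pip dependency resolution conflict fix",
    ],
    # Multi-step intent: one query per intent
    "Show me examples and explain the pattern": [
        "LangGraph agent code examples",
        "LangGraph agent design pattern explanation",
    ],
}

for vague, rewrites in EXAMPLES.items():
    print(f"{vague!r} -> {rewrites}")
```

Each rewrite names the topic explicitly, which is exactly the signal the embedding model needs.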
What does the rewriter prompt look like?
Tight, with the conversation context included. Ask for 2-3 specific rewrites in JSON.
# filename: app/rag/rewriter.py
# description: LLM-based query rewriter for vague user questions.
import json

from anthropic import Anthropic

client = Anthropic()

REWRITE_PROMPT = """Rewrite the user's query into 2-3 specific, retrieval-friendly search queries. The rewrites should:
- Resolve pronouns using the conversation history
- Split compound intents into separate queries
- Expand vague terms with specific technical vocabulary
- Preserve the original meaning

Conversation history (last 3 turns):
{history}

User query: {query}

Output ONLY JSON:
{{"rewrites": ["specific query 1", "specific query 2"]}}
"""

def rewrite_query(query: str, history: list[dict]) -> list[str]:
    history_text = '\n'.join(
        f'{turn["role"]}: {turn["content"][:200]}' for turn in history[-3:]
    )
    reply = client.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=300,
        messages=[{'role': 'user', 'content': REWRITE_PROMPT.format(
            history=history_text, query=query,
        )}],
    )
    return json.loads(reply.content[0].text.strip())['rewrites']
3 decisions to note. The prompt asks for 2-3 rewrites, not one, to catch multi-intent queries. The conversation history is truncated to the last 3 turns (200 chars each) to keep the rewriter fast. Use Haiku because rewriting is a simple task that does not need the flagship model.
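One more decision worth making explicit: the model occasionally wraps the JSON in a code fence or returns something unparsable, and a single bad rewrite call should not fail the whole request. A defensive wrapper, sketched here with the rewriter passed in as a parameter (`safe_rewrite` is a hypothetical helper, not part of the post's pipeline), falls back to retrieving on the raw query:

```python
import json
from typing import Callable

def safe_rewrite(query: str, history: list[dict],
                 rewriter: Callable[[str, list[dict]], list[str]]) -> list[str]:
    """Call the rewriter; fall back to the raw query on any parse failure."""
    try:
        rewrites = rewriter(query, history)
    except (json.JSONDecodeError, KeyError, IndexError, ValueError):
        return [query]  # worst case: retrieve on the original query
    # Guard against an empty rewrite list as well
    return rewrites if rewrites else [query]
```

The fallback means rewriting can only add recall, never take the pipeline down.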
For the broader agentic RAG pattern that uses rewriting as one node in a graph, see the Agentic RAG with LangGraph post.
How do you merge results from multiple rewrites?
Query each rewrite against the retriever in parallel, union the results, deduplicate, and pass the top k to the LLM.
# filename: app/rag/multi_query.py
# description: Multi-query retrieval using rewrites for better recall.
import asyncio
from collections import OrderedDict

from app.rag.rewriter import rewrite_query

async def multi_query_retrieve(query: str, history: list[dict], retriever, k: int = 5):
    # rewrite_query is a blocking SDK call; run it off the event loop
    rewrites = await asyncio.to_thread(rewrite_query, query, history)
    # Query each rewrite in parallel
    tasks = [retriever.search(r, k=k) for r in rewrites]
    all_results = await asyncio.gather(*tasks)
    # Deduplicate by doc_id, preserving rank order
    seen = OrderedDict()
    for results in all_results:
        for chunk in results:
            if chunk.doc_id not in seen:
                seen[chunk.doc_id] = chunk
    return list(seen.values())[:k]
Total latency: 500 ms for the rewrite + 50 ms for parallel retrieval = ~550 ms added to the RAG pipeline. The recall improvement on vague queries is typically 30-50 percent, which translates directly into better answers.
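The first-come deduplication above favors whichever rewrite happens to be listed first. If that bothers you, reciprocal rank fusion (RRF) is a common alternative merge: each document scores the sum of 1/(c + rank) across every result list it appears in, so chunks retrieved by several rewrites float to the top. A sketch, operating on lists of doc ids for brevity (the constant c = 60 is the conventional default, not something from this post's pipeline):

```python
from collections import defaultdict

def rrf_merge(result_lists: list[list[str]], k: int = 5, c: int = 60) -> list[str]:
    """Merge ranked result lists with reciprocal rank fusion.

    Each doc id scores sum(1 / (c + rank)) over every list it appears in,
    so docs retrieved by multiple rewrites rank above single-list hits.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (c + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

A doc that ranks third in two lists now beats a doc that ranks first in one, which is usually the behavior you want from a fan-out.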
When does query rewriting hurt?
3 specific cases where rewriting makes things worse.
- Clear, specific queries. If the user already types a precise technical query ("how do I fix SQLAlchemy QueuePool limit errors in async FastAPI?"), the rewriter adds latency at best and dilutes precision at worst.
- High-latency paths. Adding 500 ms to the pipeline is expensive if your p95 budget is tight.
- Exact-match queries. Lookup queries like "show me the refund policy" do not need rewriting; they just need precise retrieval.
Route only vague queries through the rewriter. A cheap classifier up front picks the path. For the classifier pattern, see the Dynamic RAG: re-planning retrieval strategies post.
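The classifier does not have to be an LLM call. A heuristic version is often enough to start: short queries and context-dependent pronouns are strong vagueness signals. A sketch, where the pronoun list and word-count cutoff are assumptions to tune against your own logs:

```python
import re

# Assumption: short queries and context-dependent pronouns signal vagueness.
# Tune both the pronoun list and the cutoff against your production logs.
_PRONOUNS = re.compile(r"\b(it|this|that|these|those|one)\b", re.IGNORECASE)

def is_vague(query: str, min_words: int = 6) -> bool:
    """Route a query to the rewriter if it is short or pronoun-heavy."""
    return len(query.split()) < min_words or bool(_PRONOUNS.search(query))
```

A heuristic router costs microseconds, so the 500 ms rewrite is paid only on the queries that need it.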
How do you measure rewriting quality?
Track 2 metrics on a labeled eval set.
- Recall@k improvement. For each query, compare retrieval recall with and without rewriting. Expect 20-50 percent improvement on vague queries, near-zero on clear queries.
- Answer correctness lift. For each query, grade the final answer (LLM-as-judge) with and without rewriting. Expect 5-15 percent improvement on vague queries.
If recall improves but correctness does not, the rewriter is adding noise. Tune the rewrite prompt to be less aggressive.
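Recall@k here means the fraction of the labeled relevant doc ids that appear in the top k retrieved. A minimal scorer to run over the eval set, sketched with doc ids as strings (the eval-pair format is an assumption, not a fixed schema from this post):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant doc ids found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall_lift(pairs: list[tuple[list[str], list[str], set[str]]],
                     k: int = 5) -> float:
    """Average recall@k delta (with rewriting minus without) over eval pairs.

    Each pair is (retrieved_with_rewriting, retrieved_without, relevant_ids).
    """
    deltas = [
        recall_at_k(with_rw, relevant, k) - recall_at_k(without_rw, relevant, k)
        for with_rw, without_rw, relevant in pairs
    ]
    return sum(deltas) / len(deltas)
```

Run it once on the vague-query slice and once on the clear-query slice; the lift should concentrate in the first.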
For the LLM-as-judge evaluation pattern, see the LLM-as-a-judge production evaluation framework post.
What to do Monday morning
- Identify 20-30 vague queries from production logs. These are your test set.
- Add the rewriter to your RAG pipeline. Start with Haiku and the 3-rewrite prompt from this post.
- Run the test queries with and without rewriting. Measure recall@5 and final answer quality.
- If recall improves by more than 20 percent on vague queries, ship it. Otherwise tune the prompt.
- Add a cheap classifier to route only vague queries through rewriting. Clear queries skip it.
- Monitor rewriter latency. If it exceeds 800 ms, drop the number of rewrites from 3 to 2.
The headline: query rewriting is a 50-line Haiku call that lifts recall by 30-50 percent on vague queries. Route only vague queries through it. 550 ms added, much better answers returned.
Frequently asked questions
What is query rewriting in RAG?
Query rewriting is an LLM call that transforms the user's raw query into 1-3 retrieval-friendly rewrites before the retriever runs. The rewrites resolve pronouns, expand vague terms, and split compound intents. The retriever then searches on the rewrites, which typically have 30-50 percent better recall than the original query on vague inputs.
When should I use query rewriting?
For vague user queries like "how does it work" or "what about the other one" that depend on conversation context or contain pronouns. Also for compound queries that combine multiple intents. Skip rewriting for clear, specific queries that already have enough signal for the retriever.
Does query rewriting add too much latency?
About 500 ms for the rewrite call plus 50 ms for parallel retrieval across rewrites. If your p95 latency budget is under 2 seconds, this is acceptable. If it is under 1 second, route only vague queries through the rewriter and skip it for clear ones.
How many rewrites should I generate?
Two or three. One rewrite misses multi-intent queries. Four or more adds latency and token cost without significant recall improvement. The sweet spot is 2-3 rewrites per query, tuned to your specific query distribution.
Can query rewriting hurt retrieval quality?
Yes, on clear specific queries. An over-aggressive rewriter may broaden a precise query into less precise ones, diluting retrieval quality. Route only vague queries through the rewriter, and measure before and after on a labeled eval set to confirm the rewriter is helping more than hurting.
Key takeaways
- Vague queries break vector retrieval because the text is too generic to embed meaningfully. Query rewriting fixes this with a cheap LLM call before retrieval.
- Use Haiku or gpt-4o-mini for rewriting. The task is simple and does not need the flagship model.
- Generate 2-3 rewrites per query, retrieve on all of them in parallel, merge and deduplicate the results.
- Route only vague queries through the rewriter. Clear queries skip it to save latency.
- Measure recall@k and answer correctness with and without rewriting on a labeled eval set. Expect 20-50 percent recall improvement on vague queries.
- To see query rewriting wired into a full production agentic RAG pipeline with LangGraph, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the original multi-query retrieval pattern used by LangChain, see the LangChain MultiQueryRetriever docs.