LLM-based content filtering for RAG pipelines
Your retriever returns 10 chunks and 6 of them are noise
Your RAG retriever returns the top 10 chunks by vector similarity. You send all 10 to the final LLM. The first 4 chunks are exactly what you need. Chunks 5 through 10 are topically close but not actually relevant. The LLM reads all 10, gets distracted by the 6 irrelevant ones, and produces an answer that blends relevant facts with plausible-sounding noise.
The fix is a cheap LLM filtering step between retrieval and generation. A small model (Haiku, gpt-4o-mini) reads each chunk alongside the query and answers one question: "is this chunk relevant to the query?" Irrelevant chunks are dropped before the final answer. The expensive LLM only sees signal.
This post covers the LLM content filtering pattern for RAG: the filter prompt, the batch pattern that keeps it fast, the cost trade-off, and the 40 percent average noise reduction you can expect.
Why does simple retrieval return so much noise?
Because vector similarity is an approximation. Two chunks with similar word distributions score similarly even when only one is actually relevant to the question. 3 specific failure modes:
- Same-topic noise. Your query is "how do I rotate JWT keys?" and the retriever returns a chunk about JWT signing algorithms. Topically related, not what was asked.
- Boilerplate contamination. Docs often contain boilerplate ("see the X documentation for details") that scores high on similarity but adds no signal.
- Near-duplicate drift. Multiple chunks cover the same concept. Keeping all 5 wastes context tokens on redundant information.
An LLM filter fixes all 3 because it can read the chunk against the query and make a semantic decision, not just a geometric one.
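The gap is easy to reproduce with a toy bag-of-words cosine similarity. This is a hypothetical illustration (real pipelines use dense embeddings, and the chunks here are invented), but the failure shape is the same: the noise chunk shares vocabulary with the query, so it scores nearly as high as the chunk that actually answers it.

```python
# Toy bag-of-words cosine similarity: a topically related but useless
# chunk scores close to the genuinely relevant one.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    # Cosine similarity over lowercase whitespace tokens.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

query = "how do I rotate JWT keys"
relevant = "to rotate JWT keys generate a new key pair and publish both keys during the overlap window"
noise = "JWT keys sign tokens using the RS256 algorithm see the JWT documentation for details"

print(round(cosine(query, relevant), 2))
print(round(cosine(query, noise), 2))
```

Both scores land well above zero and within a few hundredths of each other; a similarity threshold cannot cleanly separate them, which is exactly the decision the LLM filter makes semantically.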
```mermaid
graph LR
    Query[Query] --> Retrieve[Vector retrieval: top 10]
    Retrieve --> Filter[LLM filter: relevant?]
    Filter -->|relevant| Keep[Keep: 4 chunks]
    Filter -->|not relevant| Drop[Drop: 6 chunks]
    Keep --> LLM[Final LLM call]
    style Filter fill:#dbeafe,stroke:#1e40af
    style Keep fill:#dcfce7,stroke:#15803d
    style Drop fill:#fee2e2,stroke:#b91c1c
```
What does the filter prompt look like?
Tight, yes/no, with the query and chunk inline. The filter model does not need to produce prose; it just needs a boolean decision with a one-line reason.
```python
# filename: app/rag/llm_filter.py
# description: LLM-based content filter for RAG retrieval results.
import json

from anthropic import Anthropic

client = Anthropic()

FILTER_PROMPT = """You are a strict relevance filter for a retrieval pipeline.

Question: {question}

Chunk:
---
{chunk}
---

Is the chunk directly relevant to answering the question? A chunk is "relevant" only if it contains a fact, code example, or statement that would help construct the answer. Topically-related but not-actually-useful chunks are NOT relevant.

Output ONLY JSON in this shape:
{{"relevant": true | false, "reason": "one short sentence"}}
"""

def filter_chunk(question: str, chunk: str) -> dict:
    reply = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=100,
        messages=[{"role": "user", "content": FILTER_PROMPT.format(question=question, chunk=chunk)}],
    )
    return json.loads(reply.content[0].text.strip())
```
3 decisions make the filter work. The prompt explicitly distinguishes "relevant" from "topically related." Output is strict JSON so parsing is reliable. Use a cheap model (Haiku) so the filter cost is negligible compared to the final LLM call.
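One hedge worth adding around the `json.loads` call: small models occasionally wrap the JSON in a code fence or a stray sentence despite the "ONLY JSON" instruction. A `parse_filter_reply` helper (hypothetical, not part of the snippet above) can extract the first JSON object and fail open, keeping the chunk, so a parse error never silently hurts recall:

```python
# Hypothetical hardening for the parse step: extract the first {...}
# object from the model reply and fail OPEN (keep the chunk) if parsing
# fails, so a malformed filter reply never costs recall.
import json
import re

def parse_filter_reply(text: str) -> dict:
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"relevant": True, "reason": "kept: unparseable filter reply"}

print(parse_filter_reply('```json\n{"relevant": false, "reason": "off-topic"}\n```'))
```

Failing open trades a little residual noise for recall; flip the fallback to `False` if your final model is especially distractible and you would rather drop a chunk than risk noise.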
For the related quote extraction pattern that takes filtering further, see the Advanced RAG quote extraction for context compression post.
How do you filter chunks in parallel?
Run all filter calls concurrently with asyncio.gather. 10 chunks, 1 model, parallel execution. Total filter latency is roughly equal to a single call (~500 ms for Haiku) instead of 10x (5 seconds).
```python
# filename: app/rag/filter_batch.py
# description: Parallel chunk filtering with asyncio.gather.
import asyncio
import json

from anthropic import AsyncAnthropic

from app.rag.llm_filter import FILTER_PROMPT

async_client = AsyncAnthropic()

async def filter_chunks(question: str, chunks: list[str]) -> list[str]:
    tasks = [filter_chunk_async(question, chunk) for chunk in chunks]
    results = await asyncio.gather(*tasks)
    return [chunk for chunk, result in zip(chunks, results) if result.get("relevant")]

async def filter_chunk_async(question: str, chunk: str) -> dict:
    # Same as filter_chunk above, but non-blocking via AsyncAnthropic.
    reply = await async_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=100,
        messages=[{"role": "user", "content": FILTER_PROMPT.format(question=question, chunk=chunk)}],
    )
    return json.loads(reply.content[0].text.strip())
```
Parallel filtering adds about 500 ms to the RAG pipeline. The savings come from sending 40 percent fewer tokens to the final (expensive) LLM call, which typically saves 1-3 seconds of latency and 30-50 percent of token cost.
What does the cost trade-off look like?
For a typical RAG pipeline processing a 10-chunk retrieval:
| Stage | No filter | With filter |
|---|---|---|
| Retrieval | 50 ms, $0 | 50 ms, $0 |
| Filter (Haiku) | 0 ms, $0 | 500 ms, $0.002 |
| Final LLM (Sonnet) | 6000 ms, $0.025 | 4000 ms, $0.012 |
| Total | 6050 ms, $0.025 | 4550 ms, $0.014 |
Filtering adds 500 ms in parallel and $0.002 in filter cost, but saves 2000 ms and $0.013 on the final call. Net: 25 percent faster, 44 percent cheaper per query.
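The arithmetic behind the table can be checked with a back-of-envelope helper. The stage numbers are the illustrative figures above, not measurements from any particular pipeline, and `pipeline` is a hypothetical function for this sketch:

```python
# Back-of-envelope pipeline totals using the illustrative numbers from
# the table above. The filter runs in parallel, so it adds one latency hop.
def pipeline(stages: dict[str, tuple[int, float]]) -> tuple[int, float]:
    # stages: name -> (latency_ms, dollars); stages run sequentially.
    total_ms = sum(ms for ms, _ in stages.values())
    total_usd = round(sum(usd for _, usd in stages.values()), 6)
    return total_ms, total_usd

no_filter = pipeline({"retrieval": (50, 0.0), "final_llm": (6000, 0.025)})
with_filter = pipeline({"retrieval": (50, 0.0), "filter": (500, 0.002), "final_llm": (4000, 0.012)})

print(no_filter)    # (6050, 0.025)
print(with_filter)  # (4550, 0.014)
print(f"{1 - with_filter[0] / no_filter[0]:.0%} faster, {1 - with_filter[1] / no_filter[1]:.0%} cheaper")
```

Swap in your own stage latencies and per-call costs to see whether the filter pays for itself at your traffic profile.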
For the broader RAG cost optimization workflow, see the Agent cost optimization from trace data post.
When should you NOT use LLM filtering?
3 cases where filtering is overkill.
- Very small retrieval (k=3). The overhead of 3 filter calls is not worth it. Just send all 3 to the final LLM.
- High-accuracy retrieval. If your retriever already has recall@k above 95 percent on a labeled eval set, the filter is removing almost nothing and just adds cost.
- Latency-critical paths. If your p95 budget is under 2 seconds, even the 500 ms filter overhead may be too much. Use reranking instead, which is cheaper and nearly as effective.
For retrieval that still needs to improve, see the Retriever k-value tuning for RAG post.
What to do Monday morning
- Measure your current retrieval recall@k on a labeled eval set. If it is below 90 percent at k=10, filtering will help.
- Add the filter step between retrieval and generation. Use Haiku or gpt-4o-mini for the filter model.
- Run filter calls in parallel via asyncio.gather. Total filter latency stays under 600 ms.
- Measure before and after on 50 test queries. Expect a 30-50 percent reduction in context tokens and a 20-40 percent reduction in final LLM latency.
- Alert if the filter drops more than 80 percent of chunks on a typical query. That means your retriever is returning mostly noise and needs tuning at the retrieval level, not just filtering.
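The last checklist item can be a small guard in the pipeline. This is a hypothetical sketch; the 0.8 threshold is the 80 percent figure from the checklist, and `check_drop_rate` is not part of the filter code above:

```python
# Hypothetical guard: warn when the filter drops more than 80 percent of
# retrieved chunks, which signals a retriever problem, not a filter problem.
import logging

logger = logging.getLogger("rag.filter")

def check_drop_rate(retrieved: int, kept: int, threshold: float = 0.8) -> bool:
    # Returns True when the drop rate is acceptable, False when it trips the alert.
    drop_rate = 1 - kept / retrieved if retrieved else 0.0
    if drop_rate > threshold:
        logger.warning("filter dropped %.0f%% of %d chunks; tune the retriever", drop_rate * 100, retrieved)
        return False
    return True

print(check_drop_rate(10, 4))  # True
print(check_drop_rate(10, 1))  # False: 90 percent dropped
```

Wire the return value into whatever alerting you already have; the point is to catch retriever regressions before the filter quietly papers over them.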
The headline: LLM-based content filtering is a 40-line addition to any RAG pipeline that saves 30-50 percent of final LLM cost without losing recall. Run the filter on Haiku, run the answer on Sonnet, ship better answers faster.
Frequently asked questions
Why filter retrieval results with an LLM instead of just improving the retriever?
Because no retriever is perfect, and the gap between vector similarity and semantic relevance means 30-50 percent of top-k results are topically related but not actually useful. An LLM filter reads each chunk semantically and drops the noise. Retriever improvements and filtering are complementary, not alternatives.
Does LLM filtering add too much latency?
Not when you run it in parallel. 10 filter calls with asyncio.gather on Haiku add about 500 ms total. The savings come from sending fewer tokens to the final expensive LLM, which typically saves 1-3 seconds. Net: filtering makes the pipeline faster, not slower.
What model should I use for the filter step?
The cheapest, fastest reasoning-capable model your provider offers. Haiku or gpt-4o-mini. The filter task is simple: is this chunk relevant to the query, yes or no. You do not need Sonnet or gpt-4 for this, and using them defeats the cost argument.
How do I know if filtering is actually helping?
Measure token count per query and final latency on 50 test queries before and after adding the filter. Filtering should drop context tokens by 30-50 percent and total latency by 20-40 percent. If neither metric improves, either your retriever is already high-quality or your filter prompt is too permissive.
When should I use reranking instead of LLM filtering?
When p95 latency is under 2 seconds. Cross-encoder rerankers are faster than LLM filters (sub-50 ms for 10 chunks) and nearly as effective. Use the LLM filter when you need the best possible noise reduction and can spend 500 ms on it. Use reranking when every millisecond matters.
Key takeaways
- Vector retrieval returns 30-50 percent noise even with a well-tuned retriever. LLM filtering drops the noise semantically.
- Use a cheap model (Haiku, gpt-4o-mini) for the filter. The filter task is yes/no, not complex reasoning.
- Run filter calls in parallel with asyncio.gather. Adds 500 ms in parallel, saves 1-3 seconds on the final LLM call.
- Typical savings: 30-50 percent token reduction, 20-40 percent latency reduction on the expensive generation step.
- Skip filtering for small k (k=3), high-recall retrievers, or latency-critical paths. Reranking is the alternative for latency-bound systems.
- To see LLM filtering wired into a full production RAG pipeline with reranking and grounding, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the LangChain documentation on LLM-based relevance filters and document compressors, see the LangChain contextual compression guide.