Your RAG pipeline has zero tests and you ship changes by running curl manually

Your RAG pipeline works. You know because you ran 3 curl commands against the /chat endpoint and they looked reasonable. You shipped. A week later a colleague's PR broke the retriever and nobody noticed for 2 days because nothing caught it. The only test was you, and you were on vacation.

The fix is a 3-layer test strategy: unit tests for pure functions (chunking, embeddings, parsers), integration tests for the retrieval-generation loop with stubbed dependencies, and eval tests against a labeled set that measure quality. Each layer catches different bugs, runs at a different cadence, and costs a different amount.

This post is the RAG testing strategy: the 3 layers, what belongs in each, the fixture patterns that make tests fast, and the CI wiring that runs unit + integration on every PR and eval on a nightly cadence.

Why do RAG pipelines get shipped without tests?

Because "eval" feels like the only test that matters, and eval is slow and expensive, so nobody runs it frequently. Meanwhile unit and integration tests are cheap but developers skip them thinking "the real test is eval." 3 specific failure modes:

  1. Silent refactors. You rename a tool or change a Pydantic schema. No unit test catches it because nobody wrote one. The first person to try the feature in staging discovers the breakage.

  2. Integration drift. The retriever and the generator work fine in isolation, but the interface between them changes subtly, say the retriever starts returning dicts where the prompt builder expects strings. No integration test catches it. Production quality drops.

  3. Eval-only shipping. Nightly eval catches regressions but only after they are merged. You ship a broken change, nightly eval flags it, and you roll back the next day. 12 hours of bad quality shipped.

A 3-layer strategy catches bugs at each stage: unit tests in under a second (every save), integration tests in under a minute (every PR), eval tests in 5-10 minutes (nightly).

graph TD
    Save[Dev saves file] --> Unit[Unit tests: <1s]
    Unit -->|pass| PR[PR opened]
    PR --> Integration[Integration tests: <1min]
    Integration -->|pass| Merge[Merge]
    Merge --> Nightly[Nightly eval: 5-10min]
    Nightly -->|fail| Alert[Slack alert]
    Nightly -->|pass| Confidence[Ship next day]

    style Unit fill:#dcfce7,stroke:#15803d
    style Integration fill:#dbeafe,stroke:#1e40af
    style Nightly fill:#fef3c7,stroke:#b45309

What goes in the unit test layer?

Pure functions. Zero I/O. Chunker output shape, embedding vector dimension, Pydantic schema validation, prompt template rendering.

# filename: tests/unit/test_chunker.py
# description: Unit test for a pure chunking function.
from app.rag.chunker import chunk_by_paragraph


def test_chunks_by_paragraph():
    text = "Para 1.\n\nPara 2.\n\nPara 3."
    chunks = chunk_by_paragraph(text, max_chars=20)
    assert len(chunks) == 3
    assert chunks[0] == "Para 1."


def test_chunker_respects_max_chars():
    text = "x" * 100
    chunks = chunk_by_paragraph(text, max_chars=30)
    assert all(len(c) <= 30 for c in chunks)

Unit tests should cover: chunking logic, prompt template rendering with edge cases, Pydantic schema validation, any pure utility in the RAG module. Run every save via pytest-watch.
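Prompt template rendering is just as testable as chunking. A minimal sketch, assuming a hypothetical build_prompt helper (not from the post's codebase) that joins retrieved chunks into a context block:

```python
# filename: tests/unit/test_prompt.py
# description: Unit tests for prompt rendering edge cases.
# build_prompt is a hypothetical helper, defined inline for illustration.
def build_prompt(question: str, chunks: list[str]) -> str:
    context = '\n\n'.join(chunks) if chunks else '(no context retrieved)'
    return f'Context:\n{context}\n\nQuestion: {question}'


def test_prompt_includes_all_chunks():
    prompt = build_prompt('q?', ['chunk a', 'chunk b'])
    assert 'chunk a' in prompt and 'chunk b' in prompt


def test_prompt_handles_empty_retrieval():
    # Edge case: retrieval came back empty, the template must not crash.
    prompt = build_prompt('q?', [])
    assert '(no context retrieved)' in prompt
```

The empty-retrieval case is exactly the kind of edge case that never shows up in a manual curl check but breaks quietly after a refactor.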

For the broader modular architecture that unit tests sit inside, see the Modular architectures for agentic AI post.

What goes in the integration test layer?

The full retrieval-generation loop with stubbed infrastructure. Real chunker, real prompt templates, but fake retriever and fake LLM. No network, no real API calls.

# filename: tests/integration/test_rag_flow.py
# description: Integration test for the full RAG pipeline with fakes.
import pytest
from app.rag.pipeline import rag_answer
from dataclasses import dataclass


@dataclass
class FakeRetriever:
    chunks: list[str]

    async def search(self, query: str, k: int = 5):
        return [{'content': c, 'score': 0.9 - i * 0.1} for i, c in enumerate(self.chunks)]


class FakeLLM:
    def __init__(self, response='stubbed answer'):
        self.response = response
        self.last_prompt = None

    async def complete(self, messages):
        self.last_prompt = messages
        return self.response


@pytest.mark.asyncio
async def test_rag_happy_path():
    retriever = FakeRetriever(chunks=['relevant text 1', 'relevant text 2'])
    llm = FakeLLM('the answer is 42')
    result = await rag_answer('what is the meaning of life?', retriever, llm)
    assert '42' in result
    # The prompt sent to the LLM should include both chunks
    prompt_text = str(llm.last_prompt)
    assert 'relevant text 1' in prompt_text
    assert 'relevant text 2' in prompt_text

Integration tests should cover: retriever → prompt assembly → LLM → output parsing, error branches (empty retrieval, malformed LLM output), and any multi-step loop logic. Run on every PR in CI.
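The error branches deserve their own tests. A sketch of the empty-retrieval case; rag_answer_sketch and its fallback string are illustrative stand-ins for the real app.rag.pipeline.rag_answer, which may behave differently:

```python
# description: Sketch of an error-branch integration test (empty retrieval).
import asyncio


class EmptyRetriever:
    async def search(self, query, k=5):
        return []  # simulates a retrieval miss


class RefusingLLM:
    async def complete(self, messages):
        raise AssertionError('LLM must not be called with empty context')


async def rag_answer_sketch(question, retriever, llm):
    # Stand-in pipeline: real code lives in app.rag.pipeline.
    chunks = await retriever.search(question)
    if not chunks:
        # Error branch: skip the LLM call entirely on a retrieval miss.
        return "I couldn't find anything relevant."
    context = '\n'.join(c['content'] for c in chunks)
    return await llm.complete([{'role': 'user', 'content': f'{context}\n\n{question}'}])


async def test_empty_retrieval_returns_fallback():
    result = await rag_answer_sketch('unknown topic', EmptyRetriever(), RefusingLLM())
    assert "couldn't find" in result


asyncio.run(test_empty_retrieval_returns_fallback())
```

The RefusingLLM fake doubles as an assertion: if the pipeline ever calls the LLM on an empty retrieval, the test fails loudly instead of silently generating from no context.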

What goes in the eval test layer?

Full pipeline with real dependencies, run against a labeled eval set. Measures retrieval recall, answer quality, and specific failure modes like hallucination rate.

# filename: tests/eval/test_rag_quality.py
# description: Nightly eval test against a labeled question set.
import json
from pathlib import Path
from app.rag.pipeline import rag_answer
from app.eval.judge import judge
import pytest


EVAL_SET = json.loads(Path('tests/eval/fixtures/eval_set.json').read_text())


@pytest.mark.nightly
@pytest.mark.asyncio
async def test_rag_quality(real_retriever, real_llm):
    # real_retriever and real_llm are pytest fixtures (defined in conftest.py)
    # that construct the production retriever and LLM clients.
    scores = []
    for item in EVAL_SET:
        actual = await rag_answer(item['question'], real_retriever, real_llm)
        verdict = judge(item['question'], item['expected'], actual, '')
        scores.append(verdict.mean)
    avg = sum(scores) / len(scores)
    assert avg >= 4.0, f'Quality dropped to {avg:.2f}'

The @pytest.mark.nightly marker keeps this out of the default test run, provided the marker is registered and the default run excludes it (a bare marker does nothing on its own). A separate CI job runs pytest -m nightly on a cron schedule.
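One possible wiring, as a pytest.ini sketch. The marker name matches the post; the addopts line is what actually excludes nightly tests from every default pytest invocation:

```ini
# filename: pytest.ini (sketch)
[pytest]
markers =
    nightly: slow, expensive eval tests, run on a cron schedule only
addopts = -m "not nightly"
```

With this in place, plain pytest runs unit and integration tests only, and the nightly CI job overrides the filter with pytest -m nightly.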

For the LLM-as-a-judge framework that the eval uses, see the LLM-as-a-judge production evaluation framework post.

How do you avoid flakiness in eval tests?

3 controls that reduce noise in LLM-based eval.

  1. Use temperature=0 on both the agent and the judge. Reduces sampling variance.
  2. Use CoT grading with explicit rationales. Cuts judge variance by 3-4x.
  3. Aggregate across multiple runs. Run the eval 3 times and take the median score. Protects against the occasional single-run outlier.

With all 3, the eval score varies by less than 0.1 points across runs, which is tight enough to detect real regressions.
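The median-of-3 control is a few lines. A sketch, where run_eval_once is a stand-in for a function that executes the full eval set once and returns the average judge score:

```python
# description: Median-of-3 eval runs to smooth single-run outliers.
from statistics import median


def median_eval_score(run_eval_once, runs: int = 3) -> float:
    # run_eval_once: stand-in callable that runs the eval set once
    # and returns the average judge score for that run.
    return median(run_eval_once() for _ in range(runs))


# Simulated runs: the single outlier (3.1) is discarded by the median.
scores = iter([4.2, 3.1, 4.3])
print(median_eval_score(lambda: next(scores)))  # 4.2
```

A mean over the same three runs would report 3.87 and trip a 4.0 threshold; the median ignores the one bad run while still catching a sustained drop.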

What to do Monday morning

  1. Pick one pure function in your RAG pipeline (chunker, parser, utility). Write 3 unit tests for it. Run them. Watch them pass.
  2. Write one integration test for the full retrieval-generation loop with fake retriever and fake LLM. Run it. Verify it catches a broken prompt template.
  3. Build a labeled eval set of 30 question+expected pairs. Use it in an eval test marked @pytest.mark.nightly.
  4. Wire unit + integration into your PR CI. Wire eval into a nightly GitHub Actions job.
  5. After 2 weeks, check how often each layer catches a bug. Unit tests should catch 40 percent of issues, integration 40 percent, eval 20 percent. If the split is off, add tests to the weak layer.
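The nightly job from step 4 might look like this as a GitHub Actions sketch. The workflow name, Python version, requirements path, and the OPENAI_API_KEY secret are all assumptions to adapt to your repo:

```yaml
# filename: .github/workflows/nightly-eval.yml (sketch)
name: nightly-eval
on:
  schedule:
    - cron: '0 3 * * *'  # every night at 03:00 UTC
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: pytest -m nightly
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Pair it with a failure notification step (Slack, email) so a red nightly run is seen before the next day's merges.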

The headline: RAG pipelines deserve 3 test layers. Unit for pure functions, integration for the loop with fakes, eval for quality. Each catches different bugs at different costs. Run them on the matching cadence and ship with confidence.

Frequently asked questions

Why isn't eval enough as the only test for RAG?

Because eval is slow (5-10 minutes) and expensive, so nobody runs it on every PR. That means regressions ship and are caught only the next day by nightly eval. Unit tests (under a second) and integration tests (under a minute) catch mechanical bugs on every save and every PR, before eval even runs.

What belongs in a RAG unit test?

Pure functions with no I/O: chunker logic, prompt template rendering, Pydantic schema validation, parser output shape. No database, no LLM, no retriever. Unit tests should run in under 100 ms each and be safe to run on every file save.

What belongs in a RAG integration test?

The full retrieval-generation loop with stubbed infrastructure. Real chunker, real prompt templates, but FakeRetriever and FakeLLM implementations. Integration tests verify that the loop wires up correctly and that the LLM sees the expected prompt given a known retrieval result. Run them on every PR in under 1 minute total.

How do I build a labeled eval set?

Sample 30-100 real production queries. For each, manually identify the expected answer (or the documents that contain it). This labeled set is the ground truth for nightly eval. Rotate it every 3 months so it reflects current queries, and keep it in a fixtures file next to the test code.
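The fixture file might look like this. The field names match the eval test earlier in the post; the content is illustrative:

```json
[
  {
    "question": "How do I rotate my API key?",
    "expected": "Go to Settings > API Keys, click Rotate, and update clients within 24 hours."
  },
  {
    "question": "What is the default rate limit?",
    "expected": "60 requests per minute per API key."
  }
]
```

Keeping it as plain JSON next to the tests means a PR that changes the eval set is reviewed like any other code change.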

How do I prevent eval tests from being flaky?

Three controls. Use temperature=0 on both the agent and the judge. Use chain-of-thought grading with explicit rationales (cuts judge variance by 3-4x). Run the eval 3 times and take the median score. Together these reduce variance to under 0.1 points across runs.

Key takeaways

  1. RAG pipelines need 3 test layers: unit (pure functions), integration (loop with fakes), eval (quality with labeled set). Each catches different bugs.
  2. Unit tests run in under 1 second and cover chunking, prompt templates, and Pydantic schemas. Run on every save.
  3. Integration tests run in under 1 minute with FakeRetriever and FakeLLM. Verify the loop wires correctly without touching real APIs.
  4. Eval tests run nightly against a labeled set. Use LLM-as-a-judge with CoT grading and temperature=0 to minimize noise.
  5. Wire unit + integration into PR CI. Wire eval into a separate nightly cron. Do not mix them.
  6. To see this 3-layer test strategy in a full production RAG pipeline, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the pytest documentation on fixtures, markers, and test organization, see the pytest good practices guide.
