Advanced RAG with query rewriting and evaluation
Add query rewriting, sub-graphs, PII scrubbing, and RAGAS scoring to a production RAG pipeline.
Evaluation is the quiet skill that separates engineers who ship LLM apps from engineers who demo them. Without a real evaluation harness, every prompt change is a vibe-check, every model upgrade is a coin toss, and every RAG tweak is a hope. With one, you change the system on purpose and know the next morning whether it got better.
Curated by Param Harrison
These courses teach the practical version. Build a fixture set of representative questions, pick measurable output properties (exact match, semantic similarity, rubric score, tool-call sequence), run RAGAS or a custom LLM-as-judge, and wire it into CI so regressions block merges. Nothing academic; every pattern is tuned to a real failure mode you will actually hit.
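A minimal sketch of the first two steps (a fixture set plus one measurable property) might look like the following. It is an illustration under stated assumptions, not the courses' own harness: the `Fixture` type, `run_eval`, the sample questions, and the stand-in `fake_answer_fn` are all hypothetical, and exact match is used only because it is the simplest of the properties listed above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Fixture:
    question: str
    expected: str  # reference answer used for exact-match scoring


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not a failure."""
    return " ".join(text.lower().split())


def exact_match(predicted: str, expected: str) -> bool:
    return normalize(predicted) == normalize(expected)


def run_eval(fixtures: list[Fixture], answer_fn: Callable[[str], str]) -> dict:
    """Run every fixture through the system and collect the failures."""
    failures = []
    for fx in fixtures:
        predicted = answer_fn(fx.question)
        if not exact_match(predicted, fx.expected):
            failures.append((fx.question, predicted, fx.expected))
    passed = len(fixtures) - len(failures)
    return {"pass_rate": passed / len(fixtures), "failures": failures}


# Hypothetical fixtures; a real set would hold representative user questions.
FIXTURES = [
    Fixture("What is the capital of France?", "Paris"),
    Fixture("How many days are in a leap year?", "366"),
]


def fake_answer_fn(question: str) -> str:
    """Stand-in for the real LLM or RAG pipeline (assumption for this sketch)."""
    canned = {"What is the capital of France?": "  paris "}
    return canned.get(question, "unknown")


report = run_eval(FIXTURES, fake_answer_fn)
print(report["pass_rate"])
for question, predicted, expected in report["failures"]:
    print(f"FAILED: {question!r} got {predicted!r}, expected {expected!r}")
```

Keeping the failures as readable tuples rather than only an aggregate score is deliberate: with a small fixture set, reading each failing case is usually more informative than the pass rate alone.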
Common questions
An eval harness is premature before you have a prompt worth protecting. For a throwaway prototype, eyeball testing is fine. The moment a change to the prompt could hurt real users, you want an eval harness.
Use both RAGAS and LLM-as-judge, depending on what you are measuring. RAGAS is built for RAG quality (context recall for retrieval, faithfulness for whether the answer is grounded in the retrieved context). LLM-as-judge is a general-purpose pattern for rubric scoring on open-ended outputs. Production evals usually combine them.
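The LLM-as-judge pattern can be sketched as a rubric prompt plus a strict parser for the reply. This is an illustrative sketch, not a specific library's API: the prompt wording, the 1 to 5 scale, and the `call_judge` callable (which would wrap whatever model client you use) are assumptions, and the judge is stubbed here so the example runs offline.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric prompt; the wording and the 1-5 scale are assumptions.
RUBRIC_PROMPT = """You are grading an answer against a rubric.
Rubric: {rubric}
Question: {question}
Answer: {answer}
Reply with a single line of the form "SCORE: <integer 1-5>"."""


def judge_score(
    question: str,
    answer: str,
    rubric: str,
    call_judge: Callable[[str], str],
) -> Optional[int]:
    """Ask the judge model for a rubric score; return None if the reply is unparsable."""
    prompt = RUBRIC_PROMPT.format(rubric=rubric, question=question, answer=answer)
    reply = call_judge(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return "The answer is mostly grounded.\nSCORE: 4"


score = judge_score(
    question="What does the refund policy cover?",
    answer="Refunds are available within 30 days of purchase.",
    rubric="Award 5 only if every claim is supported by the provided context.",
    call_judge=stub_judge,
)
print(score)
```

Returning `None` for an unparsable reply, rather than guessing a score, keeps malformed judge output visible as its own failure category.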
Start with a fixture set of around 30 to 100 questions. That is big enough to catch real regressions and small enough that you can read every failure case. Grow it only when a specific failure slips through and you need a fixture to protect against it.
Two patterns keep an LLM judge trustworthy. First, use a stronger model as the judge than the one being graded. Second, periodically spot-check judge output against human review so you know how well the two correlate. Both are covered in the advanced RAG evaluation course.
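One way to quantify the spot-check is to score the same sample with the judge and with a human reviewer, then compare the two lists. The sketch below computes a Pearson correlation and a within-tolerance agreement rate using only the standard library; the sample scores are made up for illustration, and the choice of these two statistics is an assumption rather than the course's prescribed method.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def agreement_rate(judge: Sequence[int], human: Sequence[int], tolerance: int = 1) -> float:
    """Fraction of items where judge and human differ by at most `tolerance` points."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    close = sum(1 for j, h in zip(judge, human) if abs(j - h) <= tolerance)
    return close / len(judge)


# Made-up rubric scores (1-5) for the same six answers.
human_scores = [5, 4, 2, 1, 3, 4]
judge_scores = [5, 4, 3, 1, 2, 4]

print(f"correlation: {pearson(judge_scores, human_scores):.2f}")
print(f"exact agreement: {agreement_rate(judge_scores, human_scores, tolerance=0):.2f}")
```

A falling correlation over successive spot-checks is a signal to revisit the rubric or the judge model before trusting further automated scores.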
Run the full eval on main, run a fast subset on every PR, and fail the build when a core metric drops below threshold. The pattern is closer to integration testing than unit testing, and the course has a ready-to-adapt GitHub Actions setup.
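The "fail the build" step can be as small as a script that compares eval metrics against minimum thresholds and returns a non-zero exit code. The sketch below is not the course's GitHub Actions setup; the metric names, threshold values, and function names are hypothetical, and the metrics dictionary stands in for whatever your eval run actually produces.

```python
# Hypothetical minimums; tune them to your own baseline.
THRESHOLDS = {"faithfulness": 0.85, "context_recall": 0.80}


def find_regressions(metrics: dict, thresholds: dict) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    problems = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            problems.append(f"{name}: missing from eval results")
        elif value < minimum:
            problems.append(f"{name}: {value:.2f} is below threshold {minimum:.2f}")
    return problems


def gate(metrics: dict, thresholds: dict) -> int:
    """Print any regressions and return a process exit code (0 = pass, 1 = fail)."""
    problems = find_regressions(metrics, thresholds)
    for problem in problems:
        print(f"EVAL REGRESSION: {problem}")
    return 1 if problems else 0


# Made-up results from a fast PR subset run.
pr_metrics = {"faithfulness": 0.81, "context_recall": 0.86}
exit_code = gate(pr_metrics, THRESHOLDS)
print(f"exit code: {exit_code}")
# In a CI job you would end the script with: raise SystemExit(exit_code)
```

Treating a missing metric as a failure is a deliberate choice here: a silently skipped eval should block a merge just as a low score does.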