When is it too early for evals?

Before you have a prompt worth protecting. For a throwaway prototype, eyeball testing is fine. The moment a change to the prompt could hurt real users, you want an eval harness.

RAGAS or LLM-as-judge?

Both, depending on what you are measuring. RAGAS is built for RAG pipelines: it scores retrieval quality (context recall) and whether the answer is grounded in the retrieved context (faithfulness). LLM-as-judge is a general-purpose pattern for rubric scoring on open-ended outputs. Production evals usually combine them.
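
To make the LLM-as-judge half concrete, here is a minimal sketch of rubric scoring. The judge model is passed in as a plain callable (`call_judge` is a hypothetical stand-in for whatever client you use), and the rubric wording, the 1 to 5 scale, and the `SCORE:` reply format are assumptions for illustration, not part of RAGAS or any specific library.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; replace with criteria that match your product.
RUBRIC = """Score the answer from 1 to 5:
5 = fully correct and grounded in the context
3 = partially correct or missing key details
1 = incorrect or contradicts the context
Reply with a line of the form 'SCORE: <n>' followed by a one-sentence reason."""


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Assemble the text sent to the judge model."""
    return (
        f"{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Answer: {answer}\n"
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score; return None if the judge ignored the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(question: str, context: str, answer: str,
          call_judge: Callable[[str], str]) -> Optional[int]:
    """Score one answer; call_judge is an assumed wrapper around your judge model."""
    return parse_score(call_judge(build_judge_prompt(question, context, answer)))
```

Returning `None` for an unparseable reply, rather than guessing a score, keeps malformed judge output visible instead of silently skewing the average.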

How big should my eval set be?

Around 30 to 100 questions to start. Big enough to catch real regressions, small enough that you can read every failure case. Grow it only when a specific failure slips through and you need a fixture to protect against it.
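
A starter fixture set can be as simple as a list of records plus a loop that hands back every failure for reading. The sketch below is one possible shape, not a prescribed format: `EvalCase`, `Failure`, `run_eval`, and the sample questions are all made up for illustration, and `app` stands in for your own pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    case_id: str
    question: str
    expected: str
    note: str = ""  # why this case exists, e.g. the regression it guards against


@dataclass
class Failure:
    case: EvalCase
    actual: str


def run_eval(cases: List[EvalCase], app: Callable[[str], str]) -> List[Failure]:
    """Run every case and return the failures so each one can be read by hand."""
    failures = []
    for case in cases:
        actual = app(case.question)
        # Case-insensitive comparison; swap in a richer check as needed.
        if actual.strip().lower() != case.expected.strip().lower():
            failures.append(Failure(case, actual))
    return failures


# Made-up starter cases; a real set would hold roughly 30 to 100 of these.
CASES = [
    EvalCase("refund-window", "How long do I have to request a refund?", "30 days"),
    EvalCase("support-hours", "When is support available?", "weekdays"),
]

# A failure slipped through in production, so it becomes a permanent fixture.
CASES.append(
    EvalCase("refund-gift", "Can gift purchases be refunded?", "no",
             note="regression: app answered 'yes'")
)
```

Recording the reason in `note` keeps the set honest as it grows: each added case points back to the failure that justified it.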

How do I trust an LLM to grade another LLM?

Two patterns. First, use a stronger model as judge than the one being graded. Second, periodically spot-check judge output against human review so you know how closely the judge's scores track human judgment. Both are covered in the advanced RAG evaluation course.
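
The spot-check can be reduced to two numbers: how strongly judge and human scores correlate, and how often they land within a point of each other. This is a minimal sketch using only the standard library; the function names, the tolerance of 1, and the sample scores are assumptions for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def agreement_rate(judge: Sequence[float], human: Sequence[float],
                   tolerance: float = 1) -> float:
    """Share of samples where judge and human differ by at most `tolerance`."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty score lists")
    close = sum(1 for j, h in zip(judge, human) if abs(j - h) <= tolerance)
    return close / len(judge)


# Made-up spot-check sample: 1-5 scores on the same eight outputs.
judge_scores = [5, 4, 2, 5, 3, 1, 4, 2]
human_scores = [5, 3, 2, 4, 3, 1, 5, 4]

print(f"correlation: {pearson(judge_scores, human_scores):.2f}")
print(f"within 1 point: {agreement_rate(judge_scores, human_scores):.0%}")
```

Tracking both is a deliberate choice: correlation shows whether the judge ranks outputs the way a human does, while the agreement rate exposes a judge that ranks well but is systematically harsher or more lenient.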

How does evaluation fit into CI?

Run the full eval on main, run a fast subset on every PR, and fail the build when a core metric drops below threshold. The pattern is closer to integration testing than unit testing, and the course has a ready-to-adapt GitHub Actions setup.
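
The "fail the build" step usually comes down to comparing a handful of metrics against minimums and returning a nonzero exit code. The sketch below is not the course's GitHub Actions setup; it is a hypothetical gate with made-up metric names and threshold values, shown only to illustrate the idea.

```python
from typing import Dict, List

# Hypothetical thresholds; choose values from your own baseline runs.
THRESHOLDS = {"faithfulness": 0.85, "exact_match": 0.70}


def find_regressions(metrics: Dict[str, float],
                     thresholds: Dict[str, float]) -> List[str]:
    """Return a message for each core metric that is missing or below threshold."""
    problems = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            problems.append(f"{name}: missing from eval results")
        elif value < minimum:
            problems.append(f"{name}: {value:.2f} is below threshold {minimum:.2f}")
    return problems


def gate(metrics: Dict[str, float],
         thresholds: Dict[str, float] = THRESHOLDS) -> int:
    """Print any regressions and return a process exit code (0 = pass, 1 = fail)."""
    problems = find_regressions(metrics, thresholds)
    for line in problems:
        print(f"EVAL REGRESSION {line}")
    return 1 if problems else 0
```

In a CI job, a small script would load the metrics produced by the eval run and finish with `sys.exit(gate(metrics))`; a nonzero exit code is what marks the step as failed. Treating a missing metric as a failure is a conservative assumption, so a broken eval run cannot pass silently.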

Skill track

LLM Evaluation courses

Evaluation is the quiet skill that separates engineers who ship LLM apps from engineers who demo them. Without a real evaluation harness, every prompt change is a vibe-check, every model upgrade is a coin toss, and every RAG tweak is a hope. With one, you change the system on purpose and know the next morning whether it got better.

Curated by Param Harrison

These courses teach the practical version. Build a fixture set of representative questions, pick measurable output properties (exact match, semantic similarity, rubric score, tool-call sequence), run RAGAS or a custom LLM-as-judge, and wire it into CI so regressions block merges. Nothing academic; every pattern is tuned to a real failure mode you will actually hit.
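
Two of the output properties named above, exact match and tool-call sequence, can be checked with a few lines of plain Python. This is a sketch under assumptions: the normalization rules, the choice to allow extra tool calls between the expected ones, and the tool names in the usage note are illustrative, not taken from the courses.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(actual: str, expected: str) -> bool:
    """True when the two strings are equal after normalization."""
    return normalize(actual) == normalize(expected)


def tool_sequence_match(actual_calls: List[str], expected_calls: List[str]) -> bool:
    """True when the expected tools appear in order; extra calls in between are allowed."""
    remaining = iter(actual_calls)
    # `name in remaining` advances the iterator, so order is enforced.
    return all(name in remaining for name in expected_calls)
```

For example, `tool_sequence_match(["search", "rerank", "answer"], ["search", "answer"])` passes, while the same expected list against `["answer", "search"]` fails because the order is wrong. If your agent must call exactly the expected tools and nothing else, a plain list equality check is the stricter alternative.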

Advanced RAG with query rewriting and evaluation

Add query rewriting, sub-graphs, PII scrubbing, and RAGAS scoring to a production RAG pipeline.

Advanced · Pro

Agent evaluation techniques

Stop shipping agents you cannot defend. Learn the eval patterns LangSmith-backed teams actually use.

Advanced · Pro
