Your agent shipped, your traces look fine, and someone in standup just asked if it is actually good. You do not have an answer. This course gives you one. Build the eval datasets, the LLM-as-judge graders, the trajectory checks, and the production feedback loops that turn "looks alright" into a number you can defend.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Score AI agents the way teams who ship them actually do. Build eval datasets in LangSmith, grade answers and reasoning trajectories, run RAGAS on retrieval, and wire automated feedback into production traces. A guided curriculum with hands-on labs that turns "looks alright" into a number you can defend.
Stop shipping agents you cannot defend. Learn the eval patterns LangSmith-backed teams actually use.
What you'll ship
What you'll learn
Curriculum
Foundations and dataset thinking
Why agent eval breaks every assumption from unit testing, how LangSmith reframes the problem, and how to build the first dataset that makes everything downstream possible.
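A minimal sketch of what that first dataset looks like in code, using the LangSmith Python client. The dataset name and example rows are placeholders, not part of the course material:

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Hypothetical golden set: a handful of questions with reference answers.
dataset = client.create_dataset(
    "support-agent-golden-set",
    description="Hand-curated questions with reference answers for regression runs",
)

client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "Which plan includes SSO?"},
    ],
    outputs=[
        {"answer": "Use the 'Forgot password' link on the sign-in page."},
        {"answer": "SSO is included in the Enterprise plan."},
    ],
    dataset_id=dataset.id,
)
```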
Score the answer
Pick the right grader for the output shape. Exact match for facts, LLM-as-judge for prose, structured data evaluators for JSON. Stop misranking your agent because the grader did not fit the answer.
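To make "grader fits the output shape" concrete, here is a hedged sketch of two evaluators in the classic LangSmith style, where each grader takes a run and a dataset example and returns a key plus a score. The `answer` fields on `run.outputs` and `example.outputs`, and the judge model name, are assumptions about your agent's schema:

```python
from openai import OpenAI

judge = OpenAI()

def exact_match(run, example):
    # Exact match: the right grader for short factual answers.
    got = run.outputs["answer"].strip().lower()
    want = example.outputs["answer"].strip().lower()
    return {"key": "exact_match", "score": int(got == want)}

def llm_as_judge(run, example):
    # LLM-as-judge: the right grader for prose, where token-level equality misranks good answers.
    prompt = (
        f"Reference answer:\n{example.outputs['answer']}\n\n"
        f"Candidate answer:\n{run.outputs['answer']}\n\n"
        "Reply 1 if the candidate is factually consistent with the reference, else 0."
    )
    verdict = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    return {"key": "judge_correctness", "score": int(verdict.startswith("1"))}
```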
Score the reasoning
A right answer can come from broken reasoning. Catch silent regressions with dynamic ground truth, trajectory evaluation, and tool-precision metrics that score the path the agent took.
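As a sketch of one trajectory metric: compare the tools the agent actually called against the tools the curated example says it should have used, and report precision over the calls made. The `intermediate_steps` and `expected_tools` field names are assumptions about how your agent logs its steps:

```python
def tool_precision(run, example):
    # Tools the agent actually invoked, in order, pulled from its logged steps.
    called = [step["tool"] for step in run.outputs.get("intermediate_steps", [])]
    # Tools a correct trajectory should use, stored on the dataset example.
    expected = set(example.outputs.get("expected_tools", []))
    if not called:
        return {"key": "tool_precision", "score": 0.0}
    # Fraction of calls on the expected path: 1.0 means no wasted or wrong tool calls.
    return {"key": "tool_precision", "score": sum(t in expected for t in called) / len(called)}
```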
Score the retrieval
Retrieval is where most RAG agents quietly break. Isolate the generator with component-wise eval, run the RAGAS scorecard end-to-end, and use pairwise comparison for the close-call decisions aggregate scores miss.
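A minimal RAGAS run over a single retrieval-augmented answer might look like the sketch below. It uses the classic column names (question, answer, contexts, ground_truth), which newer RAGAS releases rename, so treat it as a pattern rather than a drop-in:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

rows = Dataset.from_dict({
    "question": ["Which plan includes SSO?"],
    "answer": ["SSO is included in the Enterprise plan."],
    "contexts": [["Enterprise plan features: SSO, audit logs, and a 99.9% SLA."]],
    "ground_truth": ["SSO ships with the Enterprise plan."],
})

# Scores the generator (faithfulness, relevancy) and the retriever (precision, recall) separately.
scores = evaluate(rows, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)
```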
Run evals in production
Move beyond offline scoring. Attach real-time evaluators as callbacks, benchmark conversational agents with simulated users, and run scheduled batch jobs that enrich every production trace with quality scores.
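The scheduled-batch-job pattern, sketched with the LangSmith client: pull yesterday's production traces and attach a score to each as feedback. The project name and the quality heuristic are placeholders; in practice the scoring step is an LLM-as-judge or rubric evaluator:

```python
from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()

runs = client.list_runs(
    project_name="support-agent-prod",  # placeholder project name
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
)

for run in runs:
    answer = (run.outputs or {}).get("answer", "")
    # Placeholder heuristic; swap in a real evaluator before relying on the scores.
    score = 0.0 if not answer or "i'm not sure" in answer.lower() else 1.0
    client.create_feedback(run.id, key="auto_quality", score=score)
```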
Who it's for
Your agent works in demos and breaks in subtle ways for real users. You need a measurement layer that catches regressions before customers do.
You changed the system prompt. Did it actually help? Pairwise comparison and trajectory eval give you an answer; gut feel does not.
Multiple teams need a consistent way to grade their agents. You want a harness, dataset patterns, and CI integration you can hand to other teams.
You need a credible answer when leadership asks if the agent is improving. Continuous LangSmith dashboards and pairwise reports give you one.
FAQ
Advanced RAG Evaluation focuses on the retrieval pipeline using RAGAS and component-wise scoring. This course covers the full agent surface, including answer scoring, trajectory and tool-precision evaluation, simulated user benchmarks, and production feedback loops. The two complement each other, and you should take both if you ship retrieval-heavy agents.
You can follow with the example agent from the source repo, but you will get the most value if you bring your own agent code. The lessons explicitly show how to wire each technique into existing LangChain, LangGraph, or custom-loop agents.
Yes for the LangSmith-specific patterns, which is most of the course. The dataset, evaluator, and feedback APIs come from LangSmith. The free LangSmith tier is enough for the labs, and the patterns translate to other observability platforms (Phoenix, Langfuse) with adapters covered in the closing module.
Examples use OpenAI and Anthropic models. Any chat-completion provider works with minor swaps. The eval patterns themselves are model-agnostic.
Pricing
One subscription unlocks every paid course and workshop replay. Pick yearly or monthly.
Unlock with Pro
You save 47% with regional pricing
Billed annually. Cancel anytime.
Still deciding? Ask Param a question
Engineers who can defend their agent in a postmortem keep their agents in production. The rest get rewritten by the next team.
Agent evaluation techniques
From $16/mo with Pro