Choosing the LLM judge for evaluation pipelines
How to pick the LLM that grades your LLM. The cost-quality tradeoffs, the calibration check, and why a weaker judge is sometimes the right call.
Explore our latest articles and insights about Evaluation.
17 posts in total
Why ground truth and relevancy measure different things in RAG evals. When to use each, how to build both datasets, and the 2 metrics that matter most.
How to test a RAG pipeline for hallucinations systematically. Adversarial prompts, the out-of-scope set, and the metric that catches confabulation.
How to test a RAG pipeline like real software. Unit, integration, and eval tests that catch regressions before they ship. The 3-layer test strategy.
How to fact-check RAG answers with a second LLM pass that verifies every claim against the retrieved context. The prompt, the rejection rule, and the loop.
How to pick the right k value for your RAG retriever. The 3-step tuning process, the failure modes of k=3 and k=20, and the sweet spot in between.
How to debug a live agent incident using Langfuse traces. The search patterns, the 5-minute workflow, and the post-mortem that catches the root cause.
How to use Langfuse trace data to find where your agent burns tokens. The 4 queries, the cost-per-user view, and the 50 percent savings patterns.
How to combine Langfuse traces with Grafana dashboards for agent monitoring. The integration, the panels, and the alerting that catches real problems.
How to wire eval pipelines into CI so every agent change is scored automatically. The nightly job, the regression gate, and the dashboard that matters.
How to load evaluation metrics dynamically in a Python eval pipeline. The registry pattern, entry points, and the test override that makes CI fast.
Why LLM judges without explicit reasoning drift, and how chain-of-thought rationales make their scores defensible. The prompt, the parser, the trust.