Choosing the LLM judge for evaluation pipelines
How to pick the LLM that grades your LLM. The cost-quality tradeoffs, the calibration check, and why a weaker judge is sometimes the right call.
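The calibration check the teaser refers to is conventionally done by scoring a small human-labeled set with each candidate judge and comparing chance-corrected agreement. A minimal sketch, assuming the article's body (not included here) follows that pattern; the function names and label lists are illustrative, and the judge's verdicts would in practice come from a model call:

```python
from collections import Counter

def agreement(judge_labels, human_labels):
    """Fraction of items where the judge's verdict matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement: more honest than raw accuracy
    when one label dominates the eval set."""
    n = len(human_labels)
    p_o = agreement(judge_labels, human_labels)
    jc, hc = Counter(judge_labels), Counter(human_labels)
    # Expected agreement if the judge guessed according to its own label marginals
    p_e = sum((jc[k] / n) * (hc[k] / n) for k in set(jc) | set(hc))
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Illustrative data: human verdicts vs. a candidate judge's verdicts (1 = pass, 0 = fail)
human = [1, 1, 0, 1, 0]
judge = [1, 1, 0, 0, 0]
print(agreement(judge, human))    # raw agreement
print(cohens_kappa(judge, human)) # chance-corrected agreement
```

Running each candidate judge through this check is what makes the cost-quality tradeoff concrete: a cheaper judge whose kappa against human labels is close to the expensive one's is usually the right call.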
Explore our latest articles and insights about LLMs.
28 posts in total
How to test a RAG pipeline for hallucinations systematically. Adversarial prompts, the out-of-scope set, and the metric that catches confabulation.
How to fact-check RAG answers with a second LLM pass that verifies every claim against the retrieved context. The prompt, the rejection rule, and the loop.
How LLM-powered query rewriting fixes vague user questions before retrieval. The prompt, the multi-query fan-out, and when rewriting hurts more than helps.
How to filter irrelevant retrieved chunks with a cheap LLM call before the final answer. The prompt, the batch pattern, and the 40 percent noise reduction.
How to use Langfuse trace data to find where your agent burns tokens. The 4 queries, the cost-per-user view, and the 50 percent savings patterns.
Why LLM judges without explicit reasoning drift, and how chain-of-thought rationales make their scores defensible. The prompt, the parser, the trust.
How to build an LLM-as-a-judge evaluation framework for agentic AI. The prompt, the rubric, the bias controls, and the loop that catches regressions.
How circuit breakers prevent LLM outages from cascading through your agent. The 3 states, the failure window, and the 50-line implementation.
How to survive LLM provider outages with Tenacity retries and fallback models. The retry policy, the fallback chain, and the 60-line pattern.
How to manage context windows in production AI agents. The 4 strategies that keep long sessions bounded without losing critical context.
How to add chain-of-thought reasoning to a RAG pipeline. The prompt, the parsing, and the cases where CoT beats a straight answer by a wide margin.