
Evaluation

Explore our latest articles and insights about Evaluation.


17 posts in total

LLM Engineering

Choosing the LLM judge for evaluation pipelines

How to pick the LLM that grades your LLM. The cost-quality tradeoffs, the calibration check, and why a weaker judge is sometimes the right call.

Evaluation · LLM +3
8 min
LLM Engineering

Ground truth vs relevancy in RAG evaluation

Why ground truth and relevancy measure different things in RAG evals. When to use each, how to build both datasets, and the 2 metrics that matter most.

RAG · Evaluation +3
9 min
LLM Engineering

Hallucination testing for RAG pipelines

How to test a RAG pipeline for hallucinations systematically. Adversarial prompts, the out-of-scope set, and the metric that catches confabulation.

RAG · Evaluation +3
8 min
LLM Engineering

Testing and evaluating RAG pipelines end to end

How to test a RAG pipeline like real software. Unit, integration, and eval tests that catch regressions before they ship. The 3-layer test strategy.

RAG · Evaluation +3
8 min
LLM Engineering

Fact-checking RAG answers: grounding with verification

How to fact-check RAG answers with a second LLM pass that verifies every claim against the retrieved context. The prompt, the rejection rule, and the loop.

RAG · LLM +3
8 min
LLM Engineering

Retriever k-value tuning for RAG: the right top-k

How to pick the right k value for your RAG retriever. The 3-step tuning process, the failure modes of k=3 and k=20, and the sweet spot in between.

RAG · Vector Databases +3
8 min
AI Engineering in Practice

Real-time agent debugging with Langfuse traces

How to debug a live agent incident using Langfuse traces. The search patterns, the 5-minute workflow, and the post-mortem that catches the root cause.

Observability · AI Agents +3
8 min
AI Engineering in Practice

Agent cost optimization from trace data

How to use Langfuse trace data to find where your agent burns tokens. The 4 queries, the cost-per-user view, and the patterns behind 50 percent savings.

Observability · AI Agents +3
9 min
AI Engineering in Practice

Langfuse + Grafana: agentic AI monitoring

How to combine Langfuse traces with Grafana dashboards for agent monitoring. The integration, the panels, and the alerting that catches real problems.

Observability · AI Agents +3
8 min
LLM Engineering

Automated evaluation pipelines for agentic AI systems

How to wire eval pipelines into CI so every agent change is scored automatically. The nightly job, the regression gate, and the dashboard that matters.

Evaluation · AI Agents +3
8 min
LLM Engineering

Dynamic evaluation metric loading in Python

How to load evaluation metrics dynamically in a Python eval pipeline. The registry pattern, entry points, and the test override that makes CI fast.

Evaluation · Python +3
8 min
LLM Engineering

LLM judges: enforcing reasoning with explicit rationales

Why LLM judges drift without explicit reasoning, and how chain-of-thought rationales make their scores defensible. The prompt, the parser, the trust.

Evaluation · LLM +3
9 min
