Topic

Evaluation

Explore our latest articles and insights about Evaluation.

17 posts in total

LLM Engineering

Choosing the LLM judge for evaluation pipelines

How to pick the LLM that grades your LLM. The cost-quality tradeoffs, the calibration check, and why a weaker judge is sometimes the right call.

Ground truth vs relevancy in RAG evaluation

Why ground truth and relevancy measure different things in RAG evals. When to use each, how to build both datasets, and the 2 metrics that matter most.

Hallucination testing for RAG pipelines

How to test a RAG pipeline for hallucinations systematically. Adversarial prompts, the out-of-scope set, and the metric that catches confabulation.

Testing and evaluating RAG pipelines end to end

How to test a RAG pipeline like real software. Unit, integration, and eval tests that catch regressions before they ship. The 3-layer test strategy.

Fact-checking RAG answers: grounding with verification

How to fact-check RAG answers with a second LLM pass that verifies every claim against the retrieved context. The prompt, the rejection rule, and the loop.

Retriever k-value tuning for RAG: the right top-k

How to pick the right k value for your RAG retriever. The 3-step tuning process, the failure modes of k=3 and k=20, and the sweet spot in between.

RAG, Vector Databases +3

8 min

AI Engineering in Practice

Real-time agent debugging with Langfuse traces

How to debug a live agent incident using Langfuse traces. The search patterns, the 5-minute workflow, and the post-mortem that catches the root cause.

Observability, AI Agents +3

8 min

AI Engineering in Practice

Agent cost optimization from trace data

How to use Langfuse trace data to find where your agent burns tokens. The 4 queries, the cost-per-user view, and the 50 percent savings patterns.

Observability, AI Agents +3

9 min

AI Engineering in Practice

Langfuse + Grafana: agentic AI monitoring

How to combine Langfuse traces with Grafana dashboards for agent monitoring. The integration, the panels, and the alerting that catches real problems.

Observability, AI Agents +3

8 min

LLM Engineering

Automated evaluation pipelines for agentic AI systems

How to wire eval pipelines into CI so every agent change is scored automatically. The nightly job, the regression gate, and the dashboard that matters.

Evaluation, AI Agents +3

8 min

LLM Engineering