Your eval scores are garbage because you picked the wrong judge

You built an LLM-as-judge eval pipeline. You picked the cheapest model because "it's just grading." Scores came back noisy and inconsistent. Identical answers scored 0.7 on one run and 0.9 on the next. You cannot tell if your new prompt is better than the old one because the judge itself is drifting.

The fix is to pick the judge deliberately, not as an afterthought. Judge selection is a 3-axis problem: the judge has to be capable enough to grade accurately, stable enough to grade consistently, and cheap enough to run on every eval batch. Most teams get 1 of the 3 right and ship broken evals.

This post is the judge-selection pattern: the 3 criteria, the calibration check that tells you if a judge is good enough, how to mix a cheap judge with a flagship referee, and when a weaker model actually produces better evals than a stronger one.

Why does judge selection matter so much?

Because the judge's output becomes your ground truth. If the judge is inconsistent, your entire eval stack is noise. 3 specific failure modes of bad judge selection:

  1. High-variance scores. The same answer scores 0.7 and 0.9 on two runs. You cannot tell if a change improved things or the judge just rolled differently.
  2. Position bias. When comparing two answers, many models prefer whichever appears first, regardless of quality. Your A/B evals become coin flips.
  3. Capability mismatch. A weak judge cannot evaluate strong answers. It scores flagship-model output the same as a mediocre model's output because it cannot tell them apart.

The judge is a measurement instrument. A cheap ruler that wobbles 10 percent makes every measurement useless.

graph LR
    Gen[Generator LLM] --> Ans[Answer]
    Ref[Reference answer] --> Judge[Judge LLM]
    Ans --> Judge
    Judge --> Score[Score 0-1]

    Score --> Check{Stable?<br/>Accurate?<br/>Affordable?}
    Check -->|Yes| Ship[Use it]
    Check -->|No| Swap[Swap judge]

    style Judge fill:#dbeafe,stroke:#1e40af
    style Ship fill:#dcfce7,stroke:#15803d

What are the 3 axes of judge selection?

Capability

The judge has to be at least as capable as the generator you are grading. A Haiku judge cannot reliably grade Sonnet-produced answers because it cannot always tell what is correct. Rule of thumb: if the generator is Sonnet or stronger, use Sonnet or stronger as the judge.

Consistency

Run the same judge on the same query-answer pair 10 times. Measure the standard deviation of the score. If it exceeds 0.05, the judge is too noisy for reliable evals. Consistency matters more than absolute accuracy: a slightly biased judge that never drifts is better than an unbiased judge that wobbles.
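This check is worth scripting so it runs on every candidate judge. A minimal sketch, assuming a `judge_fn(question, answer) -> float` callable like the ones later in this post (the filename is hypothetical):

```python
# filename: app/eval/consistency.py
# description: Measure score variance for one judge on one query-answer pair.
from statistics import stdev


def consistency_check(judge_fn, question: str, answer: str, runs: int = 10) -> float:
    """Run the judge `runs` times on the same pair; return the sample stdev."""
    scores = [judge_fn(question, answer) for _ in range(runs)]
    return stdev(scores)
```

If the returned value exceeds 0.05, swap the judge before trusting any eval numbers it produces.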

Cost

Judges run on every eval batch, often 100-1000 times per day. A Sonnet judge at $3/1M input tokens on 500 daily evals of 2k tokens each costs ~$3/day. A Haiku judge on the same load is $0.25/day. Across a year the difference is $1k. Not huge, but it matters if you are running evals on every PR.
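The arithmetic is simple enough to keep as a helper next to your eval config. A sketch (the function name is mine, not from any library), counting input tokens only:

```python
def daily_judge_cost(evals_per_day: int, tokens_per_eval: int,
                     usd_per_million_input_tokens: float) -> float:
    """Daily judge spend, counting input tokens only."""
    daily_tokens = evals_per_day * tokens_per_eval
    return daily_tokens / 1_000_000 * usd_per_million_input_tokens


# 500 evals/day x 2k tokens at Sonnet's $3/1M input tokens
print(daily_judge_cost(500, 2_000, 3.00))  # 3.0
```

Output tokens cost extra, but judge outputs (a score plus a short rationale) are usually small enough that input tokens dominate.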

The sweet spot for most teams: Sonnet as the judge for correctness, Haiku for cheap checks like length and format.

How do you calibrate a judge?

Build a 20-row mini dataset with human-labeled scores. Run the candidate judge on the same 20 rows. Compute correlation.

# filename: app/eval/calibrate.py
# description: Calibrate a candidate judge against human labels.
from statistics import mean
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    dx = sqrt(sum((xi - mx) ** 2 for xi in x))
    dy = sqrt(sum((yi - my) ** 2 for yi in y))
    return num / (dx * dy) if dx * dy else 0.0


def calibrate(judge_fn, dataset: list[dict]) -> float:
    human_scores = [row["human_score"] for row in dataset]
    judge_scores = [judge_fn(row["question"], row["answer"]) for row in dataset]
    return pearson(human_scores, judge_scores)

Target correlation: 0.7 or higher. Below that, the judge is not reliable enough. Between 0.7 and 0.85, usable but monitor for drift. Above 0.85, ship it.
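These bands are worth encoding so every calibration run produces the same verdict. A sketch (the thresholds mirror the ones above; the helper function is mine):

```python
def judge_verdict(correlation: float) -> str:
    """Map a judge-vs-human Pearson correlation to a decision."""
    if correlation >= 0.85:
        return "ship"
    if correlation >= 0.7:
        return "usable, monitor for drift"
    return "swap judge"
```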

For the labeled-dataset pattern that supplies the 20 rows, see the Ground truth vs relevancy in RAG evaluation post.

When is a weaker judge actually better?

3 specific cases where Haiku beats Sonnet as a judge:

  1. Format checks. "Is the output valid JSON?" is really a parser's job, not a big model's. If a model is in the loop anyway, Haiku handles it for pennies.
  2. Length checks. "Is the response under 200 words?" is a count, not a judgment. Haiku is identical to Sonnet here.
  3. Narrow comparisons. "Which of these two answers is more concise?" is a single-axis judgment. Haiku does it cheaply and consistently.

Use Sonnet for multi-dimensional correctness judgments. Use Haiku for scalar checks with clear right answers.
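The first two cases arguably need no model at all. A sketch of deterministic versions using only the standard library:

```python
import json


def is_valid_json(text: str) -> bool:
    """Format check: does the output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def under_word_limit(text: str, limit: int = 200) -> bool:
    """Length check: is the response at most `limit` words?"""
    return len(text.split()) <= limit
```

Reserve the LLM judge for the third case, where "more concise" requires an actual judgment.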

How do you avoid position bias?

Two fixes that together eliminate most of the bias.

  1. Randomize order. When comparing answer A and answer B, randomize which appears first on each call. Average the scores across many calls.
  2. Swap-and-rerun. Call the judge twice, once with A first and once with B first. If the preferred answer differs, the judge is biased on this pair and the comparison is undecided.

# filename: app/eval/unbiased_compare.py
# description: Swap-and-rerun to detect position bias in pairwise comparisons.

def compare_unbiased(judge_fn, answer_a: str, answer_b: str) -> str:
    winner_ab = judge_fn(answer_a, answer_b)  # A first
    winner_ba = judge_fn(answer_b, answer_a)  # B first
    if winner_ab == winner_ba:
        return winner_ab
    return "tie"  # Judge is biased on this pair

Any pair that flips on swap is a tie. Do not count those as wins for either side.
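Fix 1 can be layered on top: randomize which answer the judge sees first and average over several calls. A sketch, assuming (as in `compare_unbiased`) that `judge_fn` returns the winning answer:

```python
import random


def averaged_preference(judge_fn, answer_a: str, answer_b: str,
                        trials: int = 20) -> float:
    """Fraction of trials answer_a wins, with presentation order randomized."""
    wins_a = 0
    for _ in range(trials):
        if random.random() < 0.5:
            winner = judge_fn(answer_a, answer_b)
        else:
            winner = judge_fn(answer_b, answer_a)  # order swapped
        wins_a += winner == answer_a
    return wins_a / trials
```

A score hovering near 0.5 from a judge you suspect is position-biased is a signal to fall back to swap-and-rerun rather than trust either direction.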

For the broader LLM-as-judge framework that uses this calibrated judge, see the LLM-as-a-judge production evaluation framework post.

What to do Monday morning

  1. List every eval metric in your pipeline. For each, write down the current judge model.
  2. Build a 20-row calibration dataset with human-labeled scores for your most important metric (usually correctness).
  3. Run 3 candidate judges on the 20 rows: Haiku, Sonnet, and one other (GPT-4o-mini, for example). Compute Pearson correlation against human labels.
  4. Pick the cheapest judge that scores above 0.8 correlation. If nothing passes, escalate to the next tier.
  5. For pairwise comparisons, add swap-and-rerun to detect position bias. Treat flips as ties.
  6. Re-calibrate every 3 months. Judge models drift across API updates.
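Step 6 is easier to keep up if the quarterly check is mechanical. A sketch of a drift flag (the thresholds are this post's; the helper and its defaults are mine):

```python
def needs_recalibration(baseline_corr: float, current_corr: float,
                        floor: float = 0.7, tolerance: float = 0.05) -> bool:
    """Flag a judge whose human correlation fell below the floor
    or dropped more than `tolerance` from its calibration baseline."""
    return current_corr < floor or (baseline_corr - current_corr) > tolerance
```

Store the baseline correlation from the original calibration run, rerun `calibrate` on the same 20 rows each quarter, and alert when this returns True.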

The headline: judge selection is a measurement instrument calibration problem, not a cost-cutting problem. Pick the cheapest judge that correlates with humans, validate it with a 20-row dataset, and monitor drift.

Frequently asked questions

Which LLM should I use as an eval judge?

Match the judge capability to the generator. For correctness judgments on Sonnet-generated answers, use Sonnet or stronger as the judge. For scalar checks like length or format, use Haiku. Calibrate the judge against a 20-row human-labeled dataset before shipping, target a Pearson correlation of at least 0.8.

How do I know if my judge is good enough?

Run a 20-row calibration dataset with human-labeled scores. Compute the Pearson correlation between judge scores and human scores. Above 0.85, ship it; between 0.7 and 0.85, use with caution; below 0.7, swap judges. Re-calibrate every 3 months to catch drift.

Is Haiku good enough as a judge?

For format checks, length checks, and simple yes/no questions, yes: Haiku is fast, cheap, and correlates well with humans. For multi-dimensional correctness judgments on complex answers, no: use Sonnet or stronger. The rule stands that the judge has to be at least as capable as the generator being graded.

What is position bias in LLM-as-judge?

Position bias is when the judge prefers whichever answer appears first regardless of quality. It affects every current model to some degree. Fix it by running each pairwise comparison twice with swapped order, and treating any pair where the preferred answer flips as a tie.

How much does an LLM judge cost?

Depends on the judge model and eval volume. Sonnet at $3/1M input on 500 daily 2k-token evals is ~$3/day or ~$90/month. Haiku on the same load is ~$7/month. The cost is not the bottleneck for most teams; the calibration and consistency are.

Key takeaways

  1. The judge is a measurement instrument. A noisy judge makes every eval useless, regardless of how good your other metrics are.
  2. Match judge capability to generator capability. A Haiku judge cannot reliably grade Sonnet-generated answers on multi-dimensional correctness.
  3. Calibrate every judge against a 20-row human-labeled dataset. Target Pearson correlation of 0.8 or higher.
  4. Detect position bias with swap-and-rerun. Any pair that flips is a tie.
  5. Use Haiku for scalar checks and Sonnet for correctness judgments. The split saves money without hurting accuracy.
  6. To see calibrated judges wired into a full production evaluation stack, walk through the Agentic RAG Masterclass, or start with the AI Agents Fundamentals primer.

For the research on LLM judge calibration and position bias, see the LLM-as-Judge survey paper.
