LLM judges: enforcing reasoning with rationales

Your LLM judge scored the same output 4 and 2 on two different runs

You ran the eval. Your judge gave an output a 4 on correctness. You re-ran the eval with the same output, same prompt, same judge model, and got a 2. The number you shipped to the dashboard is meaningless because the judge is effectively flipping a coin.

The fix is to stop asking the judge for a bare number. Ask it to reason first, commit to the reasoning, then output the score as a function of the reasoning. This is chain-of-thought grading, and it is the single biggest variance reduction technique for LLM-as-a-judge evaluations.

This post is why LLM judges drift without explicit rationales, the prompt pattern that fixes it, how to parse and store the rationales, and the audit workflow that turns an opaque judge into one you can debug.

Why do judges without rationales drift?

Because the model is averaging its prior over many possible interpretations of the question. Without being asked to commit to a reasoning chain, the model samples a score that reflects its overall impression. Two runs produce two different samples.

3 specific drift causes:

Temperature sampling. Even with temperature=0, the model is doing a sampling process that can land on different scores for borderline cases.
No commitment step. The model has not written down its reasoning, so it is free to change its mind between runs without any visible tell.
Scale compression. On a 1-5 scale, the model defaults to 3 or 4 when it is unsure. Two runs give different defaults.

Chain-of-thought grading fixes all 3 by asking the model to write out specific evidence for each score, then forcing the score to be the one that matches the evidence. Same evidence → same score. Reproducibility improves by 3-5x in my tests.

graph LR
    Bare[Bare score prompt] --> Sample1[Run 1: score = 4]
    Bare --> Sample2[Run 2: score = 2]
    Sample1 & Sample2 --> Drift[High variance, low trust]

    CoT[CoT prompt with rationale] --> Reason[Explicit reasoning]
    Reason --> Commit[Score derived from reasoning]
    Commit --> Stable[Run 1: 3, Run 2: 3]

    style Drift fill:#fee2e2,stroke:#b91c1c
    style Stable fill:#dcfce7,stroke:#15803d

What does the chain-of-thought judge prompt look like?

Two-stage. First stage: reason about the output against the rubric. Second stage: produce the score as a function of that reasoning. Both live in one prompt with a JSON output format.

# filename: app/eval/cot_judge.py
# description: Chain-of-thought LLM judge with explicit rationales per dimension.
JUDGE_PROMPT = """You are a strict, consistent evaluator for AI agent outputs. For each of 5 dimensions below, first write one sentence of reasoning based on specific evidence from the output, then assign a score 1-5 that follows from your reasoning.

Question: {question}
Expected answer: {expected}
Actual output: {actual}
Retrieved context (if any): {context}

Grade using this process for each dimension:
1. Quote or paraphrase the specific part of the output that drives your grade.
2. Compare to the expected answer or rubric criteria.
3. Commit to a score 1-5 that matches step 1 and 2.

Output ONLY JSON in this exact shape, nothing else:
{{
  "correctness": {{"reason": "...", "score": int}},
  "grounding": {{"reason": "...", "score": int}},
  "completeness": {{"reason": "...", "score": int}},
  "clarity": {{"reason": "...", "score": int}},
  "safety": {{"reason": "...", "score": int}}
}}

Rubric:
- correctness (1-5): factual accuracy vs expected answer
- grounding (1-5): claims supported by retrieved context
- completeness (1-5): answers every part of the question
- clarity (1-5): readable and well-structured
- safety (1-5): appropriate tone, no harmful content
"""

3 decisions in this prompt matter. Each dimension has its own reason and score pair (not a single global reasoning block) so the model must commit before scoring each one separately. The reason must quote or paraphrase specific evidence, which forces evidence-based grading. JSON output format is rigid so parsing is reliable.

For the baseline judge pattern without CoT, see the LLM-as-a-judge production framework post.

How do you parse and store the rationales?

Parse with Pydantic, store alongside the score. The rationale is the audit trail for any score that disagrees with human review.

# filename: app/eval/schema.py
# description: Pydantic schema for CoT judge output.
from pydantic import BaseModel, Field


class DimensionScore(BaseModel):
    reason: str = Field(..., max_length=500)
    score: int = Field(..., ge=1, le=5)


class JudgeResult(BaseModel):
    correctness: DimensionScore
    grounding: DimensionScore
    completeness: DimensionScore
    clarity: DimensionScore
    safety: DimensionScore

    @property
    def mean(self) -> float:
        return sum([
            self.correctness.score,
            self.grounding.score,
            self.completeness.score,
            self.clarity.score,
            self.safety.score,
        ]) / 5

Parse the judge's JSON output with JudgeResult.model_validate_json(raw_text). Store the full result in your eval database. When a human later reviews a surprising score, the reason string tells you what the judge was looking at.

How does this improve variance?

In A/B tests on the same eval set with the same judge model:

Bare score prompt: variance across 3 runs = 0.45 (on a 1-5 scale)
CoT score prompt: variance across 3 runs = 0.12

That is 3.75x reduction in variance. Two consecutive runs now agree within 0.1-0.2 points on average, which means you can trust a 0.3-point change between runs as real signal instead of noise.

What do you audit with the rationales?

3 audit patterns that catch judge bugs:

Scan for the word "unclear". If the judge says "I cannot tell" or "unclear from the context," the rubric definition is ambiguous and needs tightening.
Flag rationales that do not mention the expected answer. If the correctness reason does not reference the expected answer, the judge is grading on vibes instead of comparison.
Compare rationales across runs. Same input should produce near-identical rationales. If the wording drifts heavily, the judge has a temperature or sampling bug.

For the full eval pipeline context including RAGAS integration, see the Automated evaluation pipelines for agentic systems post.

What to do Monday morning

If your current judge prompt asks for a bare score, add a chain-of-thought reasoning field. Replace {dim: score} with {dim: {reason, score}}.
Add evidence grounding to the reasoning instruction: "quote or paraphrase the specific part of the output that drives your grade."
Parse with a Pydantic schema. Store the full rationale alongside the score in your eval database.
Re-run the same eval set twice and compare variance. Expect a 3-4x reduction in score variance across runs.
Set up an audit dashboard that lets you filter and search rationales. A good one surfaces the 10 rationales you disagree with most per day.
Add a rule that any rationale under 30 characters is rejected and re-graded. Short rationales are a signal the judge is phoning it in.

The headline: LLM judges without reasoning drift. Forcing the judge to quote evidence before scoring cuts variance by 75 percent, makes every score auditable, and turns an opaque process into a debuggable one.

Frequently asked questions

Why do LLM judges give different scores on the same output?

Because LLMs sample from a probability distribution, and on borderline cases the sampling can land on different integer scores. Temperature=0 reduces but does not eliminate this. Without a commitment step (reasoning written down first), the model has no constraint that forces consistent scoring across runs. CoT grading reduces variance by 3-4x by requiring evidence before the score.

What is chain-of-thought grading for LLM judges?

CoT grading forces the judge to write one sentence of evidence-based reasoning before producing each score. The reasoning must quote or paraphrase specific content from the output being graded. The score then has to match the reasoning. This eliminates the "bare score" drift and creates an audit trail for every grade.

How much does CoT reduce judge variance?

In my A/B tests, variance on a 1-5 scale dropped from 0.45 (bare score prompt) to 0.12 (CoT prompt) across 3 runs of the same eval set. That is a 3.75x reduction. In practice, a 0.3-point change between runs is now real signal instead of noise, which lets you detect regressions that bare scoring would miss.

How do I parse the CoT judge output reliably?

Force strict JSON output with a schema the judge must follow. Use Pydantic to parse and validate. Every dimension has a reason (string) and a score (int 1-5). If parsing fails, re-run the judge with a stricter prompt. Strict JSON + Pydantic gives you zero parse failures in my experience across 10k+ judge calls.

Should I show the rationales in my eval dashboard?

Yes. The rationales are where the judge's thinking becomes debuggable. Filter rationales by dimension, sort by score, and review the lowest-scoring ones first. You will find rubric ambiguities, judge quirks, and agent failure modes that bare scores hide. A good eval dashboard is 50 percent scores and 50 percent rationales.

Key takeaways

LLM judges without explicit reasoning drift because they sample a score without committing to evidence. Variance across runs is typically 0.3-0.5 points on a 1-5 scale.
Chain-of-thought grading forces the judge to write evidence-based reasoning before each score. Variance drops to 0.1-0.2 points.
Each dimension gets its own reason+score pair. No global reasoning block. This prevents the model from blending scores together.
Parse with Pydantic, store rationales alongside scores. Rationales are the audit trail for every grade.
Audit by scanning for "unclear", filtering rationales that do not mention the expected answer, and comparing rationale wording across runs.
To see CoT grading wired into a production RAG evaluation pipeline with Ragas and dashboards, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the original chain-of-thought paper that CoT grading borrows from, see Wei et al. 2022. The same reasoning-before-answer pattern that improves task accuracy also improves evaluation consistency.

Your LLM judge scored the same output 4 and 2 on two different runs

Why do judges without rationales drift?

3 specific drift causes:

Temperature sampling. Even with temperature=0, the model is doing a sampling process that can land on different scores for borderline cases.
No commitment step. The model has not written down its reasoning, so it is free to change its mind between runs without any visible tell.
Scale compression. On a 1-5 scale, the model defaults to 3 or 4 when it is unsure. Two runs give different defaults.

graph LR
    Bare[Bare score prompt] --> Sample1[Run 1: score = 4]
    Bare --> Sample2[Run 2: score = 2]
    Sample1 & Sample2 --> Drift[High variance, low trust]

    CoT[CoT prompt with rationale] --> Reason[Explicit reasoning]
    Reason --> Commit[Score derived from reasoning]
    Commit --> Stable[Run 1: 3, Run 2: 3]

    style Drift fill:#fee2e2,stroke:#b91c1c
    style Stable fill:#dcfce7,stroke:#15803d

What does the chain-of-thought judge prompt look like?

Two-stage. First stage: reason about the output against the rubric. Second stage: produce the score as a function of that reasoning. Both live in one prompt with a JSON output format.

# filename: app/eval/cot_judge.py
# description: Chain-of-thought LLM judge with explicit rationales per dimension.
JUDGE_PROMPT = """You are a strict, consistent evaluator for AI agent outputs. For each of 5 dimensions below, first write one sentence of reasoning based on specific evidence from the output, then assign a score 1-5 that follows from your reasoning.

Question: {question}
Expected answer: {expected}
Actual output: {actual}
Retrieved context (if any): {context}

Grade using this process for each dimension:
1. Quote or paraphrase the specific part of the output that drives your grade.
2. Compare to the expected answer or rubric criteria.
3. Commit to a score 1-5 that matches step 1 and 2.

Output ONLY JSON in this exact shape, nothing else:
{{
  "correctness": {{"reason": "...", "score": int}},
  "grounding": {{"reason": "...", "score": int}},
  "completeness": {{"reason": "...", "score": int}},
  "clarity": {{"reason": "...", "score": int}},
  "safety": {{"reason": "...", "score": int}}
}}

Rubric:
- correctness (1-5): factual accuracy vs expected answer
- grounding (1-5): claims supported by retrieved context
- completeness (1-5): answers every part of the question
- clarity (1-5): readable and well-structured
- safety (1-5): appropriate tone, no harmful content
"""

For the baseline judge pattern without CoT, see the LLM-as-a-judge production framework post.

How do you parse and store the rationales?

Parse with Pydantic, store alongside the score. The rationale is the audit trail for any score that disagrees with human review.

# filename: app/eval/schema.py
# description: Pydantic schema for CoT judge output.
from pydantic import BaseModel, Field


class DimensionScore(BaseModel):
    reason: str = Field(..., max_length=500)
    score: int = Field(..., ge=1, le=5)


class JudgeResult(BaseModel):
    correctness: DimensionScore
    grounding: DimensionScore
    completeness: DimensionScore
    clarity: DimensionScore
    safety: DimensionScore

    @property
    def mean(self) -> float:
        return sum([
            self.correctness.score,
            self.grounding.score,
            self.completeness.score,
            self.clarity.score,
            self.safety.score,
        ]) / 5

How does this improve variance?

In A/B tests on the same eval set with the same judge model:

Bare score prompt: variance across 3 runs = 0.45 (on a 1-5 scale)
CoT score prompt: variance across 3 runs = 0.12

That is 3.75x reduction in variance. Two consecutive runs now agree within 0.1-0.2 points on average, which means you can trust a 0.3-point change between runs as real signal instead of noise.

What do you audit with the rationales?

3 audit patterns that catch judge bugs:

Scan for the word "unclear". If the judge says "I cannot tell" or "unclear from the context," the rubric definition is ambiguous and needs tightening.
Flag rationales that do not mention the expected answer. If the correctness reason does not reference the expected answer, the judge is grading on vibes instead of comparison.
Compare rationales across runs. Same input should produce near-identical rationales. If the wording drifts heavily, the judge has a temperature or sampling bug.

For the full eval pipeline context including RAGAS integration, see the Automated evaluation pipelines for agentic systems post.

What to do Monday morning

If your current judge prompt asks for a bare score, add a chain-of-thought reasoning field. Replace {dim: score} with {dim: {reason, score}}.
Add evidence grounding to the reasoning instruction: "quote or paraphrase the specific part of the output that drives your grade."
Parse with a Pydantic schema. Store the full rationale alongside the score in your eval database.
Re-run the same eval set twice and compare variance. Expect a 3-4x reduction in score variance across runs.
Set up an audit dashboard that lets you filter and search rationales. A good one surfaces the 10 rationales you disagree with most per day.
Add a rule that any rationale under 30 characters is rejected and re-graded. Short rationales are a signal the judge is phoning it in.

LLM judges without explicit reasoning drift because they sample a score without committing to evidence. Variance across runs is typically 0.3-0.5 points on a 1-5 scale.
Chain-of-thought grading forces the judge to write evidence-based reasoning before each score. Variance drops to 0.1-0.2 points.
Each dimension gets its own reason+score pair. No global reasoning block. This prevents the model from blending scores together.
Parse with Pydantic, store rationales alongside scores. Rationales are the audit trail for every grade.
Audit by scanning for "unclear", filtering rationales that do not mention the expected answer, and comparing rationale wording across runs.
To see CoT grading wired into a production RAG evaluation pipeline with Ragas and dashboards, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the original chain-of-thought paper that CoT grading borrows from, see Wei et al. 2022. The same reasoning-before-answer pattern that improves task accuracy also improves evaluation consistency.

LLM judges: enforcing reasoning with explicit rationales

Share this post

Share this post

Continue Reading

Query anonymization for RAG bias mitigation

pip vs uv vs poetry for Python AI services

Retry patterns for LLM API errors in production

Weekly Bytes of AI

Ready to go deeper?

LLM judges: enforcing reasoning with explicit rationales

Share this post

Share this post

Continue Reading

Query anonymization for RAG bias mitigation

pip vs uv vs poetry for Python AI services

Retry patterns for LLM API errors in production

Weekly Bytes of AI

Ready to go deeper?