# JSON output parsing for RAG: grounding with Pydantic
## Your RAG answer looked right and was still wrong
You retrieve 5 chunks. You stuff them into a prompt that says "answer the question using only this context." You have no JSON output parser, no Pydantic schema, no grounding enforcement, just freeform prose. The model produces a fluent, confident answer. You ship it. A user catches a fact in the answer that is not in any of the chunks. The model invented it. The phrase "using only this context" did not stop it.
Every team that ships RAG hits this. The phrase-level fix (more emphatic prompts, "do not hallucinate," "stick to the context") never works for long. The structural fix is to stop accepting freeform prose as the answer. Instead, force the model to return a JSON object with explicit fields for the answer, the source quotes, and the confidence. Validate the JSON. Reject any answer whose quotes are not present verbatim in the retrieved context.
This post is the structured-output pattern that grounds RAG answers in the only thing the model is allowed to use: the retrieved text. By the end you will have a Pydantic schema, a prompt template, and a validation function that together close most of the hallucination surface in a typical RAG app.
## Why does freeform output lead to hallucinations?
Because freeform text gives the model permission to make things up that "sound right." When the only constraint is "answer in plain English," the model interpolates between the context and its prior knowledge. It does this without telling you. The result is often a sentence that mostly comes from the chunks plus one phrase from the model's training data that sounds plausible but is wrong.
```mermaid
graph TD
    Q[Question] --> R[Retriever]
    R -->|chunks| Naive[Naive RAG: stuff prompt]
    Naive -->|freeform answer| Mix[Mix of context and prior knowledge]
    Mix --> Bad[Confident wrong answer]
    R -->|chunks| Struct[Structured RAG: JSON schema]
    Struct -->|JSON with quotes| Validate[Validate quotes against chunks]
    Validate -->|all quotes verbatim?| Pass[Return answer]
    Validate -->|missing quote| Reject[Reject and retry or say I do not know]
    style Bad fill:#fee2e2,stroke:#b91c1c
    style Pass fill:#dcfce7,stroke:#15803d
    style Reject fill:#fef3c7,stroke:#b45309
```
The structural fix forces the model to commit. Every claim in the answer has to come with a source quote. Every source quote has to exist verbatim in the retrieved context. If the model wants to say something that is not in the chunks, it cannot, because it cannot produce a quote that does not exist. This is grounding by construction, not by polite request.
## What does the Pydantic schema look like?
The schema is small and opinionated. 3 fields: the answer, the supporting quotes, and a confidence flag. The quotes field is the load-bearing one.
```python
# filename: schemas.py
# description: The schema every RAG answer must conform to. The quotes
# field is what makes grounding enforceable.
from pydantic import BaseModel, Field


class GroundedAnswer(BaseModel):
    answer: str = Field(
        description='A direct answer to the question, using only the supporting quotes.'
    )
    supporting_quotes: list[str] = Field(
        description='Verbatim sentences from the context that justify the answer. Must be substrings of the provided context.'
    )
    can_answer: bool = Field(
        description='True if the context contains enough information to answer the question.'
    )
```
3 fields, 3 jobs. `answer` is for the user. `supporting_quotes` is for the validator. `can_answer` is the explicit "I do not know" channel. The schema gives the model a structured place to surface refusals instead of inventing an answer to be helpful.
The descriptions matter. When you pass this schema to the model (through tool use, JSON mode, or instructor), the field descriptions become part of the prompt. "Must be substrings of the provided context" is the rule the validator will enforce. Telling the model up front improves first-attempt accuracy.
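For concreteness, here is a hand-written sketch of the JSON Schema this model serializes to. The real output of `model_json_schema()` carries extra keys (`title` entries and so on); only the grounding-relevant structure is shown:

```python
# A trimmed, hand-written sketch of the JSON Schema that
# GroundedAnswer.model_json_schema() produces. Only the parts that
# matter for grounding are shown; the real output has more keys.
GROUNDED_ANSWER_SCHEMA = {
    'type': 'object',
    'properties': {
        'answer': {
            'type': 'string',
            'description': 'A direct answer to the question, using only the supporting quotes.',
        },
        'supporting_quotes': {
            'type': 'array',
            'items': {'type': 'string'},
            'description': 'Verbatim sentences from the context that justify the answer. Must be substrings of the provided context.',
        },
        'can_answer': {
            'type': 'boolean',
            'description': 'True if the context contains enough information to answer the question.',
        },
    },
    'required': ['answer', 'supporting_quotes', 'can_answer'],
}
```

This whole object, descriptions included, is what the model sees as its output contract.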
## How do you force the model to return this schema?
3 options, in order of how much I trust them in production.
The first option is the model's native structured output API. Anthropic's tool use, OpenAI's `response_format={'type': 'json_schema'}`, and Gemini's structured output all guarantee the response will conform to your schema at the API level. No regex parsing, no malformed JSON, no retries. Use this when your provider supports it. It is the cleanest and most reliable option.
```python
# filename: grounded_answer.py
# description: Use Anthropic tool use to force the model to return a
# GroundedAnswer object. The schema is the contract.
from anthropic import Anthropic

from schemas import GroundedAnswer

client = Anthropic()

ANSWER_TOOL = {
    'name': 'submit_answer',
    'description': 'Submit a grounded answer to the user question.',
    'input_schema': GroundedAnswer.model_json_schema(),
}


def grounded_answer(question: str, context: str) -> GroundedAnswer:
    reply = client.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=1024,
        tools=[ANSWER_TOOL],
        tool_choice={'type': 'tool', 'name': 'submit_answer'},
        messages=[{
            'role': 'user',
            'content': (
                f'Answer the question using only this context. '
                f'Every claim in your answer must be supported by a verbatim quote.\n\n'
                f'Context:\n{context}\n\nQuestion: {question}'
            ),
        }],
    )
    tool_input = next(b.input for b in reply.content if b.type == 'tool_use')
    return GroundedAnswer.model_validate(tool_input)
```
The `tool_choice` argument forces the model to call the tool, which means it must produce a valid `GroundedAnswer`. There is no path where it returns prose instead. This is the entire grounding contract in one API call.
The second option is the LangChain `JsonOutputParser` paired with a Pydantic schema. It works by injecting format instructions into the prompt and then parsing the response. It is more portable across providers but more failure-prone, because the model can still produce malformed JSON. Use it when you need provider neutrality.
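To make the fragility concrete, here is a minimal hand-rolled sketch of the inject-and-parse pattern. This is not LangChain's implementation; `FORMAT_INSTRUCTIONS` and `parse_model_json` are illustrative names:

```python
import json
import re

# Sketch of the inject-then-parse pattern: the schema travels as prompt
# text, and the reply has to be parsed back out. Illustrative only.
FORMAT_INSTRUCTIONS = (
    'Respond with a JSON object only, matching this shape:\n'
    '{"answer": str, "supporting_quotes": [str, ...], "can_answer": bool}'
)


def parse_model_json(raw: str) -> dict:
    """Parse JSON out of a model reply, tolerating markdown fences."""
    cleaned = raw.strip()
    # Models often wrap JSON in ```json ... ``` fences despite instructions.
    match = re.search(r'```(?:json)?\s*(.*?)\s*```', cleaned, flags=re.S)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)  # raises json.JSONDecodeError on malformed output
```

The fence-stripping regex is exactly the kind of accumulating workaround that native structured output APIs make unnecessary.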
The third option is regex. Do not use regex. People who use regex to parse LLM output end up writing a JSON parser inside their app. Use one of the first 2 options.
For a deeper walk-through of structured output patterns inside a production RAG pipeline, the Agentic RAG Masterclass covers grounding alongside reranking and self-correction. The free RAG Fundamentals primer is the right starting point if you are still building your first retrieval system.
## How do you validate that the quotes are actually in the context?
The schema is necessary but not sufficient. The model can still produce JSON with quotes that are slightly wrong, paraphrased, or hallucinated. The validation step is what closes the loop.
```python
# filename: validate.py
# description: Reject any answer whose supporting quotes are not present
# verbatim in the retrieved context. This is the grounding enforcement.
from grounded_answer import grounded_answer
from schemas import GroundedAnswer


def validate_grounding(answer: GroundedAnswer, context: str) -> tuple[bool, list[str]]:
    missing = [q for q in answer.supporting_quotes if q not in context]
    return (len(missing) == 0, missing)


def safe_answer(question: str, context: str) -> str:
    answer = grounded_answer(question, context)
    if not answer.can_answer:
        return 'I could not find this in the documents.'
    grounded, missing = validate_grounding(answer, context)
    if not grounded:
        # The model hallucinated quotes. Refuse the answer.
        return f'Refusing answer: {len(missing)} unsupported quote(s).'
    return answer.answer
```
3 branches, 3 cases handled. The model says it cannot answer (`can_answer == False`): respect that and tell the user. The model returns quotes that are not in the context: refuse the answer because the grounding contract was broken. Both checks pass: return the answer with confidence.
The substring check is intentionally strict. A paraphrased quote fails. A quote with a typo fails. This sounds harsh but it is the entire point. If you allow fuzzy matching, the model learns it can drift slightly and you are back to soft grounding. Strict substring matching is what makes the contract enforceable.
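A quick standalone check makes the strictness concrete (the sample strings are invented for illustration; the logic is the same substring test as in `validate.py`):

```python
# Standalone demo of the strict substring contract. A verbatim quote
# passes; a paraphrase, however close in meaning, fails.
context = 'The cache is invalidated every 300 seconds by a background job.'

verbatim = 'invalidated every 300 seconds'
paraphrase = 'invalidated every five minutes'

assert verbatim in context          # verbatim quote: grounded
assert paraphrase not in context    # paraphrase: rejected, by design
```

"Every five minutes" is semantically identical to "every 300 seconds," and the contract still rejects it. That is the trade the pattern makes on purpose.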
In practice, the model gets it right 90+ percent of the time on the first try when the schema and prompt are well written. The other 10 percent is your retriever or your prompt being wrong, not the model being malicious. Use the failure rate as a signal.
## What should you do when validation fails?
3 options, in order of safety.
Option 1: surface the failure to the user as "I could not find this in the documents." This is the safest and the one I default to. It builds user trust because the system admits ignorance instead of inventing.
Option 2: retry once with a sterner system prompt. "Your previous answer included quotes that were not in the context. Try again, and do not invent quotes." If the second attempt grounds correctly, return it. If not, fall back to option one.
Option 3: log the failure with the original question, retrieved context, and rejected answer. Use these logs to improve the retriever. Most of the time, validation failures point at retrieval gaps, not model failures. The model wanted to answer with information that was not retrieved. Fixing the retriever fixes the validation failure rate.
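Options 1 and 2 compose into a small wrapper. This is a sketch, not code from `validate.py`: `generate` stands in for any callable that produces an answer with quotes (such as the `grounded_answer` function above), modeled here with a plain dataclass so the control flow is visible:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    # Stand-in for GroundedAnswer, to keep the sketch self-contained.
    answer: str
    supporting_quotes: list[str]
    can_answer: bool = True


RETRY_NOTE = (
    'Your previous answer included quotes that were not in the context. '
    'Try again, and do not invent quotes.'
)


def answer_with_retry(
    generate: Callable[[str, str], Answer],  # hypothetical: (question, context) -> Answer
    question: str,
    context: str,
) -> str:
    for attempt in range(2):  # first try, then one retry with a sterner prompt
        q = question if attempt == 0 else f'{RETRY_NOTE}\n\n{question}'
        ans = generate(q, context)
        if not ans.can_answer:
            break
        if all(quote in context for quote in ans.supporting_quotes):
            return ans.answer  # every quote is verbatim: grounded
    # Fall back to option 1: admit ignorance instead of inventing.
    return 'I could not find this in the documents.'
```

Capping retries at one keeps latency bounded; if two attempts cannot ground the answer, the missing information is almost certainly not in the retrieved context.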
For the broader picture of how structured output and grounding fit into a production RAG service, see the System Design: Building a Production-Ready AI Chatbot post, which shows where the validation step lives relative to streaming and persistence.
## What to do Monday morning
- Define a `GroundedAnswer` Pydantic model in your RAG pipeline. 3 fields: `answer`, `supporting_quotes`, `can_answer`. Copy the one above.
- Switch your answer endpoint to use the model's native structured output API (tool use for Anthropic, `response_format` for OpenAI, structured output for Gemini). Force the model to return the schema.
- Add the substring validation step. Reject answers whose quotes are not verbatim in the retrieved context.
- Replace your generic "I do not know" fallback with the explicit `can_answer == False` branch. The model now has a structured way to refuse.
- Log every validation failure. Look at the rejected quotes weekly. Most of them will tell you that your retriever missed something, not that your model misbehaved.
The headline: stop trusting prose. JSON plus a Pydantic schema plus a substring check turns "please do not hallucinate" from a wish into a contract.
## Frequently asked questions
### What is JSON output parsing in LangChain?
It is a parser that takes the raw text from an LLM and converts it into a Python dict or a Pydantic model. The most common variant pairs the parser with a schema, injects format instructions into the prompt, and validates the response. It works across providers but can fail when the model returns malformed JSON. Native structured output APIs are more reliable when your provider supports them.
### How do you force an LLM to return JSON?
The cleanest way is to use the provider's native structured output API. Anthropic's tool use with `tool_choice` set to a specific tool, OpenAI's `response_format={'type': 'json_schema'}`, and Gemini's structured output all guarantee the response conforms to a schema you provide. These APIs do not return prose, only structured objects. Prompt-only approaches work but are not as reliable in production.
### How does Pydantic help with LLM output validation?
Pydantic gives you a typed schema with field descriptions and validation logic. You define the shape of the output once, the model receives the schema as context, and you call `model_validate` on the response to enforce the contract. Field descriptions become part of the prompt, so well-written ones improve first-attempt accuracy. Validation errors become exceptions you can handle deterministically.
### How does structured output reduce RAG hallucinations?
By giving the model an explicit place to say "I cannot answer" and an explicit place to cite supporting quotes. The validator then checks that each quote exists verbatim in the retrieved context. If a quote is missing or paraphrased, the answer is rejected. The model cannot produce a hallucinated claim that survives validation, because the claim has to be tied to a substring of the input.
### What should I do when an answer fails grounding validation?
3 options. Surface "I could not find this in the documents" to the user (safest). Retry once with a sterner prompt that flags the previous failure. Log the rejected quotes and feed them back into retriever evaluation, since most validation failures point at missing context rather than model misbehavior. Pick the option that matches your tolerance for false negatives versus user-visible refusals.
## Key takeaways
- Freeform RAG output silently mixes retrieved context with the model's prior knowledge. Structured output forces the model to commit to a specific shape and source.
- Use a small Pydantic schema with 3 fields: `answer`, `supporting_quotes`, `can_answer`. The quotes field is what makes grounding enforceable.
- Force the schema with the provider's native structured output API (tool use, `response_format`, etc.) instead of relying on prompt-level instructions.
- Validate that every quote in the response exists verbatim in the retrieved context. Strict substring matching is the point. Fuzzy matching defeats the contract.
- Most validation failures are retriever bugs in disguise. Log them and improve retrieval before blaming the model.
- To see grounding wired into a complete agentic RAG pipeline, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the official structured output reference, see the Anthropic tool use guide. The pattern in this post uses tool use as the structured output mechanism, which is the most reliable way to get a typed object back from Claude in production.