Your agent answered wrong and you have no idea why

A user screenshots a bad answer and posts it in Slack. You ask which session. You get the ID. You grep production logs. You find 40 lines of JSON that look nothing like the actual conversation flow. You spend an hour reconstructing what happened from log timestamps, missing context, and guesses. You give up and ask the user to try again.

This is life without agent observability. Agent calls are nested (LLM inside tool call inside planner inside session), distributed (multiple workers), and expensive (every trace represents real money). print() statements and structured logs that work for CRUD services break down completely. You need traces, not logs.

Langfuse is the open-source observability layer built for this. A trace per request, a span per step, per-span input/output/cost/latency. 15 minutes to wire in. Saves hours on every incident. This post is the trace hierarchy, the decorator pattern that captures everything without touching your business logic, and the per-span fields that matter.

Why aren't logs enough for agent services?

Because logs are a flat stream and agent calls are a tree. A single agent request can fan out into 5 tool calls, 10 LLM invocations, 3 database queries, and a handful of internal function calls, all interleaved in the log output. Reconstructing the tree from a flat log stream is a manual puzzle.

3 specific things logs cannot give you:

  1. Parent-child relationships. "Which LLM call was inside which tool call?" The log timestamps do not answer this; the call hierarchy does.
  2. Per-step latency and cost. You want to know which single step took 8 seconds, not that the total request took 15. Logs give you totals; traces give you breakdowns.
  3. Input and output pairs. For each LLM call, you want the prompt that went in and the response that came out, together. In logs, they are 2 separate lines that you have to join by hand, by request ID if you even logged one.
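To make the "manual puzzle" concrete, here is a toy sketch of rebuilding a call tree from flat log lines, assuming each line already carries a span ID and parent ID (a hypothetical log format; real logs rarely even have these fields). This is the bookkeeping a tracing SDK does for you:

```python
# Toy sketch: reconstruct a call tree from a flat log stream by joining
# on span_id / parent_id (hypothetical log format).
logs = [
    {"span_id": "a", "parent_id": None, "name": "agent_turn"},
    {"span_id": "b", "parent_id": "a", "name": "planner"},
    {"span_id": "c", "parent_id": "b", "name": "llm:plan"},
    {"span_id": "d", "parent_id": "a", "name": "tool:search"},
]

def build_tree(lines):
    # Group lines by parent, then walk down from the root.
    children = {}
    for line in lines:
        children.setdefault(line["parent_id"], []).append(line)

    def render(parent_id, depth=0):
        out = []
        for line in children.get(parent_id, []):
            out.append("  " * depth + line["name"])
            out.extend(render(line["span_id"], depth + 1))
        return out

    return render(None)

print("\n".join(build_tree(logs)))
# agent_turn
#   planner
#     llm:plan
#   tool:search
```

Four log lines is easy; 40 interleaved lines from concurrent requests is the hour-long reconstruction from the opening anecdote.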

Langfuse solves all 3. A trace has a tree of spans. Each span captures input, output, cost, latency, and metadata. The UI lets you click into any span and see the full context without reconstructing anything.

graph TD
    Trace[Trace: user request] --> S1[Span: planner]
    Trace --> S2[Span: retrieve]
    Trace --> S3[Span: generate]
    S1 --> L1[LLM call: plan]
    S2 --> T1[Tool: vector search]
    S2 --> L2[LLM call: rerank]
    S3 --> L3[LLM call: generate]

    style Trace fill:#dbeafe,stroke:#1e40af
    style L1 fill:#dcfce7,stroke:#15803d
    style L2 fill:#dcfce7,stroke:#15803d
    style L3 fill:#dcfce7,stroke:#15803d

One trace per user request. A handful of spans per trace. Nested hierarchy that matches the call structure. This is what you need to debug agent behavior.

How do you wire Langfuse into a Python agent service?

Install the SDK, initialize the client on startup, and wrap LLM and tool calls with the @observe decorator. The decorator automatically creates spans and captures inputs, outputs, and timings.

# filename: tracing.py
# description: Initialize Langfuse and decorate agent functions so every
# LLM call and tool call becomes a span in the trace.
from langfuse.decorators import langfuse_context, observe

# Configure the client used by the @observe decorator. Alternatively, set
# the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST environment
# variables and skip this call entirely.
langfuse_context.configure(
    public_key='pk-lf-...',
    secret_key='sk-lf-...',
    host='https://cloud.langfuse.com',
)


@observe(as_type='generation')
def call_llm(prompt: str, model: str = 'claude-sonnet-4-6') -> str:
    from anthropic import Anthropic
    client = Anthropic()
    reply = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return reply.content[0].text


@observe()
def run_tool(name: str, args: dict) -> str:
    from tools import dispatch
    return dispatch(name, args)


@observe()
def agent_turn(user_message: str) -> str:
    plan = call_llm(f'Plan steps for: {user_message}')
    result = run_tool('run_bash', {'command': 'pytest'})
    answer = call_llm(f'User asked: {user_message}\nPlan: {plan}\nTool result: {result}\nAnswer:')
    return answer

The @observe(as_type='generation') decorator on call_llm marks the span as an LLM generation, which unlocks token counting and cost tracking in the Langfuse UI. Regular @observe() on the other functions creates plain spans for non-LLM steps.

Everything else in your code stays the same. No explicit span IDs, no parent tracking, no manual timing. The decorator handles nesting automatically based on Python's call stack.
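To see why a decorator is enough, here is a minimal sketch of the mechanism, not Langfuse's actual internals: a context variable tracks the current span, so each decorated call records its caller as the parent.

```python
import contextvars
import functools

# Minimal sketch of call-stack-based span nesting (illustrative only,
# not Langfuse's implementation): a context variable holds the current
# parent span while a decorated function runs.
_current = contextvars.ContextVar("current_span", default=None)
spans = []

def observe_sketch(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _current.get()
        span = {"name": fn.__name__, "parent": parent["name"] if parent else None}
        spans.append(span)
        token = _current.set(span)  # this span is the parent of any nested call
        try:
            return fn(*args, **kwargs)
        finally:
            _current.reset(token)
    return wrapper

@observe_sketch
def call_llm(prompt):
    return "reply"

@observe_sketch
def agent_turn(msg):
    return call_llm(msg)

agent_turn("hi")
# spans now records agent_turn as the root and call_llm nested under it
```

Using contextvars rather than a global also keeps nesting correct under asyncio, where multiple requests share one thread.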

What fields should you log on each span?

4 fields on every span, plus a few LLM-specific ones on generation spans:

  1. Input. The exact prompt, tool arguments, or function input. This is the single most useful field for debugging.
  2. Output. What the function or LLM returned.
  3. Metadata. Any extra context: tenant ID, user ID, model name, temperature, request ID. These become filter keys in the UI.
  4. Error. If the call failed, the exception class and message. Failed spans are highlighted in the UI.

For generation spans (LLM calls), also log:

  1. Model and parameters. model=claude-sonnet-4-6, temperature=0.7, max_tokens=1024.
  2. Token counts. Input tokens, output tokens, cached tokens. Langfuse computes cost from these.
  3. Usage. Any vendor-specific usage fields that help with billing reconciliation.

Most of these are captured automatically by the decorator when you use the official Langfuse SDK wrappers for OpenAI, Anthropic, and LangChain. For other providers, you log them manually inside the span.
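What "log them manually" amounts to is attaching a small payload to the current generation span. A pure-Python sketch of assembling that payload (field names mirror the checklist above; the exact span-update call varies by Langfuse SDK version, so this only shows the shape):

```python
def generation_payload(model, params, prompt, completion, usage):
    # Sketch: the fields worth attaching to a generation span when your
    # provider has no official Langfuse wrapper. Pass the result to your
    # SDK version's span-update call.
    return {
        "model": model,
        "model_parameters": params,   # temperature, max_tokens, ...
        "input": prompt,
        "output": completion,
        "usage": usage,               # input/output/cached token counts
    }

payload = generation_payload(
    model="claude-sonnet-4-6",
    params={"temperature": 0.7, "max_tokens": 1024},
    prompt="Plan steps for: fix the failing test",
    completion="1. Run pytest ...",
    usage={"input_tokens": 42, "output_tokens": 120},
)
```

Keeping this in one helper means every hand-instrumented provider logs the same shape, so cost dashboards stay consistent.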

How do you correlate a trace to a user session?

Pass a session ID (your internal conversation or thread ID) into the trace at the top level. Langfuse lets you filter traces by session ID and see every request in a conversation as a single grouped view.

# filename: session_tracing.py
# description: Attach session and user metadata at the top of each trace
# so the Langfuse UI can group and filter correctly.
from langfuse.decorators import observe, langfuse_context


@observe()
def handle_request(user_id: str, session_id: str, message: str) -> str:
    langfuse_context.update_current_trace(
        user_id=user_id,
        session_id=session_id,
        metadata={'tenant_id': resolve_tenant(user_id)},  # your own tenant lookup
    )
    return agent_turn(message)

The update_current_trace call attaches the fields to the trace itself rather than to any single span. Langfuse indexes these and exposes them as filters in the UI. When a user reports a bad answer, you paste the session ID into the filter and see every trace for that conversation.

For the session and user model that provides the IDs, see the User and Session Models for Multi-Tenant AI Agents post. For the broader production stack picture, see System Design: Building a Production-Ready AI Chatbot.

What should you alert on?

3 alerts that catch real production issues:

  1. Error rate above 2 percent of traces. Normal agent services have near-zero hard errors. A spike usually means a downstream provider outage or a bug in a new deploy.
  2. Median trace latency above your SLA. Set the threshold based on your product's UX target. A 15-second median when the user expects 5 is a serious regression.
  3. Per-user cost anomalies. If one user's daily trace count is 10x their historical average, it is either a legitimate heavy user or a runaway loop. Langfuse per-user dashboards catch this.

You can wire these into any alerting system (PagerDuty, Slack, custom webhooks) using Langfuse's API. Start with the first 2 and add the cost alert once you have a week of baseline data to compare against.
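The first 2 alerts reduce to threshold checks over a window of trace summaries pulled from the Langfuse API. A minimal sketch, with thresholds and field names as placeholders for your own metrics pipeline:

```python
import statistics

def check_alerts(traces, error_rate_threshold=0.02, latency_sla_s=5.0):
    # traces: list of {"error": bool, "latency_s": float} summaries for
    # the alerting window. Returns human-readable fired alerts.
    alerts = []
    error_rate = sum(t["error"] for t in traces) / len(traces)
    if error_rate > error_rate_threshold:
        alerts.append(f"error rate {error_rate:.1%} above {error_rate_threshold:.0%}")
    median_latency = statistics.median(t["latency_s"] for t in traces)
    if median_latency > latency_sla_s:
        alerts.append(f"median latency {median_latency:.1f}s above SLA {latency_sla_s:.0f}s")
    return alerts

fired = check_alerts(
    [{"error": False, "latency_s": 2.0}] * 97
    + [{"error": True, "latency_s": 30.0}] * 3
)
# fired contains only the error-rate alert: 3 errors in 100 traces breaches
# the 2 percent threshold, while the median latency is still 2.0s.
```

Median, not mean, for the latency check: 3 slow outliers drag the mean to roughly 2.8s here but leave the median untouched, which is exactly why the error-rate alert exists as a separate signal.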

For the cost control side that pairs with observability, see the Rate Limiting FastAPI Agents: Token Buckets in Production post. Observability tells you when something went wrong; rate limiting prevents it from costing you $2000 before you notice.

How do you use traces for eval?

Pull production traces into your eval set. A trace already contains the question, the retrieved context, and the final answer, which is exactly what you need for LLM-based evaluation tools like RAGAS.

The pattern: filter traces by a tag like "needs_review" or by quality score, export them as JSON, and feed them to your eval pipeline. You can also use Langfuse's own annotation UI to label traces manually, then export the labeled set as training or evaluation data.
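The export step itself is a straightforward filter and projection over trace records. A sketch assuming traces have already been pulled as dicts (the field names are illustrative, not the Langfuse export schema):

```python
def eval_set_from_traces(traces, tag="needs_review"):
    # Keep only flagged traces and project the fields an eval tool like
    # RAGAS needs: the question, the retrieved context, and the answer.
    return [
        {
            "question": t["input"],
            "contexts": t.get("contexts", []),
            "answer": t["output"],
        }
        for t in traces
        if tag in t.get("tags", [])
    ]

traces = [
    {"input": "q1", "output": "a1", "tags": ["needs_review"], "contexts": ["doc"]},
    {"input": "q2", "output": "a2", "tags": []},
]

eval_set = eval_set_from_traces(traces)
# eval_set contains only the flagged q1 record, ready for an eval run
```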

This is how production feedback loops back into quality improvement. Users flag bad answers in the UI, the bad answers are tagged in Langfuse, and the next eval run uses those as hard examples. For the RAGAS side of this loop, see the RAGAS Evaluation for RAG Pipelines: A Practical Guide post.

What to do Monday morning

  1. Sign up for Langfuse (cloud or self-hosted) and grab the public and secret keys. 5 minutes.
  2. Add langfuse to your requirements and initialize the client in your FastAPI lifespan block.
  3. Decorate your top-level agent handler with @observe() and your LLM call function with @observe(as_type='generation'). Run a test request and confirm it shows up in the Langfuse UI.
  4. Add update_current_trace(user_id=..., session_id=...) at the top of each request to enable session grouping. This is the single most useful filter in the UI.
  5. Set up the 2 alerts: error rate above 2 percent and median latency above your SLA. Point them at Slack or your on-call rotation.

The headline: Langfuse is 15 minutes to wire in and hours saved on every production incident. Traces beat logs for agent services. Stop logging and start tracing.

Frequently asked questions

What is Langfuse?

Langfuse is an open-source observability platform built for LLM applications. It captures traces (per-request) and spans (per-step) with inputs, outputs, token counts, and costs. You can self-host it or use the managed cloud version. It integrates with OpenAI, Anthropic, LangChain, and LangGraph through decorators or wrappers, so adding it to an existing service is usually a few lines.

Why are traces better than logs for agent services?

Because agent calls are trees, not flat streams. A single request fans out into many nested LLM and tool calls, and logs cannot show the parent-child relationships without manual reconstruction. Traces capture the tree structure, the inputs and outputs together, and the per-step latency and cost, all of which you need to debug agent behavior efficiently.

How do I trace nested calls in a Python agent?

Decorate each function you want traced with @observe() and let the decorator handle nesting automatically based on the Python call stack. Langfuse keeps track of which decorated function called which, and builds the span hierarchy for you. For LLM calls specifically, use @observe(as_type='generation') to enable token and cost tracking.

What fields should I log on each span?

Input, output, metadata, and error for every span. For LLM spans, also log the model name, temperature, and input and output token counts. Input and output are the most useful fields for debugging because they let you reproduce the exact call without re-running the pipeline. Metadata becomes filter keys in the UI.

How do I correlate traces to user sessions?

Pass a session ID at the top of each trace with update_current_trace(session_id=...). Langfuse indexes the session ID and exposes it as a filter in the UI. When a user reports a bad answer, you paste the session ID into the filter and see every trace in that conversation, which makes incident response dramatically faster.

Key takeaways

  1. Agent calls are trees, not flat streams. Logs cannot show parent-child relationships; traces can. This is the one-line case for observability.
  2. Langfuse captures traces, spans, inputs, outputs, token counts, and costs. Wire it in with 3 decorators and move on.
  3. Use @observe(as_type='generation') on LLM calls to enable token and cost tracking. Use plain @observe() on tool and planning functions.
  4. Attach user ID and session ID at the top of every trace. These become the filters you use to debug bad answers reported by specific users.
  5. Alert on error rate and median latency. Pull flagged traces into your eval set to close the production-feedback loop.
  6. To see Langfuse wired into a full production agent stack with auth, streaming, and cost control, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.

For the full Langfuse documentation, decorator API, and integration guides, see the official Langfuse docs. The OpenAI and Anthropic SDK wrappers there capture generation spans automatically with zero additional code.
