Real-time agent debugging with Langfuse traces
The user reported a bug 3 minutes ago and you are still looking for the trace
Slack ping: "My chat just gave me a wrong answer." You open Langfuse, search by user ID, and filter to the last 10 minutes. 40 traces. Which one? You scroll through timestamps trying to match the user's description. 15 minutes later you find the trace, another 5 minutes to understand the failure, and 10 more to write the post-mortem. 30 minutes for one user complaint.
The fix is a real-time debugging workflow with specific Langfuse search patterns. Find any trace in 30 seconds by user ID + session ID + time window. Identify the failing step in 1 minute by filtering observations for errors. Write the root cause in 2 minutes with a templated post-mortem. 5 minutes total from report to resolution.
This post is the real-time debugging workflow for Langfuse: the search shortcuts, the failure taxonomy, the 5-minute incident playbook, and the post-mortem template that closes the loop with users and captures lessons for prevention.
Why is agent debugging slower than web app debugging?
Because every agent turn carries rich context (prompts, retrievals, tool calls, model outputs) and the failure mode is usually soft: a wrong answer, not a stack trace. Traditional log-grep does not work because there is no obvious error to grep for. 3 specific ways traditional debugging breaks down:
- No stack trace for a wrong answer. The agent returned a response; nothing errored. The issue is semantic, not technical.
- Too much context per request. A single trace has 10+ spans, each with its own prompt and output. Searching through them manually is slow.
- No user-facing id. Users report "my last message" but there is no human-readable trace ID they can give you.
Langfuse fixes all 3 by giving you structured traces with full context, searchable by user + session + time, filterable by status, and linkable via URL.
graph TD
User[User reports bad answer] --> Find[Find trace in 30s]
Find --> Inspect[Inspect spans in 1min]
Inspect --> Failure{Failure type?}
Failure -->|retrieval miss| Fix1[Fix: chunking or retriever]
Failure -->|wrong tool call| Fix2[Fix: tool schema or prompt]
Failure -->|LLM hallucination| Fix3[Fix: grounding or prompt]
Failure -->|timeout| Fix4[Fix: circuit breaker]
style Find fill:#dbeafe,stroke:#1e40af
style Failure fill:#fef3c7,stroke:#b45309
What is the 30-second search pattern?
Use Langfuse's filter URL parameters. Bookmark a "last 10 minutes for this user" template that takes the user ID and expands into a full search.
https://langfuse.yourservice.com/traces?
userId=USER_ID
&fromTimestamp=NOW-10MIN
&toTimestamp=NOW
&orderBy=start_time_desc
Paste the user ID from the support ticket, hit enter, see every trace for the last 10 minutes ranked by recency. The bad trace is almost always the first or second result.
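The template above can be turned into a small helper your runbook or incident bot calls with just the user ID. A minimal sketch: the hostname is the placeholder from the template, and the exact query-parameter names may differ across Langfuse versions, so treat them as assumptions to verify against your deployment.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Placeholder host from the template above; swap in your Langfuse URL.
BASE_URL = "https://langfuse.yourservice.com/traces"

def trace_search_url(user_id: str, window_minutes: int = 10) -> str:
    """Expand the 'last N minutes for this user' bookmark into a full URL."""
    now = datetime.now(timezone.utc)
    params = {
        "userId": user_id,
        "fromTimestamp": (now - timedelta(minutes=window_minutes)).isoformat(),
        "toTimestamp": now.isoformat(),
        "orderBy": "start_time_desc",  # most recent trace first
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

Wire this into the support-ticket tooling so the link is one click away: `trace_search_url("user-123")` gives you the ranked-by-recency view without touching the Langfuse search form.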
For the Langfuse integration that feeds this data, see the Langfuse integration for agentic AI tracing post.
What is the 1-minute inspection pattern?
Open the trace. Click each span in order. Read 3 things per span:
- Input prompt. Does it contain the right context?
- Output. Does it match what the user saw?
- Metadata. Duration, token count, model name. Any outlier?
The failure is usually in a single span. 90 percent of the time it is one of these 4:
- Retrieval returned wrong chunks. Input to the LLM does not contain the answer. Fix: retriever tuning.
- Tool call used wrong args. Tool ran but with bad arguments. Fix: tool schema or prompt.
- LLM hallucinated. Correct context, wrong output. Fix: grounding, stricter prompt, or model swap.
- Timeout or error. A span failed and the retry produced degraded output. Fix: circuit breaker or retry logic.
Recognize the pattern, jump to the fix. 1 minute per trace once you know the taxonomy.
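The taxonomy above can be encoded as a first-pass triage function. This is an illustrative sketch, not the Langfuse span schema: the field names (`status`, `timed_out`, `span_type`, `answer_in_context`, `args_valid`) are hypothetical flags you would derive while walking the spans.

```python
def classify_failure(span: dict) -> str:
    """Map an inspected span to one of the 4 failure types.

    Field names are illustrative, not Langfuse's schema: you fill them in
    as you read each span's input, output, and metadata.
    """
    if span.get("status") == "error" or span.get("timed_out"):
        return "timeout"          # fix: circuit breaker or retry logic
    if span.get("span_type") == "retrieval" and not span.get("answer_in_context"):
        return "retrieval_miss"   # fix: chunking or retriever tuning
    if span.get("span_type") == "tool" and not span.get("args_valid"):
        return "wrong_tool_args"  # fix: tool schema or prompt
    return "hallucination"        # correct context, wrong output: grounding or prompt
```

The order matters: hard failures (timeouts, errors) are checked first because they are visible in metadata alone, while hallucination is the residual diagnosis once context and arguments check out.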
What is the 5-minute incident playbook?
When a real incident fires, follow this sequence. No improvisation, no detours.
- Minute 0: Open Langfuse with the user's ID pre-filled.
- Minute 1: Identify the failing trace by timestamp and user description.
- Minute 2: Walk the spans. Identify the failing step (retrieval, tool, LLM, timeout).
- Minute 3: Write a one-sentence root cause in the incident channel.
- Minute 4: Decide: hotfix, rollback, or document-and-continue.
- Minute 5: If hotfix or rollback, trigger it. If document-and-continue, leave a note for the next standup.
The key rule: do NOT jump into the code until you have identified the failing span. Debugging blind wastes 20 minutes every time.
For the broader observability stack that makes this possible, see the Langfuse + Grafana agentic AI monitoring post.
What does the post-mortem template look like?
Keep it tight. 8 fields. Every incident uses the same format.
## Incident: [one-line summary]
**When:** [timestamp]
**Duration:** [user reported → resolved]
**Affected:** [number of users]
**What happened:**
[3 sentences describing the user-visible behavior]
**Root cause (from trace):**
[The specific span that failed and why. Include Langfuse trace link.]
**Fix:**
[What we did to resolve it, or what we plan to do. Link to PR if applicable.]
**Prevention:**
[One rule added to the rubric, one alert added, one test added. Pick at least one.]
Save every post-mortem in a shared folder. Review the prevention lessons every quarter. Most recurring incidents have prevention lessons from 3 months earlier that nobody implemented.
What to do Monday morning
- Bookmark the Langfuse "last 10 minutes for user" URL template. Save it in your team's runbook.
- Memorize the 4-failure taxonomy: retrieval miss, wrong tool args, hallucination, timeout. Map each to a specific fix.
- Run the 5-minute playbook on the next incident. Time yourself. If it takes longer than 5 minutes, figure out which step was slow and add a shortcut.
- Copy the post-mortem template into your incident channel bot. Every report should auto-populate with the template.
- Add a quarterly review of the prevention lessons across all post-mortems. Count how many were already lessons from earlier incidents. That is your prevention-debt number.
The headline: real-time agent debugging is a 5-minute workflow with 4 shortcuts. Langfuse search → span walk → failure classification → fix → post-mortem. Practice it on low-stakes incidents so it is muscle memory by the time a real one fires.
Frequently asked questions
Why is agent debugging harder than web app debugging?
Because agent failures are semantic, not technical. A wrong answer returned successfully has no stack trace to grep. Each trace has 10+ spans with rich context that takes minutes to read through. Langfuse gives you the structured trace view and search shortcuts that make semantic debugging tractable.
How do I find a trace by user ID quickly?
Bookmark a Langfuse URL template with ?userId=...&fromTimestamp=NOW-10MIN&orderBy=start_time_desc. Paste the user ID from the support ticket, hit enter, and see the most recent traces for that user. The bad trace is almost always the first or second result.
What are the 4 most common agent failure types?
Retrieval miss (wrong chunks returned), wrong tool call arguments, LLM hallucination (right context, wrong output), and timeout or error. These cover 90 percent of production failures. Learn to recognize each one in a trace and jump straight to the matching fix.
How long should incident triage take?
5 minutes from report to root cause identification. Under the playbook: 1 minute to find the trace, 1 minute to walk the spans, 1 minute to classify the failure, 1 minute to write the root cause, 1 minute to decide the fix path. Any longer means you are missing a shortcut or improvising.
What goes in a post-mortem?
Eight fields: summary, timestamp, duration, affected users, what happened, root cause (with trace link), fix, and prevention. The prevention field is the most important and most often skipped. Write one concrete action: a rubric rule, an alert, or a test. Review all post-mortems quarterly for unimplemented lessons.
Key takeaways
- Agent debugging is slow without structured traces because failures are semantic, not technical. Langfuse gives you the structured view that makes semantic debugging tractable.
- Bookmark a Langfuse URL template for "last 10 minutes for user". 30 seconds to find any trace from a support report.
- Walk the spans in order. 90 percent of failures fall into 4 types: retrieval miss, wrong tool args, hallucination, timeout. Learn the taxonomy.
- Follow the 5-minute playbook: find trace, walk spans, classify, root cause, fix decision. No improvisation.
- Post-mortem template with 8 fields. The prevention field is load-bearing: one rubric rule, alert, or test per incident.
- To see Langfuse debugging wired into a full production agent stack with evaluation and cost tracking, walk through the Agentic RAG Masterclass, or start with the AI Agents Fundamentals primer.
For the Langfuse documentation covering trace search, filtering, and integrations, see the Langfuse docs.
Continue Reading
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.