Langfuse + Grafana: agentic AI monitoring
You have Langfuse for traces, Prometheus for metrics, and no single view
You wired Langfuse into your agent and can now debug any individual trace. You also added Prometheus histograms for request rate and latency. Both are great in isolation. When a user complains at 2pm, you open Langfuse to find the trace, switch to Grafana to see the latency trend, then back to Langfuse to check the prompt, then to your email to find the user's query. 4 tabs, 20 clicks, and the context never fits on one screen.
The fix is to merge Langfuse trace data into Grafana as an additional data source. Traces become a panel in your main dashboard. Click a slow request in the latency chart, jump directly to the Langfuse trace. One tab, one workflow, one minute to root-cause.
This post is the Langfuse + Grafana integration pattern for agentic AI monitoring: the data source setup, the dashboard panels, the alert rules that fire before users notice, and the workflow that cuts debugging time from 20 minutes to 2.
Why split monitoring across Langfuse and Prometheus at all?
Because each tool is good at one thing. Langfuse is built for LLM-specific traces: prompts, completions, token counts, model versions, tool calls. Prometheus is built for numeric time series: latency histograms, request counts, error rates, resource usage. Using one for the other's job produces bad results.
3 specific failure modes of single-tool observability:
- Prometheus-only: you see p95 latency spike but have no way to drill into which prompt, which model, or which tool caused it.
- Langfuse-only: you see individual bad traces but no trend, no alerting, and no way to segment by dimension at scale.
- Tabs-only: you have both tools open in separate tabs with no correlation, and every debug session wastes 15 minutes switching contexts.
The Langfuse + Grafana integration fixes the tab-switching by making Langfuse trace links clickable from Grafana panels.
```mermaid
graph LR
    Agent[Agent service] --> Prom[(Prometheus)]
    Agent --> LF[(Langfuse)]
    Prom --> Panel1[Grafana: latency chart]
    LF --> Panel2[Grafana: trace table panel]
    LF --> Link[Clickable trace links]
    Panel1 & Panel2 & Link --> Dashboard[Unified Grafana dashboard]
    style Dashboard fill:#dcfce7,stroke:#15803d
```
How do you add Langfuse as a Grafana data source?
Langfuse stores its trace data in a Postgres database under the hood. Add that database as a Grafana PostgreSQL data source and query trace metadata directly.
```yaml
# filename: grafana-datasources.yml
# description: Grafana data source config for Langfuse Postgres.
apiVersion: 1
datasources:
  - name: Langfuse
    type: postgres
    url: langfuse-postgres:5432
    database: langfuse
    user: grafana_reader
    secureJsonData:
      password: ${LANGFUSE_DB_PASSWORD}
    jsonData:
      sslmode: 'require'
      postgresVersion: 1500
```
Use a read-only user for the Grafana connection. Never grant write access to the dashboarding tool.
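A minimal sketch of that read-only role, assuming Langfuse's tables live in the default public schema (the role name and password here are placeholders; adjust to your deployment):

```sql
-- Hypothetical setup for the grafana_reader role; run as a Postgres superuser.
CREATE USER grafana_reader WITH PASSWORD 'change-me';
GRANT CONNECT ON DATABASE langfuse TO grafana_reader;
GRANT USAGE ON SCHEMA public TO grafana_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO grafana_reader;
-- Cover tables created by future Langfuse migrations as well.
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO grafana_reader;
```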
For the Langfuse integration itself, see the Langfuse integration for agentic AI tracing post.
What PromQL + Langfuse queries do you run together?
3 queries that combine both data sources in a single dashboard.
Query 1: Langfuse-sourced latency heatmap by model
```sql
-- Grafana panel: Langfuse heatmap by model, last 1h
SELECT
  $__timeGroupAlias("start_time", $__interval),
  model,
  AVG(latency_ms) AS avg_latency
FROM observations
WHERE $__timeFilter("start_time")
GROUP BY 1, 2
ORDER BY 1
```
This gives you a heatmap showing which models are slow in which time windows. Much richer than a Prometheus histogram because it includes the model name directly from Langfuse.
Query 2: Prometheus error rate with trace deep-links
```promql
sum(rate(agent_turn_duration_seconds_count{status="error"}[5m])) by (endpoint)
```
Add a Grafana "data link" on this panel that jumps to a Langfuse trace search filtered by the same time window and status. Click any error spike, land on the failing traces.
Query 3: Cost per turn from Langfuse
```sql
-- Grafana panel: cost per turn over time
SELECT
  $__timeGroupAlias("start_time", $__interval),
  AVG(total_cost) AS avg_cost_per_trace
FROM traces
WHERE $__timeFilter("start_time")
GROUP BY 1
```
Langfuse tracks costs per trace in its database. Querying this via Grafana gives you a cost trend line next to your latency and error rate panels.
For the cost optimization deep-dive using trace data, see the Agent cost optimization from trace data post.
What panels should the unified dashboard have?
6 panels in a 2x3 grid:
- Request rate and error rate (Prometheus, line chart)
- p50 and p95 agent turn latency (Prometheus, line chart)
- Latency heatmap by model (Langfuse SQL, heatmap)
- Cost per turn trend (Langfuse SQL, line chart)
- Top 10 slowest traces right now (Langfuse SQL, table with trace ID links)
- Error rate by tool (Prometheus, bar chart)
Each panel answers a specific question. Together they cover the full agent health picture.
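The "top 10 slowest traces" panel is the one panel above without a query shown elsewhere in this post. A minimal sketch, reusing the column conventions from the earlier queries (the duration column is an assumption; verify names against your Langfuse schema version), with the trace_id column rendered as a Grafana data link into the Langfuse UI:

```sql
-- Grafana table panel: 10 slowest traces in the dashboard time range.
-- Assumes a per-trace duration column on the traces table.
SELECT
  id AS trace_id,
  name,
  duration AS duration_ms
FROM traces
WHERE $__timeFilter("start_time")
ORDER BY duration DESC
LIMIT 10
```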
What alert rules do you set?
4 alerts that combine both data sources effectively.
- Latency spike (Prometheus): p95 > 15s for 10 minutes → page on-call.
- Cost spike (Langfuse SQL via Grafana alerting): avg cost per trace > 2x yesterday's mean for 30 minutes → Slack.
- Model-specific error rate (Langfuse SQL): error rate for any model > 5 percent for 10 minutes → Slack + Langfuse trace link.
- Silent failure (Prometheus): request rate drops to 0 for 10 minutes → page (catches the case where the service is accepting no traffic at all).
Each alert includes a link to the relevant Langfuse trace search so the on-call engineer lands on the failing traces within 30 seconds.
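The two Prometheus-side rules can be sketched as a standard Prometheus alerting rules file. This is a sketch, not a drop-in config: the metric name follows the PromQL query earlier in the post, the thresholds follow the list above, and the Langfuse URL is a placeholder for your own host. The cost and model-error rules live in Grafana alerting against the Langfuse SQL data source instead.

```yaml
# filename: agent-alerts.yml (hypothetical)
groups:
  - name: agent-alerts
    rules:
      - alert: AgentLatencySpike
        # p95 turn latency above 15s for 10 minutes -> page on-call.
        expr: histogram_quantile(0.95, sum(rate(agent_turn_duration_seconds_bucket[5m])) by (le)) > 15
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Agent p95 turn latency above 15s"
          langfuse: "https://langfuse.yourservice.com/traces?status=error"
      - alert: AgentSilentFailure
        # Request rate at zero for 10 minutes. If the series can disappear
        # entirely (scrape target down), pair this with an absent() alert.
        expr: sum(rate(agent_turn_duration_seconds_count[5m])) == 0
        for: 10m
        labels:
          severity: page
```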
What to do Monday morning
- Add Langfuse's Postgres instance as a read-only Grafana data source. Use a dedicated grafana_reader role.
- Build the 6-panel unified dashboard. Mix Prometheus and Langfuse queries in the same view.
- Add Grafana data links on error rate and latency panels that jump to filtered Langfuse trace searches.
- Set the 4 alert rules. Include the Langfuse trace search URL in each alert annotation so on-call has instant context.
- Time a real debug session: from the moment a user reports a problem to the moment you find the root cause. If it takes longer than 5 minutes, your dashboard is missing a panel.
- Bookmark the unified dashboard as your team's on-call start page.
The headline: Langfuse + Grafana in one dashboard cuts debug time by 10x. 6 panels, 4 alerts, one tab. Stop switching between trace viewer and metrics dashboard.
Frequently asked questions
Why use both Langfuse and Prometheus?
Because each tool is optimized for a different layer. Langfuse captures LLM-specific data (prompts, completions, tokens, costs, model versions). Prometheus captures numeric time series (latency, error rate, resource usage). Langfuse is weak at aggregation and alerting; Prometheus is weak at LLM-specific context. Combine them for the strengths of both.
How do I add Langfuse as a Grafana data source?
Langfuse runs on Postgres under the hood. Add it as a Postgres data source in Grafana, pointing at the Langfuse database with a read-only user. You can then write SQL queries against the traces and observations tables to build panels. Never give Grafana write access to the Langfuse database.
What Langfuse tables should I query?
Two main ones: traces (one row per end-to-end request with total cost, duration, status) and observations (one row per LLM call, tool call, or span inside a trace, with per-call latency and tokens). traces is for request-level aggregation; observations is for drilling into individual LLM calls or tool invocations.
How do I create a clickable trace link from a Grafana panel?
Use Grafana's "Data Links" feature on a time series or table panel. Set the URL template to https://langfuse.yourservice.com/traces?from=${__from}&to=${__to}&status=error (or similar). Clicking any data point in the panel opens Langfuse pre-filtered to the matching time range and filters.
What's the right alerting strategy for combined monitoring?
Alert on outcomes, not causes. Latency spike, cost spike, error rate, and silent failure (zero request rate) are the 4 outcomes that matter. Each alert should include a link to the Langfuse trace search pre-filtered to the relevant time window so on-call can land on the failing traces in 30 seconds.
Key takeaways
- Langfuse and Prometheus cover different layers. Use both: Langfuse for LLM-specific context, Prometheus for numeric time series and alerting.
- Add Langfuse's Postgres as a read-only Grafana data source. Query traces and observations tables directly with SQL.
- Build a single 6-panel unified dashboard. Mix Prometheus and Langfuse queries. 2x3 grid covers the full agent health picture.
- Use Grafana data links to jump from a metrics panel to pre-filtered Langfuse trace searches. Cuts debug time from 20 minutes to 2.
- Alert on 4 outcomes: latency spike, cost spike, model-specific error rate, silent failure. Every alert includes a Langfuse trace link.
- To see Langfuse + Grafana wired into a full production agent stack with evaluation and cost tracking, walk through the Agentic RAG Masterclass, or start with the AI Agents Fundamentals primer.
For the Grafana data source documentation covering Postgres, data links, and alerting, see the Grafana data sources guide.