Your p95 latency is 8 seconds and you have no idea which step is slow

Users complain the agent is slow. Your dashboard shows p95 at 8 seconds. You open the trace viewer for one bad request and see the LLM call took 6 seconds, tool dispatch took 1.5 seconds, Redis took 300ms, and Postgres took 200ms. One trace. You have no idea if this is typical, which tool is slow, or whether the problem is consistent or bursty.

The fix is Prometheus metrics on every hop of the agent loop, segmented by the dimensions that matter: tool name, model name, prompt length bucket. With 4 well-designed histograms and 3 PromQL queries, you can answer "what is slow right now" in 30 seconds instead of digging through traces.

This post is the Prometheus performance analysis pattern for agentic AI: the 4 histograms that surface every bottleneck, the PromQL queries that rank them, and the alert rules that fire before users notice.

Why are traces alone insufficient for performance analysis?

Because traces are per-request and performance problems are statistical. A single slow trace tells you one request was slow. Metrics tell you 10 percent of requests are slow, and which tool is responsible. 3 specific failure modes of trace-only debugging:

  1. No aggregation. You cannot easily ask "what is the p95 tool dispatch latency across the last hour" from traces alone. You need to query a metrics store.
  2. No dimensionality. Traces do not slice easily by tool name, prompt length, or user tier without custom filtering.
  3. No alerting. You cannot alert on "tool X latency spiked" from traces. Alerts need metrics.

Prometheus histograms give you the aggregation, dimensionality, and alerting that traces lack, at a fraction of the storage cost.

graph LR
    Agent[Agent request] --> Span1[LLM call span]
    Agent --> Span2[Tool dispatch span]
    Agent --> Span3[Retrieval span]

    Span1 --> H1[llm_duration_seconds]
    Span2 --> H2["tool_duration_seconds tool=bash"]
    Span3 --> H3[retrieval_duration_seconds]
    Agent --> H4[agent_turn_duration_seconds]

    H1 & H2 & H3 & H4 --> Prom[(Prometheus)]
    Prom --> Dashboard[Grafana dashboard]
    Prom --> Alert[Alert rules]

    style Prom fill:#dbeafe,stroke:#1e40af
    style Dashboard fill:#dcfce7,stroke:#15803d

What 4 histograms cover every bottleneck?

  1. agent_turn_duration_seconds, the full request. Labels: endpoint, status.
  2. llm_duration_seconds, individual LLM calls. Labels: model, prompt_length_bucket.
  3. tool_duration_seconds, tool dispatch. Labels: tool name, status.
  4. retrieval_duration_seconds, vector search and context assembly. Labels: store, k (top-k value).

Together these cover the 4 stages of every agent turn. Latency issues will show up in one of them specifically, not across the board.

# filename: app/obs/metrics.py
# description: Prometheus histograms for agent performance analysis.
from prometheus_client import Histogram

AGENT_TURN = Histogram(
    'agent_turn_duration_seconds',
    'Total agent turn duration',
    ['endpoint', 'status'],
    buckets=(0.5, 1, 2, 5, 10, 20, 30, 60),
)

LLM_DURATION = Histogram(
    'llm_duration_seconds',
    'LLM call duration',
    ['model', 'prompt_length_bucket'],
    buckets=(0.5, 1, 2, 5, 10, 20, 30, 60),
)

TOOL_DURATION = Histogram(
    'tool_duration_seconds',
    'Tool dispatch duration',
    ['tool', 'status'],
    buckets=(0.05, 0.1, 0.5, 1, 2, 5, 10),
)

RETRIEVAL_DURATION = Histogram(
    'retrieval_duration_seconds',
    'Retrieval and context assembly duration',
    ['store', 'k'],
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5),
)

For the broader Prometheus setup including the FastAPI wiring, see the Prometheus metrics for agentic AI observability post.
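To show how these histograms get populated, here is a minimal sketch of a dispatch wrapper. The helper name and call shape are illustrative, not from the post's codebase, and the histogram is re-declared inline so the example is self-contained (in the real service you would import it from app/obs/metrics.py instead, since Prometheus rejects duplicate metric names in one process):

```python
# Sketch: recording tool dispatch duration into the histogram.
# TOOL_DURATION is re-declared here only to keep the example
# self-contained; import it from app/obs/metrics.py in practice.
import time
from prometheus_client import Histogram

TOOL_DURATION = Histogram(
    'tool_duration_seconds',
    'Tool dispatch duration',
    ['tool', 'status'],
    buckets=(0.05, 0.1, 0.5, 1, 2, 5, 10),
)

def dispatch_tool(name, fn, *args, **kwargs):
    """Run a tool and record its duration with tool and status labels."""
    start = time.perf_counter()
    status = 'ok'
    try:
        return fn(*args, **kwargs)
    except Exception:
        status = 'error'   # failed calls land in a separate label series
        raise
    finally:
        TOOL_DURATION.labels(tool=name, status=status).observe(
            time.perf_counter() - start
        )
```

The finally block guarantees an observation is recorded even when the tool raises, so error latency shows up in the same histogram under status="error".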

What are the 3 PromQL queries you actually use?

Query 1: which tool is slowest right now?

histogram_quantile(0.95,
  sum by (tool, le) (rate(tool_duration_seconds_bucket[5m]))
)

Returns the p95 tool duration per tool over the last 5 minutes. The le label must survive the aggregation or the quantile math breaks. Wrap the whole expression in topk(3, ...) to rank; the slowest tool is your first fix target.

Query 2: is latency worse this hour vs last week?

histogram_quantile(0.95,
  sum by (endpoint, le) (rate(agent_turn_duration_seconds_bucket[1h]))
)
-
histogram_quantile(0.95,
  sum by (endpoint, le) (rate(agent_turn_duration_seconds_bucket[1h] offset 1w))
)

Returns the difference in p95 between now and 1 week ago. A positive number means latency got worse. Segment by endpoint to find which route regressed.

Query 3: which prompt length bucket is driving LLM latency?

histogram_quantile(0.95,
  sum by (prompt_length_bucket, le) (rate(llm_duration_seconds_bucket[10m]))
)

Returns p95 LLM latency segmented by prompt length bucket (< 500 tokens, 500-2000, 2000-8000, > 8000). Confirms whether long prompts are the tail latency driver.

For the circuit breaker pattern that mitigates long LLM calls, see the Circuit breakers for LLM calls post.

What alert rules should you set?

4 alerts that catch real performance problems.

# filename: alerts.yml
# description: Prometheus alert rules for agent performance.
groups:
  - name: agent_performance
    rules:
      - alert: AgentP95LatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(agent_turn_duration_seconds_bucket[5m]))) > 15
        for: 10m
        annotations:
          summary: "Agent p95 latency above 15s for 10m"

      - alert: ToolLatencyRegression
        expr: |
          histogram_quantile(0.95, sum by (tool, le) (rate(tool_duration_seconds_bucket[10m])))
          > 2 * histogram_quantile(0.95, sum by (tool, le) (rate(tool_duration_seconds_bucket[10m] offset 1d)))
        for: 15m
        annotations:
          summary: "Tool {{ $labels.tool }} p95 doubled vs 1d ago"

      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.95, sum by (model, le) (rate(llm_duration_seconds_bucket[5m]))) > 10
        for: 5m
        annotations:
          summary: "LLM p95 latency for {{ $labels.model }} above 10s"

      - alert: LongPromptsDominating
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(llm_duration_seconds_bucket{prompt_length_bucket="8000+"}[5m])))
          > 3 * histogram_quantile(0.95, sum by (le) (rate(llm_duration_seconds_bucket{prompt_length_bucket="500-2000"}[5m])))
        for: 15m
        annotations:
          summary: "Long-prompt p95 is 3x the mid-bucket p95"

These fire on 4 specific conditions: overall latency too high, a specific tool regressed, LLM latency too high, long-prompt tail dominating. Each targets a different fix.

What dashboard layout works for agent performance?

4 panels in one row, one topic each:

  1. Request rate and error rate by endpoint (line chart, 1h window). The health check.
  2. p50 and p95 agent turn duration (line chart, 1h). The top-level latency signal.
  3. p95 tool duration by tool name (heatmap or bar chart). Which tool is slowest right now.
  4. LLM latency by prompt length bucket (line chart). Does tail latency come from long prompts.

4 panels, 4 questions. No more. A crowded dashboard is one nobody reads.

For the broader observability picture combining metrics with traces, see the Langfuse integration for agentic AI tracing post.

What to do Monday morning

  1. Add the 4 histograms to your service if they are not already there. Label by the dimensions that matter for your workload.
  2. Import the 3 PromQL queries into your Grafana dashboard. Name each panel clearly.
  3. Set the 4 alert rules. Tune the thresholds based on your current baseline, not generic defaults.
  4. Run a synthetic load test and watch the dashboard. Confirm every panel shows reasonable data and the alerts fire correctly on induced slowdowns.
  5. Add a weekly review ritual: 5 minutes on Monday morning looking at p95 latency week-over-week. Flag anything that moved by more than 20 percent.

The headline: Prometheus turns "users say it's slow" into "tool X p95 spiked 3x in the last hour". 4 histograms, 3 queries, 4 alerts. Ship it before the next performance incident, not after.

Frequently asked questions

Why do I need metrics if I already have traces?

Because traces are per-request and performance problems are statistical. Metrics aggregate across thousands of requests and surface which tool, prompt bucket, or endpoint is consistently slow. Traces are for deep-diving one request; metrics are for catching patterns and alerting before users notice.

What histogram buckets should I use for LLM call duration?

Roughly exponential buckets covering the expected range: (0.5, 1, 2, 5, 10, 20, 30, 60) seconds. If you serve sub-second cache hits, add a 0.1 bucket on the low end; the top bucket should cover 60-second long-prompt calls. Too few buckets hide tail latency; too many blow up cardinality. 8-10 buckets is the sweet spot.

How do I segment LLM latency by prompt length?

Add a prompt_length_bucket label to the histogram with 4 values: < 500, 500-2000, 2000-8000, 8000+ (in tokens). Compute the bucket at the start of each LLM call and pass it as a label. This lets you confirm whether long prompts are driving your tail latency or if the issue is elsewhere.
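As a minimal sketch, a hypothetical helper (the function name is illustrative) that maps a token count to the four label values described above:

```python
# Hypothetical helper: map a token count to the prompt_length_bucket
# label on llm_duration_seconds. Thresholds mirror the four buckets
# described in the post: <500, 500-2000, 2000-8000, 8000+.
def prompt_length_bucket(token_count: int) -> str:
    if token_count < 500:
        return '<500'
    if token_count < 2000:
        return '500-2000'
    if token_count < 8000:
        return '2000-8000'
    return '8000+'
```

Keeping the label to four fixed values is deliberate: labeling by raw token count would create one time series per distinct count and blow up cardinality.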

What PromQL function do I use for p95 latency?

histogram_quantile(0.95, sum by (le) (rate(<metric>_bucket[5m]))) computes the 95th percentile latency over the last 5 minutes from a histogram metric. To segment by a dimension, add it to the aggregation: sum by (tool, le) (...). The le label must always be kept; it carries the bucket boundaries the quantile is interpolated from. The rate(...[5m]) part converts the cumulative bucket counters into per-second rates suitable for quantile estimation.
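To make the quantile math concrete, here is a pure-Python sketch of the linear interpolation histogram_quantile performs inside the bucket containing the target rank (simplified: it ignores the +Inf bucket and other edge cases Prometheus handles):

```python
# Simplified sketch of Prometheus' histogram_quantile math.
# buckets: sorted list of (upper_bound, cumulative_count) pairs,
# mirroring the cumulative _bucket series a histogram exposes.
def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total  # how many observations fall at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # interpolate linearly inside the bucket that holds the rank
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count)
            )
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

This is why bucket layout matters: the estimate can only ever be as precise as the bucket the quantile lands in.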

How many Prometheus alerts is too many?

More than 10 alerts per service means nobody reads them. Start with 4-6 targeting the specific failure modes that actually page humans: latency, error rate, rate-limit exhaustion, and a per-tool regression alert. Add more only when a specific production incident teaches you a new alert is worth having.

Key takeaways

  1. Metrics surface performance patterns that traces miss. Use both: traces for per-request debug, metrics for aggregates and alerts.
  2. 4 histograms cover every agent bottleneck: turn duration, LLM duration, tool duration, retrieval duration. Label by the dimensions that matter.
  3. Three PromQL queries find slow tools, detect latency regressions, and confirm prompt-length tail latency. Put them in your dashboard.
  4. Set 4 alerts targeting specific failure modes: overall p95, tool regression, LLM latency, long-prompt tail. Tune thresholds to your baseline.
  5. Keep dashboards opinionated: 4 panels, 4 questions. Crowded dashboards get ignored.
  6. To see Prometheus wired into a full production agent stack with tracing and cost tracking, walk through the Build your own coding agent course, or start with the AI Agents Fundamentals primer.

For the Prometheus histogram and PromQL documentation covering buckets, quantile math, and rate calculations, see the Prometheus best practices guide. Every pattern in this post is explained there in more depth.
