Prometheus metrics for agentic AI observability
Your agent is slow and you have no dashboard
A user complains that the agent got slower over the last week. You check your logs. You grep for latency. You find nothing useful because logs are not the right shape for latency analysis. You end up asking the user "how slow" and guessing about the cause.
Traces (covered in the Langfuse Integration for Agentic AI Tracing post) are the right tool for per-request debugging. Prometheus metrics are the right tool for aggregate trends. You need both. This post is the metrics side: the 4 counters and histograms every agent service should expose, the dashboard layout, and the histogram bucket trap that makes p99 latency numbers lie.
Why do agent services need Prometheus metrics?
Because aggregate trends tell you things traces cannot. Traces are great at "why did this specific request fail." Metrics are great at "is anything trending badly over the last week." You need both or you will be blind to either incidents or regressions.
3 questions only metrics can answer:
- Is p95 latency climbing? Over 24 hours, over 7 days, compared to last week?
- Is the error rate higher on one worker than another? Across all requests, not just the ones you noticed?
- Are we paying more per user this month than last month? Cost per request over time, grouped by user or tenant?
All 3 are questions about distributions over time, which is exactly what Prometheus was built for. Traces can theoretically answer them too, but the query performance and retention costs make Prometheus the right tool for the job.
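Two of those questions translate directly into PromQL against the metrics defined later in this post (the `instance` label is attached automatically by Prometheus at scrape time; window sizes are illustrative):

```promql
# Is p95 latency climbing? p95 per endpoint; graph it over 7 days.
histogram_quantile(0.95,
  sum by (endpoint, le) (rate(agent_latency_seconds_bucket[1h])))

# Is the error rate higher on one worker? Error fraction, split by instance.
sum by (instance) (rate(agent_requests_total{status="error"}[5m]))
  / sum by (instance) (rate(agent_requests_total[5m]))
```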
graph TD
Req[Agent requests] --> Instr[Prometheus instrumentation]
Instr --> C1[Counter: request count]
Instr --> C2[Counter: errors]
Instr --> H1[Histogram: latency]
Instr --> H2[Histogram: tokens]
C1 --> Prom[Prometheus scraper]
C2 --> Prom
H1 --> Prom
H2 --> Prom
Prom --> Grafana[Grafana dashboard]
Prom --> Alert[Alert manager]
style Instr fill:#dbeafe,stroke:#1e40af
style Grafana fill:#dcfce7,stroke:#15803d
4 metric primitives cover everything agent services need to monitor.
What are the 4 metrics that matter for agent services?
1. Request counter
Count every request, labeled by endpoint, status, and tenant. This is the foundation for rate queries ("how many requests per second"), error rates ("error count divided by total count"), and traffic splitting ("what percentage of traffic goes to endpoint X").
# filename: metrics.py
# description: Define the 4 Prometheus metrics every agent service needs.
# Counters for requests and errors, histograms for latency and tokens.
from prometheus_client import Counter, Histogram
REQUESTS = Counter(
'agent_requests_total',
'Total agent requests',
['endpoint', 'status', 'tenant'],
)
ERRORS = Counter(
'agent_errors_total',
'Total agent errors by type',
['endpoint', 'error_type'],
)
LATENCY = Histogram(
'agent_latency_seconds',
'Agent request latency',
['endpoint'],
buckets=(0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 120.0),
)
TOKENS = Histogram(
'agent_tokens_total',
'Total tokens per request',
['endpoint', 'direction'],
buckets=(100, 500, 1000, 2500, 5000, 10000, 25000, 50000),
)
2. Error counter
Count errors separately from successful requests. Label by error type (timeout, provider_error, auth_error, rate_limit) so you can see which category is driving the error rate. Without type labels, you see that "errors are up" but cannot tell which root cause.
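One practical detail: raw exception class names make an unbounded set of `error_type` label values. A small classifier (a hypothetical helper, not part of any library) keeps the label to the fixed vocabulary above:

```python
# filename: error_types.py
# description: Collapse raw exceptions into a fixed set of error_type
# label values. Unbounded label values (one per exception class) inflate
# Prometheus cardinality; a fixed vocabulary keeps queries cheap.

def classify_error(exc: Exception) -> str:
    """Map an exception to one of a few stable error types."""
    name = type(exc).__name__.lower()
    if 'timeout' in name:
        return 'timeout'
    if 'auth' in name or 'permission' in name:
        return 'auth_error'
    if 'ratelimit' in name or 'rate_limit' in name:
        return 'rate_limit'
    return 'provider_error'  # catch-all for upstream failures

# In the middleware's except block, instead of type(exc).__name__:
#   ERRORS.labels(endpoint=endpoint, error_type=classify_error(exc)).inc()
```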
3. Latency histogram
Histograms let you compute p50, p95, p99 latencies over any time window. The bucket choice matters: buckets that are too coarse hide real latency differences, buckets that are too fine waste cardinality.
4. Token histogram
Agent requests vary wildly in token usage. A simple lookup might use 500 tokens; a complex multi-turn session might use 50000. Tracking the distribution tells you when users are hitting the top of the distribution (which usually costs money) and when the distribution is shifting over time.
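Recording the distribution is one `observe()` call per direction after each model call. The `usage` dict shape below (prompt_tokens / completion_tokens) follows the common OpenAI-style response format and is an assumption; adapt it to your provider:

```python
# filename: record_tokens.py
# description: Observe per-request token counts into the TOKENS histogram
# from metrics.py, split by direction so input and output drift show up
# separately on the dashboard.

def record_tokens(tokens_metric, endpoint: str, usage: dict) -> None:
    """Observe input and output token counts for one request."""
    # NOTE: the 'prompt_tokens'/'completion_tokens' keys are an assumption
    # based on OpenAI-style responses; rename them for your provider.
    tokens_metric.labels(endpoint=endpoint, direction='input').observe(
        usage.get('prompt_tokens', 0))
    tokens_metric.labels(endpoint=endpoint, direction='output').observe(
        usage.get('completion_tokens', 0))

# Usage after a model call, with TOKENS imported from metrics.py:
#   record_tokens(TOKENS, '/chat', response.usage)
```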
How do you wire metrics into a FastAPI agent?
A middleware that increments counters and observes histograms on every request. 30 lines total.
# filename: prometheus_middleware.py
# description: FastAPI middleware that records metrics for every agent
# request. Runs once per request, adds no noticeable latency.
import time
from fastapi import FastAPI, Request
from prometheus_client import make_asgi_app
from metrics import REQUESTS, ERRORS, LATENCY
async def prometheus_middleware(request: Request, call_next):
start = time.perf_counter()
endpoint = request.url.path
tenant = getattr(request.state, 'tenant_id', 'unknown')
try:
response = await call_next(request)
except Exception as exc:
ERRORS.labels(endpoint=endpoint, error_type=type(exc).__name__).inc()
REQUESTS.labels(endpoint=endpoint, status='error', tenant=str(tenant)).inc()
raise
duration = time.perf_counter() - start
LATENCY.labels(endpoint=endpoint).observe(duration)
REQUESTS.labels(
endpoint=endpoint,
status=str(response.status_code),
tenant=str(tenant),
).inc()
return response
app = FastAPI()
app.middleware('http')(prometheus_middleware)
app.mount('/metrics', make_asgi_app())
The /metrics endpoint exposes the Prometheus-format scrape target. Your Prometheus server scrapes it every 15 seconds (or whatever interval you set) and stores the time series.
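On the Prometheus side, a minimal scrape config might look like this (the job name and target address are placeholders for your deployment):

```yaml
# prometheus.yml -- minimal scrape config for the agent service (sketch).
scrape_configs:
  - job_name: agent-service          # placeholder name
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['agent-service:8000']   # placeholder host:port
```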
For the broader observability picture that pairs Prometheus with Langfuse traces, see the Langfuse Integration for Agentic AI Tracing post. Prometheus handles aggregate trends; Langfuse handles per-request forensics. You want both.
What is the histogram bucket trap?
Histogram buckets define the granularity of your latency percentiles. `histogram_quantile()` only knows which bucket each observation landed in and linearly interpolates inside that bucket. If your buckets are [0.1, 1, 10, 60], every request between 1 and 10 seconds looks identical: a real p95 of 3.5 seconds can report as 9 or more seconds, and latency quietly creeping from 2 to 3.5 seconds barely moves the reported number. The signal you care about is gone.
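You can reproduce the distortion in a few lines. The sketch below mimics the linear interpolation `histogram_quantile()` performs (simplified: it works on raw cumulative counts rather than rates), assuming 100 requests that each took exactly 3.5 seconds:

```python
# filename: bucket_trap.py
# description: Reproduce the linear interpolation Prometheus'
# histogram_quantile() performs, to show how coarse buckets distort p95.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted ascending.
    Mimics Prometheus' linear interpolation within the target bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Observations are assumed uniform inside this bucket.
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count))
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests, every single one took exactly 3.5 seconds.
coarse = [(0.1, 0), (1.0, 0), (10.0, 100), (60.0, 100)]
fine = [(0.1, 0), (1.0, 0), (2.5, 0), (5.0, 100), (10.0, 100)]

print(histogram_quantile(0.95, coarse))  # 9.55 -- wildly off
print(histogram_quantile(0.95, fine))    # 4.875 -- much closer to 3.5
```

With the coarse buckets, a service whose every request takes 3.5 seconds reports a p95 of 9.55 seconds; finer buckets around the real range shrink the error dramatically.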
3 rules for bucket choice:
- Cover the expected range. If your agent calls take 1 to 30 seconds, buckets should span that range. Leave a buffer at the top for outliers.
- Use logarithmic spacing. A 1-2-5 progression (0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50) gives good resolution at both ends of the range.
- Keep it under 15 buckets. More buckets means more cardinality in Prometheus, which slows queries and costs storage.
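If you would rather generate buckets than hand-pick them, a small helper (illustrative, not part of prometheus_client) following the 1-2-5 convention:

```python
# filename: log_buckets.py
# description: Generate 1-2-5 logarithmically spaced histogram buckets
# covering a given latency range. Illustrative helper, not a library API.
import math

def log_buckets(lo: float, hi: float) -> tuple:
    """1-2-5 spaced buckets from roughly lo up to hi."""
    out = []
    decade = 10.0 ** math.floor(math.log10(lo))
    while decade <= hi:
        for mult in (1, 2, 5):
            value = round(decade * mult, 10)  # avoid float artifacts
            if lo <= value <= hi:
                out.append(value)
        decade *= 10
    return tuple(out)

print(log_buckets(0.1, 60))
# (0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0)
```

Nine buckets spanning 0.1 to 60 seconds: comfortably under the 15-bucket cardinality budget while keeping resolution at both ends.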
The default buckets the Prometheus Python client ships with are tuned for web APIs (millisecond range) and are wrong for agent services. Override them explicitly using the buckets in the code above.
How do you build a good dashboard?
4 panels, each answering a different question.
- Request rate by endpoint. Line graph, 1-hour window. Shows traffic shape. Sudden drops mean clients disconnected; sudden spikes mean abuse.
- Error rate by endpoint. Line graph showing errors as a percentage of total requests. Any spike above 1 to 2 percent is a signal to investigate.
- Latency p50 and p95 by endpoint. 2 lines on the same graph. The gap between them shows the tail-latency story. Normal services have a small gap; ones with runaway outliers have a huge gap.
- Token usage p95 by endpoint. Tells you when requests start using more tokens than expected. Usually precedes a cost spike by a few days.
Add more panels only if you have specific questions to answer. Dashboards with 20 panels look impressive and get ignored. 4 focused panels get checked every morning.
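As a sketch, the 4 panels map to PromQL like this, using the metric names from metrics.py (5-minute rate windows are a common default; adjust to taste):

```promql
# 1. Request rate by endpoint
sum by (endpoint) (rate(agent_requests_total[5m]))

# 2. Error rate as a percentage of total requests
100 * sum by (endpoint) (rate(agent_errors_total[5m]))
  / sum by (endpoint) (rate(agent_requests_total[5m]))

# 3. Latency p50 and p95 (two queries, one graph)
histogram_quantile(0.50,
  sum by (endpoint, le) (rate(agent_latency_seconds_bucket[5m])))
histogram_quantile(0.95,
  sum by (endpoint, le) (rate(agent_latency_seconds_bucket[5m])))

# 4. Token usage p95
histogram_quantile(0.95,
  sum by (endpoint, le) (rate(agent_tokens_total_bucket[5m])))
```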
What should you alert on?
3 alerts that catch real production issues.
- Error rate above 2 percent for 5 minutes. Catches provider outages, bad deploys, and bugs that affect a subset of traffic.
- p95 latency above your SLA for 10 minutes. Set the SLA from user expectations, not from the current median. A service with p95 latency of 8 seconds when users expect 5 is broken even if nothing technically failed.
- Request rate below historical average by 50 percent for 15 minutes. Catches the case where a client started failing silently and stopped sending traffic. This is the "nobody is using us" alert that most teams forget.
Skip alerts on token usage unless you have a real budget problem; they fire too often and become noise.
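Expressed as Prometheus alerting rules, the 3 alerts might look like this (the 10-second SLA, rule names, and severities are placeholders to tune against your own baseline):

```yaml
# alerts.yml -- the 3 core alerts for an agent service (sketch).
groups:
  - name: agent-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(agent_requests_total{status="error"}[5m]))
            / sum(rate(agent_requests_total[5m])) > 0.02
        for: 5m
        labels: {severity: page}
      - alert: HighP95Latency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(agent_latency_seconds_bucket[5m]))) > 10
        for: 10m
        labels: {severity: page}
      - alert: TrafficDrop
        expr: |
          sum(rate(agent_requests_total[15m]))
            < 0.5 * sum(rate(agent_requests_total[15m] offset 1d))
        for: 15m
        labels: {severity: warn}
```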
For the broader production stack context, see the FastAPI and Uvicorn for Production Agentic AI Systems post. For the agent loop these metrics measure, see The Event Loop Inside a Coding Agent.
What to do Monday morning
- Install prometheus-client and add the 4 metric definitions from this post. 20 lines of code.
- Wire the middleware into your FastAPI app. Run a test request and curl /metrics to confirm the metrics are exposed.
- Set up Prometheus (or use a hosted service like Grafana Cloud) to scrape your /metrics endpoint every 15 seconds. Retention of 15 to 30 days is enough for most trend analysis.
- Build the 4-panel dashboard: request rate, error rate, latency, token usage. Share the dashboard link in your team channel so everyone can check it every morning.
- Set the 3 alerts: error rate, p95 latency, low traffic. Point them at Slack or your on-call rotation. Tune the thresholds from 2 weeks of baseline data before going live.
The headline: Prometheus metrics tell you aggregate trends that traces cannot. 4 metrics, 4 dashboard panels, 3 alerts. Ship them together and your team finally has a first answer to "is anything weird happening right now."
Frequently asked questions
What metrics should an agentic AI service expose to Prometheus?
4 metrics: a request counter, an error counter, a latency histogram, and a token usage histogram. Each is labeled by endpoint (and optionally tenant or user) so you can slice by route. These 4 metrics answer every aggregate-trend question you will have about agent behavior in production. Skip the dozens of extra metrics until you know why you need them.
Why do I need metrics in addition to traces?
Because traces are great at per-request forensics and bad at aggregate trends. "Did p95 latency climb this week" is a query that works on aggregated metrics and fails on traces. Conversely, "why did this specific user's request fail" is a query for traces, not metrics. Production observability needs both, not either.
What histogram buckets should I use for agent latency?
Buckets that span your expected range with logarithmic spacing. For agent calls that take 1 to 60 seconds, buckets of [0.1, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 120.0] give good resolution throughout. Default Python client buckets are tuned for millisecond-range web APIs and will hide real latency information on agent workloads.
How do I alert on agent service regressions?
3 core alerts: error rate above 2 percent for 5 minutes, p95 latency above your SLA for 10 minutes, and request rate below historical average by 50 percent for 15 minutes. Tune the thresholds from baseline data. Add more alerts only when you have specific questions; dashboards with 20 alerts become noise and get ignored.
Should I alert on token usage?
Only if you have a real budget problem. Token usage varies wildly by request type, and alerts on aggregate token usage fire too often to be actionable. Better pattern: dashboard the p95 token usage and look at it manually once a week to spot drift. Alert only on a per-user anomaly (one user using 10x their historical average).
Key takeaways
- Metrics answer aggregate questions that traces cannot. Production observability needs both Prometheus and a trace system, not one or the other.
- 4 metrics cover 95 percent of agent observability: request counter, error counter, latency histogram, token histogram. Skip the rest until you need them.
- Histogram bucket choice is the single biggest mistake in metric design. Use logarithmic buckets that span your expected latency range, not default web-API buckets.
- Build a 4-panel dashboard: request rate, error rate, latency p50/p95, token usage. More panels means less signal per panel and more fatigue.
- Alert on error rate, p95 latency, and traffic drops. Those 3 catch most production issues without generating alert fatigue.
- To see Prometheus wired into a full production agent stack with auth, streaming, and tracing, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.
For the official Prometheus Python client documentation and best practices for histogram and counter design, see the Prometheus Python client docs. The bucket selection guidance there matches the rules in this post and applies to any histogram metric.