Stress testing agentic AI systems beyond the laptop
Your agent works on your laptop and dies at 50 concurrent users
You tested your agent by sending curl requests one at a time. All 10 requests worked. You shipped. 50 users logged in at the same time. The service started returning 502s at user 47. You scaled up. It still failed. You rolled back.
The problem is that agent services behave non-linearly under concurrency. 1 request is fine. 10 requests are fine. 50 requests exhaust your connection pool, hit your LLM provider rate limit, run out of file descriptor table space, or pile up on a lock your tests never stressed. The only way to find these problems is to actually stress test the service with the concurrency pattern real traffic has.
This post is the stress test pattern for agentic AI: the load profile that matches real traffic, the metrics to capture, the 5 things that break first, and the reporting format that tells you what to fix.
Why are AI agents hard to load test?
Because each request holds resources much longer than a CRUD request and consumes a far more variable token volume. 3 specific failure modes you only see under concurrency:
- Pool exhaustion. A 30-second agent request holds a DB connection for the entire turn. 50 concurrent requests + a 20-connection pool = 30 requests waiting on a connection that never frees.
- Provider rate limits. OpenAI and Anthropic enforce per-minute token and request caps. Single-request testing never triggers them. Concurrent load hits them immediately and every call starts failing.
- LLM tail latency. The 99th percentile LLM call can be 10x the median. Without concurrency, you never see the tail. With concurrency, the tail dominates user experience.
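The pool-exhaustion arithmetic is easy to demonstrate without a database. A minimal sketch: a semaphore stands in for the connection pool, with the numbers from the first failure mode and the 30-second turn scaled down to 50 ms.

```python
# A semaphore stands in for the connection pool; numbers match the
# pool-exhaustion failure mode, with the 30-second turn scaled to 50 ms.
import asyncio

POOL_SIZE = 20      # connections in the pool
CONCURRENCY = 50    # simultaneous agent requests
TURN_TIME = 0.05    # scaled-down stand-in for a 30-second agent turn

async def agent_turn(pool: asyncio.Semaphore, waits: list[float]) -> None:
    loop = asyncio.get_running_loop()
    start = loop.time()
    async with pool:                       # hold a "connection" for the whole turn
        waits.append(loop.time() - start)  # how long this request queued for it
        await asyncio.sleep(TURN_TIME)

async def main() -> list[float]:
    pool = asyncio.Semaphore(POOL_SIZE)
    waits: list[float] = []
    await asyncio.gather(*(agent_turn(pool, waits) for _ in range(CONCURRENCY)))
    return waits

waits = asyncio.run(main())
queued = sum(1 for w in waits if w > TURN_TIME / 2)
print(f"{queued} of {CONCURRENCY} requests queued waiting for a connection")
```

The first 20 requests acquire a connection immediately; the remaining 30 queue for at least one full turn, exactly the 30-waiting figure above.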
A proper stress test simulates concurrent users sending varied prompts and measures what breaks first.
```mermaid
graph LR
    Load[Load generator: 100 virtual users] --> Agent[Agent service]
    Agent --> Pool[(DB pool)]
    Agent --> LLM[LLM provider]
    Agent --> Cache[(Redis)]
    Agent --> Metrics[Metrics:<br/>p50, p95, p99 latency<br/>error rate<br/>tokens per turn<br/>pool wait time]
    Metrics --> Report[Stress test report]
    style Load fill:#dbeafe,stroke:#1e40af
    style Report fill:#dcfce7,stroke:#15803d
```
What load profile matches real agent traffic?
3 patterns to run in sequence. Each simulates a different real-world scenario.
- Ramp: 1 virtual user → 100 users over 5 minutes. Catches linear degradation and identifies the concurrency level at which the service breaks.
- Burst: 0 → 50 users in 10 seconds, sustained for 2 minutes. Catches cold-start issues and provider rate limits.
- Sustained: 30 users for 20 minutes. Catches connection leaks, memory leaks, and slow degradation over time.
All 3 should run against a staging environment that mirrors production infrastructure.
How do you write the load test script?
Use Locust. It is Python-native, supports async, and integrates with pytest fixtures.
```python
# filename: tests/load/locustfile.py
# description: Locust load test for an agent service.
import random

from locust import HttpUser, task, between

PROMPTS = [
    'How do I set up a Redis cache for session data?',
    'Explain the difference between asyncio and threading in Python.',
    'Show me a production FastAPI setup with JWT auth.',
    'Why is my agent losing state between restarts?',
    'How does LangGraph checkpointing work?',
    # ... 50-100 real prompts sampled from production logs
]

class AgentUser(HttpUser):
    wait_time = between(2, 8)  # simulate realistic user pause

    def on_start(self):
        # Simulate login
        self.token = self.client.post('/auth/login', json={
            'email': '[email protected]', 'password': 'loadtest',
        }).json()['access_token']

    @task
    def chat(self):
        self.client.post(
            '/chat',
            json={
                'message': random.choice(PROMPTS),
                'session_id': f'load-{random.randint(1, 1000)}',
            },
            headers={'Authorization': f'Bearer {self.token}'},
            timeout=120,
        )
```
Run with:
```shell
locust -f tests/load/locustfile.py --host https://staging.yourservice.com --users 100 --spawn-rate 2
```
3 details that matter: prompts are sampled from real production logs, not hand-written; the session ID is randomized so virtual users do not stomp on each other's state; and wait_time simulates the 2-8 second pause a real user takes between messages.
What metrics do you capture?
4 metric groups:
- Latency: p50, p95, p99 response time. Segment by endpoint.
- Error rate: 4xx, 5xx, timeouts. Segment by status code.
- Upstream metrics: LLM provider request count, tokens consumed, rate limit headers (x-ratelimit-remaining).
- Resource metrics: DB connection pool usage, Redis memory, worker CPU, process memory.
Capture all 4 during the test. A failure in any one identifies a specific bottleneck.
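The latency group is just arithmetic once you have raw response times. Locust reports percentiles itself, but a nearest-rank sketch is handy when post-processing its raw CSV output into your own report:

```python
# Nearest-rank percentiles over raw response times, e.g. parsed from
# Locust's --csv output. Locust computes these live; this sketch is
# for offline post-processing into a stress test report.
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    ranked = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[rank - 1]

def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    return {f'p{p}': percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Compute the summary per endpoint, not globally, or a fast health-check route will mask a slow /chat tail.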
For the Prometheus metric setup that these come from, see the Prometheus metrics for agentic AI observability post.
What are the 5 things that break first?
In order of how often I see them break under load:
- DB connection pool exhaustion. Symptom: requests queue on the pool. Fix: raise pool size, pre-ping connections, or add PgBouncer.
- LLM provider rate limits. Symptom: 429 responses from the provider. Fix: request higher limits, cache repeated prompts, or route through multiple API keys.
- Redis memory. Symptom: cache evictions spike, cache miss rate rises. Fix: raise memory limit or add a TTL policy.
- Tail latency on specific prompts. Symptom: p99 is 20x p50. Fix: find the slow prompt pattern and add a timeout or pre-compute.
- Worker process OOM. Symptom: Uvicorn workers get killed, containers restart. Fix: lower worker count or raise memory limit.
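For the most common breakage (#1, pool exhaustion), the fix usually lives in engine configuration. A sketch assuming SQLAlchemy and Postgres; the DSN and numbers are placeholders, so size the pool from your measured peak, not from this example:

```python
# Hypothetical SQLAlchemy engine config addressing pool exhaustion.
# DSN and sizes are placeholders; derive pool_size from measured peak
# concurrency, not from defaults.
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://agent:secret@db:5432/agent',  # placeholder DSN
    pool_size=50,        # sized for peak concurrency, not the default 5
    max_overflow=10,     # bounded burst headroom beyond the pool
    pool_pre_ping=True,  # detect dead connections before handing them out
    pool_timeout=5,      # fail fast so pool waits surface as errors, not latency
)
```

The short pool_timeout matters in a stress test: it converts silent queueing into visible errors, which is exactly the signal you are running the test to capture.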
Every stress test should identify which of these 5 breaks first for your service. That is the #1 thing to fix before the next test.
For the circuit breaker pattern that handles the LLM rate limit case gracefully, see the Circuit breakers for LLM calls post.
What to do Monday morning
- Stand up a staging environment that mirrors production infrastructure as closely as possible. Different specs = different results.
- Sample 50-100 real prompts from production logs, anonymized. These are your load test prompts.
- Write a Locust file that simulates realistic user behavior: login, multi-prompt session, 2-8 second pause between messages.
- Run the 3-phase profile: ramp, burst, sustained. Capture latency, errors, upstream, and resource metrics throughout.
- Identify what broke first. Fix it. Re-run the same test. Confirm the fix and note what breaks next.
- Before every production release, re-run the full stress test against staging. If anything regresses, block the release.
The headline: stress testing is the only way to find the non-linear failures that production traffic will expose. 3 load profiles, 4 metric groups, 5 common failure modes. Run on staging, fix what breaks, repeat until the service survives 2x your expected peak.
Frequently asked questions
Why can't I just send curl requests to test my agent?
Because agent services behave non-linearly under concurrency. A single request works fine. 50 concurrent requests hit connection pool limits, LLM provider rate limits, and tail latency that a single-request test never sees. Real traffic always has concurrency, and the failures it exposes only show up under actual load.
What load test tool should I use for agent services?
Locust for Python-native teams. It supports async workers, integrates with pytest fixtures, and has a live web UI for watching a test run. k6 for teams that prefer JavaScript and statically-compiled load generators. JMeter for enterprise teams with existing JMeter infrastructure. All three work; pick based on what your team already knows.
What concurrency level should I target in the test?
2x your expected peak. If production sees 50 concurrent users at the 99th percentile, test at 100. The 2x buffer covers unexpected traffic spikes, autoscaling lag, and the fact that load tests usually underestimate real traffic patterns. If the service breaks at 1.5x, you have no safety margin.
What breaks first under load in a typical agent service?
Database connection pool exhaustion, followed by LLM provider rate limits, then Redis memory, then tail latency on slow prompts, then worker process OOM. In that rough order. Every stress test should identify which of these 5 bottlenecks your service hits first and address it before the next run.
How often should I stress test?
Before every significant release, and on a scheduled weekly run in staging. The scheduled run catches slow drift (a new dependency that bloats memory, a config change that narrows the pool). The release-time run catches regressions from code changes. Both are cheap compared to a production outage.
Key takeaways
- Agents behave non-linearly under concurrency. Single-request testing misses every real bottleneck.
- Use Locust (or k6) to run 3 load profiles: ramp, burst, sustained. Each catches different failure modes.
- Sample real prompts from production logs. Hand-written test prompts miss the long tail that breaks production.
- Capture latency (p50, p95, p99), error rate, upstream provider metrics, and resource usage during every test.
- The 5 things that break first: DB pool, LLM rate limits, Redis memory, tail latency, worker OOM. Fix in that order.
- To see stress testing wired into a full production agent stack with CI integration and rollback automation, walk through the Build your own coding agent course, or start with the AI Agents Fundamentals primer.
For the Locust documentation covering async users, distributed runs, and the web UI, see the Locust docs. The distributed-runner section is worth reading before load testing anything larger than your laptop can drive.
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.