Rate limiting FastAPI agents: token buckets in production
One curl loop and your agent bill is $2000
You ship your agent. A friendly user writes a benchmarking script. The script hits your /chat endpoint in a tight loop, 200 requests in 4 minutes. Each request fires a long LLM call. The user thought they were being polite by stopping at 200. Your OpenAI bill shows $2000 the next morning.
You did not have a rate limiter. "I'll add one later" is the sentence every dev has said right before this happened. Rate limiting is not optional on an agent service. Each request costs real money at the provider, and the cost per request is 100x a normal CRUD endpoint. Without a limiter, a single bad client (malicious or just buggy) can drain your budget in minutes.
This post covers the token bucket algorithm, the Redis implementation that works correctly across multiple Uvicorn workers, and the FastAPI middleware that drops it in without touching your route handlers.
Why don't simple rate limiters work for agent workloads?
Because agent requests are bursty and expensive, and most simple limiters optimize for the wrong dimension.
3 limiter designs and why they fail:
- In-memory counters per worker. Works until you have more than 1 worker. With 4 Uvicorn workers, a user can hit 4x the limit because each worker has its own counter. This is the bug every rate limiter tutorial ships with.
- Fixed window counters. "Max 100 requests per minute." A client sends 100 at 0:59 and 100 more at 1:00, getting 200 requests in 2 seconds. Fixed windows are cheap but leak at window boundaries.
- A flat per-request limit. Ignores that some requests are 10x more expensive than others. A limiter that allows 100 short greetings per minute also allows 100 long agent sessions per minute, which is a 10x cost difference.
You need a limiter that is shared across workers (Redis), smoothed over time (token bucket), and weighted by request cost (each request consumes a number of tokens, not just 1).
```mermaid
graph TD
    Req[Agent request] --> Mid[Middleware]
    Mid -->|key = user_id| Bucket[Redis token bucket]
    Bucket -->|enough tokens?| Check{Decision}
    Check -->|yes| Consume[Deduct tokens]
    Consume --> Route[Agent route handler]
    Check -->|no| Reject[429 Too Many Requests]
    style Bucket fill:#dbeafe,stroke:#1e40af
    style Reject fill:#fee2e2,stroke:#b91c1c
    style Route fill:#dcfce7,stroke:#15803d
```
The token bucket is the algorithm that fixes all 3 failure modes at once.
What is a token bucket and why does it beat fixed windows?
A token bucket has 2 numbers: the current token count and the last-refill timestamp. Tokens refill at a steady rate (say, 1 per second) up to a maximum (the bucket size, say 60). Each request consumes some tokens (1 for cheap calls, 10 for expensive agent calls). If the bucket does not have enough tokens, the request is rejected with a 429.
3 properties make token buckets the right choice:
- Smooth. The rate is constant over time, no window boundary spikes.
- Bursty. A user can burn the whole bucket for a legitimate burst, then wait for it to refill. This matches how humans use agent services.
- Weighted. Different endpoints consume different token amounts, so a single expensive call can count for 10 cheap ones.
Mathematically, at each request: tokens = min(max_tokens, tokens + (now - last_refill) * refill_rate). If tokens >= cost, subtract the cost and proceed. Otherwise, reject.
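The refill formula can be sketched as a pure-Python reference implementation. This is a single-process stand-in (the Redis version later in the post is what runs in production), but it is useful for tests and for building intuition; the class name and `now` parameter are mine:

```python
import time
from typing import Optional


class LocalBucket:
    """Single-process reference token bucket mirroring the formula above."""

    def __init__(self, max_tokens: float = 60, refill_rate: float = 1.0):
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate
        self.tokens = max_tokens
        self.last_refill = time.monotonic()

    def try_consume(self, cost: float = 1, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill step: tokens = min(max_tokens, tokens + elapsed * refill_rate)
        self.tokens = min(
            self.max_tokens,
            self.tokens + (now - self.last_refill) * self.refill_rate,
        )
        self.last_refill = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

With `max_tokens=60` and `refill_rate=1.0`, a caller can burn 60 tokens instantly; ten seconds later the bucket holds 10 tokens again, enough for one expensive call.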
How do you implement a token bucket in Redis?
The naive implementation is a 5-step read-modify-write with race conditions between workers. The correct implementation is a Lua script that Redis executes atomically as a single operation, so there is no read-modify-write window for another worker to interleave into.
```python
# filename: rate_limit.py
# description: Atomic token bucket in Redis using a Lua script.
# Safe across multiple workers without any external locking.
import time

from redis.asyncio import Redis

BUCKET_SCRIPT = """
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local cost = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or max_tokens
local ts = tonumber(bucket[2]) or now

local elapsed = now - ts
tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

if tokens < cost then
    redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
    redis.call('EXPIRE', key, 3600)
    return 0
end

tokens = tokens - cost
redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, 3600)
return 1
"""


class TokenBucket:
    def __init__(self, redis: Redis, max_tokens: int = 60, refill_rate: float = 1.0):
        self.redis = redis
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate
        self.script = redis.register_script(BUCKET_SCRIPT)

    async def try_consume(self, key: str, cost: int = 1) -> bool:
        result = await self.script(
            keys=[f'bucket:{key}'],
            args=[self.max_tokens, self.refill_rate, cost, time.time()],
        )
        return bool(result)
```
Read the Lua script carefully. It does the read, the refill math, the comparison, and the write as a single Redis operation. No worker can interleave. The EXPIRE call at the end ensures abandoned buckets clean themselves up after an hour.
How do you wire the bucket into FastAPI middleware?
One middleware that intercepts every request, extracts the user ID, and calls the bucket. Reject with 429 if the bucket is empty. Let the request through otherwise.
```python
# filename: middleware.py
# description: FastAPI middleware that applies token bucket rate limiting
# per authenticated user, with different costs per endpoint.
from fastapi import Request
from fastapi.responses import JSONResponse

from rate_limit import TokenBucket

ROUTE_COSTS = {
    '/chat': 10,   # agent calls are expensive
    '/search': 2,  # vector search is cheap-ish
    '/health': 0,  # free
}


def get_cost(path: str) -> int:
    for prefix, cost in ROUTE_COSTS.items():
        if path.startswith(prefix):
            return cost
    return 1


async def rate_limit_middleware(request: Request, call_next):
    bucket: TokenBucket = request.app.state.bucket
    user_id = getattr(request.state, 'user_id', None) or request.client.host
    cost = get_cost(request.url.path)
    if cost == 0:
        return await call_next(request)
    allowed = await bucket.try_consume(user_id, cost)
    if not allowed:
        return JSONResponse(
            status_code=429,
            content={'error': 'rate limit exceeded', 'retry_after': 5},
            headers={'Retry-After': '5'},
        )
    return await call_next(request)
```
3 details that matter. First, the cost is per-route, not per-request, because /chat really does cost 10x what /search does. Second, the middleware falls back to the client IP when no user is authenticated (so unauthenticated attackers cannot bypass the limiter by omitting credentials). Third, the Retry-After header tells well-behaved clients when to retry, which cuts down on retry storms.
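The fixed 5-second Retry-After is a reasonable default, but with the bucket's current level you can compute an exact wait. A sketch, assuming an extended Lua script that returns the remaining token count alongside the allow/deny flag (the helper name is mine, not from the code above):

```python
import math


def retry_after_seconds(tokens: float, cost: float, refill_rate: float) -> int:
    """Seconds until the bucket refills enough to cover `cost`.

    `tokens` is the bucket's current level; `refill_rate` is tokens
    added per second. Rounds up so the client never retries too early.
    """
    deficit = cost - tokens
    if deficit <= 0:
        return 0
    return math.ceil(deficit / refill_rate)
```

For example, a /chat call costing 10 tokens against a bucket holding 4, refilling at 0.5 tokens per second, needs a 12-second wait.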
How do you size the bucket for an agent service?
Pick the refill rate from cost, not convenience. A typical agent request costs $0.02 in LLM fees. A generous rate of 1 request every 3 seconds (0.33 refill rate) means a user burns $0.40 per minute of sustained calling. That is fine for real users, painful for bots.
Worked example. You want to cap a free user at $10 per day. $10 / $0.02 = 500 requests per day. 500 / 86,400 seconds ≈ 0.0058 tokens per second. Round down to 0.005 (one request every 200 seconds) so the sustained rate stays under budget; rounding up to 0.01 would double the daily cap to roughly $17. Bucket size 60 so a user can burst 60 requests up front, then has to wait for the refill.
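The same sizing arithmetic in code form, so you can rerun it when prices or budgets change (the numbers are the worked example's assumptions):

```python
# Derive a refill rate from a daily dollar budget.
# Assumptions from the worked example: $10/day cap, $0.02 per agent request.
daily_budget_usd = 10.00
cost_per_request_usd = 0.02

requests_per_day = daily_budget_usd / cost_per_request_usd  # 500
refill_rate = requests_per_day / 86_400                     # ~0.0058 tokens/sec
```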
Paid users get a higher rate proportional to their plan. The bucket key becomes bucket:{plan}:{user_id} so different plans do not collide. A simple dict maps plan to (max_tokens, refill_rate) and the middleware picks the right one.
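A minimal sketch of that plan-aware mapping; the plan names and limit numbers here are illustrative choices of mine, not values from the post:

```python
# Hypothetical plan tiers: (max_tokens, refill_rate in tokens/sec).
PLAN_LIMITS = {
    'free': (60, 0.005),
    'pro': (240, 0.05),
    'enterprise': (1200, 0.5),
}


def bucket_params(plan: str) -> tuple:
    # Unknown plans fall back to the free tier, the safest default.
    return PLAN_LIMITS.get(plan, PLAN_LIMITS['free'])


def bucket_key(plan: str, user_id: str) -> str:
    # Plan in the key so different plans do not collide, and an
    # upgraded user starts with a fresh bucket at the new limits.
    return f'bucket:{plan}:{user_id}'
```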
For the broader production stack picture with rate limits, see the FastAPI and Uvicorn for Production Agentic AI Systems post. For JWT-based auth that provides the user ID this limiter keys on, see the JWT Authentication for Agentic APIs walkthrough.
When should you also rate limit by token count, not request count?
When your agent has highly variable context sizes and a short prompt call is 10x cheaper than a long one. In that case, a 10-token cost for every /chat call overcharges the cheap calls and undercharges the expensive ones.
The fix is post-hoc metering. After the agent call returns, read the actual token usage from the LLM response and deduct that from the bucket. The bucket becomes a "dollars of LLM usage per minute" meter instead of a request counter. You lose the "reject before running" guarantee (because you do not know the cost until after the call), but you gain accurate cost control.
A hybrid works well: reject before running based on a cheap estimate (10 tokens per call), then after the call finishes, deduct the actual cost minus the estimate. The bucket might go slightly negative on expensive calls; the user simply waits longer for it to refill. This is how most production agent services meter usage.
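The hybrid can be sketched in-process; `precharge` rejects before the call based on the estimate, and `settle` reconciles once the actual usage is known (the class and method names are mine — in production the adjustments would be atomic operations against the same Redis hash):

```python
class MeteredBucket:
    """Token bucket that pre-charges an estimate, then settles actual cost."""

    def __init__(self, max_tokens: float = 60):
        self.max_tokens = max_tokens
        self.tokens = max_tokens

    def precharge(self, estimate: float) -> bool:
        # Reject before running the agent call, based on a cheap estimate.
        if self.tokens < estimate:
            return False
        self.tokens -= estimate
        return True

    def settle(self, estimate: float, actual: float) -> None:
        # Deduct the difference between actual and estimated cost. The
        # bucket may go negative on expensive calls, which just means
        # the user waits longer for it to refill back above zero.
        self.tokens -= actual - estimate
```

Pre-charging 10 and settling at an actual cost of 25 leaves the bucket 25 tokens lighter in total, exactly as if the true cost had been known up front.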
What to do Monday morning
- Install Redis if you do not already have one. Any reasonable hosted Redis (or a tiny self-hosted instance) will handle thousands of bucket operations per second.
- Drop the `TokenBucket` class from this post into your service. Register the Lua script on startup via `lifespan`.
- Add the middleware to your FastAPI app. Start with per-IP limiting if you do not have auth yet; switch to per-user once auth is in place.
- Tune `max_tokens` and `refill_rate` from your actual cost model, not from gut feel. $10 per day per free user is a reasonable starting point.
- Add a test that fires 100 requests at the same user ID within a second and confirms you get a 429 after the bucket is empty. Run it in CI forever.
The headline: a multi-worker token bucket in Redis is a 20-line Lua script plus about 30 lines of middleware. It is the difference between a service that can be drained by one curl loop and one that cannot.
Frequently asked questions
Why do AI agent services need rate limiting?
Because each request costs real money at the LLM provider, and the cost per request is 10 to 100 times a normal CRUD endpoint. Without a limiter, a single client can drain your entire provider budget in minutes. Rate limiting for agent services is not about fairness or server load; it is about bounding the worst-case dollar exposure per user per hour.
What is a token bucket rate limiter?
A token bucket keeps a per-user counter that refills at a steady rate up to a maximum. Each request consumes tokens. If the bucket does not have enough, the request is rejected. It handles bursty traffic (users can burn the bucket fast) and smooth traffic (constant refill) equally well, which matches how humans actually use agent services.
How do you rate limit across multiple Uvicorn workers?
Use a shared store like Redis with an atomic Lua script that does the refill and deduction in one operation. In-memory counters per worker let a user hit N times the limit where N is the worker count. A Redis Lua script gives you exactly-one-counter semantics without locks, and the script runs in under a millisecond for a typical bucket.
Should I rate limit by request count or token count?
Both. Start with request count because it is simple and rejects before running. Add token-count metering afterward by deducting actual LLM token usage from the same bucket after each call. The hybrid gives you pre-call protection (no runaway costs) and post-call accuracy (fair charging for variable-cost calls).
What HTTP status code should a rate limiter return?
429 Too Many Requests, with a Retry-After header indicating how long the client should wait before retrying. Well-behaved HTTP clients parse Retry-After automatically and back off, which dramatically reduces retry storms. Setting a short retry window (5 to 10 seconds) is more useful than a long one because the bucket refills continuously.
Key takeaways
- Agent services need rate limiting from day 1 because each call costs real money and one misbehaving client can drain the budget in minutes.
- Token buckets handle bursty, variable-cost traffic correctly. Fixed window counters and in-memory limiters both leak in production.
- Implement the bucket with a Redis Lua script for atomic read-modify-write across multiple Uvicorn workers. The script runs in under a millisecond.
- Wire the bucket into FastAPI with a single middleware. Per-route costs let a single limiter serve both cheap and expensive endpoints fairly.
- Size the bucket from dollar cost per request, not from convenience. $10 per day per free user is a reasonable starting point.
- To see this rate limiter wired into a full production agent stack with auth and observability, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.
For the original token bucket algorithm and its properties, see the Wikipedia entry on token bucket. The Redis Lua pattern in this post matches the one used by most production limiter libraries, which borrow from the same paper.