FastAPI and Uvicorn for production agentic AI systems
You shipped an agent. Then you shipped it twice.
Your agent works perfectly on your laptop. You wrap it in a FastAPI route, run uvicorn main:app --reload, send a request from Postman, and the response streams back. Demo day goes well.
Then you put it behind a real URL. A second user hits the endpoint while the first is still mid-tool-call, and the whole server stalls. Your container restarts and every in-flight conversation evaporates. A long planning loop hits the 30-second proxy timeout. CPU sits at 4% while requests queue up.
The problem is not your agent code. The problem is that nobody told you how FastAPI and Uvicorn actually behave once the workload is agentic: long-running, streaming, and unpredictable.
This post is the production checklist I wish I had when I first put an agent behind FastAPI. We will look at why the defaults fail, what to change, and the exact uvicorn command you should be running on Monday morning.
Why tutorials lie about FastAPI in production
Most FastAPI tutorials are written for CRUD apps. A CRUD request is short, deterministic, and stateless. It pulls a row from Postgres, serializes JSON, returns in 30ms. Under that workload, uvicorn main:app with one worker is fine.
An agent request looks nothing like that. A single /chat call might:
- Hit an LLM (2-15 seconds, network-bound).
- Call a tool that runs a SQL query (300ms, blocking).
- Hit the LLM again with tool output (another 5 seconds).
- Stream tokens back to the user the whole time.
- Persist intermediate state to Postgres in case the user refreshes.
That single request can hold a connection open for 30+ seconds. If you have one Uvicorn worker and 4 concurrent users, 3 of them are waiting for nothing. If your agent code accidentally calls a sync function in the middle, the entire event loop blocks and every user waits.
This is why "it worked on my laptop" lies to you. You only had one user. Yourself.
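You can see this failure mode without FastAPI at all. A minimal stdlib-only sketch: two "requests" that make a sync call block the loop and run back to back, while two that properly await overlap.

```python
import asyncio
import time


async def blocking_handler():
    # Simulates a sync call (requests.get, a sync DB driver) inside an
    # async route: it holds the event loop for the full second.
    time.sleep(1)


async def async_handler():
    # Yields control back to the loop while waiting, like httpx or asyncpg.
    await asyncio.sleep(1)


async def serve_two(handler):
    # Run two "concurrent users" and measure total wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(handler(), handler())
    return time.perf_counter() - start


blocked = asyncio.run(serve_two(blocking_handler))    # ~2s: serialized
overlapped = asyncio.run(serve_two(async_handler))    # ~1s: concurrent
print(f"blocking: {blocked:.1f}s, async: {overlapped:.1f}s")
```

With one user you never notice the difference; with two, the blocking version already doubles latency.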
What is Uvicorn and why does FastAPI need it?
Uvicorn is the ASGI server that actually runs your FastAPI application. FastAPI is just a routing and validation framework. It does not listen on a port. It does not manage workers. It does not understand HTTP. Uvicorn does all of that.
The important word here is ASGI, not WSGI. WSGI (the old standard, used by Flask and Django before 3.0) gives you one request, one response, one thread, no streaming. ASGI gives you an async event loop, long-lived connections, server-sent events, and WebSockets. Agents need every one of those things.
```mermaid
graph TD
    Client[Browser or curl client] -->|HTTPS request| LB[Load balancer or reverse proxy]
    LB -->|forwards request| Uvicorn[Uvicorn ASGI server]
    subgraph Worker["Uvicorn worker process"]
        Loop[asyncio event loop] -->|runs coroutine| FastAPI[FastAPI route handler]
        FastAPI -->|await llm.stream| LLM[LLM provider]
        FastAPI -->|await db.fetch| DB[(Postgres)]
        FastAPI -->|yield chunks| Stream[StreamingResponse]
    end
    Uvicorn --> Worker
    Stream -->|SSE chunks| Client
    style Loop fill:#dbeafe,stroke:#1e40af
    style FastAPI fill:#dcfce7,stroke:#15803d
```
If you only remember one thing: Uvicorn owns the event loop. Your agent code runs on that loop. Anything that blocks the loop blocks every user on that worker.
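To make "ASGI" concrete, here is the whole contract in miniature. This is a hand-rolled sketch, not FastAPI code: an ASGI app is just an async callable taking a scope plus `receive`/`send` channels, and the stub driver below plays Uvicorn's side of the protocol.

```python
import asyncio


async def app(scope, receive, send):
    # The entire ASGI interface: the server calls this coroutine once per
    # request, and the app answers by sending event dicts back.
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"hello"})


async def drive():
    # Stub out the server side to exercise the app by hand.
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return sent


messages = asyncio.run(drive())
print(messages)
```

FastAPI compiles your routes and validation down to exactly this shape; Uvicorn supplies the socket, the loop, and the real `receive`/`send`.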
Why FastAPI beats Flask for AI agents
I get this question every week. The honest answer: for an AI agent, it is not close.
Flask is sync-first. Every request gets a thread. To stream LLM tokens you have to fight the framework, then deploy with a special server, then hope nothing in your stack breaks. FastAPI is async-first. async def chat(...) plus StreamingResponse is 2 lines and it just works.
3 concrete wins:
- One event loop, thousands of in-flight requests. While one request is waiting on the LLM, the loop services another. With Flask plus Gunicorn sync workers, you need one OS thread per concurrent request. That blows up at 50 users.
- Native streaming. A `yield` from an async generator becomes a server-sent event. No `Response(stream_with_context(...))` gymnastics.
- Type-safe tool I/O. Pydantic models are how FastAPI validates inputs. They are also how most LLM tool-calling libraries describe tools. You write the schema once.
I run Flask in a few legacy services. I would not start a new agent on it.
How do you run FastAPI in production for agents?
The command you should be running is not uvicorn main:app --reload. Reload mode is for development. It watches files, restarts the server on every save, and runs a single worker. In production you want the opposite: stable, multi-process, with graceful shutdowns.
Here is the minimum viable production command, and what each flag actually does.
```bash
# filename: run.sh
# description: Production Uvicorn command for an agent service.
# Run inside the container as a non-root user.
uvicorn app.main:app \
  --host 0.0.0.0 \
  --port 8000 \
  --workers 4 \
  --timeout-keep-alive 75 \
  --timeout-graceful-shutdown 30 \
  --log-config log_config.json \
  --proxy-headers \
  --forwarded-allow-ips '*'
```
Reading that flag by flag:
- `--workers 4`: spawns 4 independent processes, each with its own event loop. Rule of thumb for I/O-bound agent workloads: start with `2 * num_cpus`. Do not go wild. More workers means more concurrent LLM connections, which means you hit the provider rate limit faster.
- `--timeout-keep-alive 75`: keeps the TCP connection open between streamed chunks. If your agent pauses for 60 seconds while a tool runs, you do not want the socket to die.
- `--timeout-graceful-shutdown 30`: when you redeploy, give in-flight requests 30 seconds to finish before SIGKILL. Without this, every deploy drops every active conversation.
- `--proxy-headers` and `--forwarded-allow-ips '*'`: tells Uvicorn to trust `X-Forwarded-For` from your load balancer, so request logs show real client IPs instead of the LB's internal address.
Notice what is not in there: --reload. If you see --reload in a production Dockerfile, that is your bug.
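In a container, that command becomes your `CMD`. A sketch of the relevant Dockerfile lines; the base image, paths, and username are illustrative, not prescriptive:

```dockerfile
# Illustrative Dockerfile fragment; adjust base image and paths to your repo.
FROM python:3.12-slim
WORKDIR /srv
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
# Run as a non-root user.
RUN useradd --create-home appuser
USER appuser
# Exec-form CMD so Uvicorn receives SIGTERM directly and can shut down gracefully.
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", \
     "--workers", "4", "--timeout-graceful-shutdown", "30", "--proxy-headers"]
```

The exec-form `CMD` matters: with the shell form, signals go to `/bin/sh` instead of Uvicorn, and your graceful-shutdown window never fires.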
Uvicorn or Gunicorn for AI workloads?
For most AI agents in 2026, just use Uvicorn directly. The old advice was "Gunicorn with Uvicorn workers gives you a more battle-tested process manager." That was true in 2020. Uvicorn's own multi-worker mode is now stable, simpler to configure, and does not introduce a second layer of timeouts to debug.
The exception: if you need per-worker memory limits, max-requests recycling, or you are already standardized on Gunicorn across other services, run gunicorn -k uvicorn.workers.UvicornWorker app.main:app -w 4. Both work. Pick one and stop bikeshedding.
How do you stream LLM responses with FastAPI?
Use StreamingResponse with an async generator. Yield each token as a server-sent event line, and let Uvicorn flush it down the wire.
```python
# filename: app/routes/chat.py
# description: Minimal streaming endpoint for an agent. Yields SSE chunks
# as the LLM produces tokens.
from fastapi import APIRouter
from fastapi.responses import StreamingResponse

from app.agent import run_agent  # your agent's async generator

router = APIRouter()


@router.post('/chat')
async def chat(payload: dict):
    async def event_stream():
        async for chunk in run_agent(payload['message']):
            # SSE format: data: <payload>\n\n
            yield 'data: ' + chunk + '\n\n'
        yield 'data: [DONE]\n\n'

    return StreamingResponse(
        event_stream(),
        media_type='text/event-stream',
        headers={
            'Cache-Control': 'no-cache',
            'X-Accel-Buffering': 'no',  # disable Nginx buffering
        },
    )
```
2 things will bite you here. First, if you put Nginx or Cloudflare in front, both will buffer responses by default and your "streaming" endpoint will look like a normal slow request. The X-Accel-Buffering: no header disables it for Nginx; for Cloudflare you need a Workers script or to use Transfer-Encoding chunked. Second, every await inside run_agent must be a real async call. One sneaky requests.get(...) instead of httpx.get(...) and the entire worker stops streaming for every user. I have shipped that bug. It is humbling.
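When you cannot replace a sync library yet, push it off the loop instead of letting it block. A minimal sketch using stdlib `asyncio.to_thread` (FastAPI's `run_in_threadpool` serves the same purpose); `slow_sql_tool` here is a made-up stand-in for your blocking call.

```python
import asyncio
import time


def slow_sql_tool(query: str) -> str:
    # Stand-in for a sync DB driver or requests call you cannot swap yet.
    time.sleep(0.1)
    return f"rows for {query!r}"


async def run_tool(query: str) -> str:
    # The blocking call runs on a worker thread; the event loop stays
    # free to stream tokens for every other user in the meantime.
    return await asyncio.to_thread(slow_sql_tool, query)


result = asyncio.run(run_tool("SELECT 1"))
print(result)
```

This is a bridge, not a destination: threads are a finite pool, so migrate hot paths to real async drivers when you can.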
If you want a guided walkthrough of building the agent that sits behind this endpoint, the Build Your Own Coding Agent course walks through the loop, tool calls, and streaming end to end. If you are still wiring up your first agent and want the conceptual model first, start with the free AI Agents Fundamentals resource.
The lifespan trap: where agents quietly die
The single biggest reason production agent services fall over is shared state created at module import time. People write client = OpenAI() at the top of main.py, push to production, and a week later the service is leaking connections and the LLM client's internal pool is poisoned across workers.
Use FastAPI's lifespan context. Create clients on startup, close them on shutdown, attach them to app.state.
```python
# filename: app/main.py
# description: Lifespan-managed clients so each Uvicorn worker has its own
# client pool, properly closed on shutdown.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from openai import AsyncOpenAI

from app.db import create_pg_pool  # your async pool factory, e.g. around asyncpg


@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.llm = AsyncOpenAI()
    app.state.db = await create_pg_pool()
    yield
    await app.state.db.close()
    await app.state.llm.close()


app = FastAPI(lifespan=lifespan)
```
Why this matters: when you set --workers 4, Uvicorn forks 4 processes. Each process gets its own copy of app.state. If you create the LLM client at import time, you get 4 independent clients with shared file descriptors that can corrupt each other under load. With lifespan, each worker initializes cleanly.
You can see the same pattern applied to a richer chatbot stack in System Design: Building a Production-Ready AI Chatbot, where the lifespan also wires up vector stores, Redis, and the LangGraph runtime.
What to do Monday morning
A short, do-this-now list:
- Open your `Dockerfile`. If the `CMD` says `--reload`, delete it. Replace with the production `uvicorn` command from this post.
- Search your codebase for `requests.`, `time.sleep`, and any sync DB driver. Anything inside an `async def` route is a worker-killer. Swap for `httpx`, `asyncio.sleep`, and an async driver like `asyncpg`.
- Move every long-lived client (LLM, database, vector store) into a `lifespan` block. Read it from `app.state`, never from a module global.
- Add `--timeout-graceful-shutdown 30` so deploys do not nuke streaming sessions.
- Put a real load test in front of it. 5 concurrent fake users sending agent prompts will tell you in 30 seconds whether your event loop is actually free.
None of this is exotic. It is the boring infrastructure that decides whether your agent feels like a polished product or a side project people give up on.
Frequently asked questions
What is Uvicorn in FastAPI?
Uvicorn is the ASGI server that runs your FastAPI application. FastAPI defines the routes and validation, but it does not listen on a port. Uvicorn opens the socket, runs the asyncio event loop, and dispatches each incoming request to your async route handlers. Without an ASGI server like Uvicorn (or Hypercorn), a FastAPI app cannot serve traffic at all.
Should I use Uvicorn or Gunicorn for an AI agent?
For new AI services, run Uvicorn directly with --workers N. It is simpler, has fewer timeout layers, and supports everything an agent needs out of the box. Use Gunicorn with UvicornWorker only if you need its process manager features like max-requests recycling or per-worker memory limits, or if your platform already standardizes on Gunicorn.
How many Uvicorn workers should I run for an agent service?
Start with 2 * num_cpus and load test from there. Agent workloads are I/O bound, so you can usually go higher than for CPU-bound services, but each extra worker also opens more concurrent connections to your LLM provider. Hit the rate limit and your throughput drops, not rises. Measure before you scale.
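The rule of thumb as one line of Python, with the caveat that `os.cpu_count()` reports the host's CPUs, which under container CPU quotas may be more than your service actually gets:

```python
import os

# 2 * CPUs is a starting point for I/O-bound agent services,
# not a target; confirm with a load test.
workers = 2 * (os.cpu_count() or 1)
print(f"uvicorn app.main:app --workers {workers}")
```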
Why does my FastAPI agent stop responding under load?
The most common cause is a synchronous call inside an async def route. A requests.get, a sync database query, or a time.sleep will block the event loop, freezing every user on that worker. The fix is to replace sync libraries with their async equivalents (httpx, asyncpg, asyncio.sleep) or to push the blocking work into a thread pool with run_in_threadpool.
How do I stream LLM responses through FastAPI?
Return a StreamingResponse wrapping an async generator that yields server-sent event chunks. Set media_type='text/event-stream' and add X-Accel-Buffering: no so Nginx does not buffer the response. Inside the generator, every call must be awaitable. One blocking call and the stream stalls for everyone on the worker.
Key takeaways
- FastAPI handles routes; Uvicorn owns the event loop. Most production failures are about Uvicorn behavior, not FastAPI code.
- Drop `--reload` in production. Use `--workers`, `--timeout-keep-alive`, and `--timeout-graceful-shutdown` to survive long agent calls and zero-downtime deploys.
- Every long-lived client (LLM, database, vector store) belongs in a `lifespan` block, not a module global. Workers must initialize cleanly.
- One sync call inside an async route blocks every user on that worker. Audit your imports for `requests`, sync DB drivers, and `time.sleep`.
- Streaming only works end to end if your reverse proxy stops buffering. Set `X-Accel-Buffering: no` and verify with `curl -N`.
- When you are ready to apply this to a real agent loop with tool calls and persistence, the Build Your Own Coding Agent course walks through it module by module, and the free AI Agents Fundamentals primer covers the conceptual model first.
For the official background on the flags above, the Uvicorn deployment docs are the source of truth. The lifespan pattern is documented in FastAPI's lifespan events guide. The recipes here are what those docs look like once you have been burned by an agent workload in production.
Continue Reading
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.