Connection pooling in production Python AI services
Your agent works at 10 users and dies at 100
You load-test your agent service with 10 concurrent users. Everything is fine. You push it to 100 and the logs fill with `psycopg2.OperationalError: server closed the connection unexpectedly`, `SSL SYSCALL error: EOF detected`, and the all-time classic broken pipe. The agent loop itself is perfect. Postgres is healthy. And yet every 10th request dies.
The problem is not your code. The problem is that nobody told you how database connections actually work in a long-lived async Python service. Every agent request holds a connection for the duration of the LLM call (several seconds, sometimes tens of seconds). Without a pool, you are either opening a fresh connection per request (slow) or trying to reuse a stale one that Postgres killed minutes ago (broken).
This post is the pool sizing rule, the pre-ping trick, the lifespan pattern, and the exact configuration that survives real agent traffic.
Why do agent services hit "broken pipe" errors under load?
Because an agent request holds a connection far longer than a CRUD request does. A CRUD endpoint opens a connection, runs a query in 30 milliseconds, and returns it to the pool. An agent endpoint opens a connection, kicks off an LLM call, waits 8 seconds, runs a tool that makes a query, waits again, and finally returns. That single request can tie up a connection for 30 seconds.
Now multiply by concurrent users. At 100 concurrent agent requests with a pool size of 10, 90 users are waiting for a connection that is busy streaming tokens from an LLM. At the same time, Postgres has an idle timeout (typically 5 minutes) and your infrastructure has a TCP keepalive timeout (typically 2 minutes). Connections that sit around while you wait for the LLM get killed silently. The next request to pull one from the pool gets a dead socket and crashes.
```mermaid
graph TD
    R1[Request 1] -->|grab conn| Pool[Connection Pool]
    R1 -->|hold for 30s while LLM responds| Hold[Connection idle but 'in use']
    Hold -->|TCP keepalive fires| Kill[Infra kills socket]
    Kill --> Dead[Connection dead]
    R2[Request 2] -->|grabs 'live' connection| Dead
    Dead --> Broken[Broken pipe error]
    style Broken fill:#fee2e2,stroke:#b91c1c
```
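You can feel that arithmetic in a toy asyncio simulation (a sketch with invented names: a `Semaphore` stands in for the pool, and `sleep` stands in for the LLM wait, scaled down from seconds to milliseconds):

```python
import asyncio

async def agent_request(pool: asyncio.Semaphore, hold_s: float, stats: dict):
    if pool.locked():
        stats["queued"] += 1            # all connections busy: this request waits
    async with pool:                    # check a connection out of the 'pool'
        stats["in_use"] += 1
        stats["peak_in_use"] = max(stats["peak_in_use"], stats["in_use"])
        await asyncio.sleep(hold_s)     # connection held while the 'LLM' responds
        stats["in_use"] -= 1            # released back to the pool on exit

async def simulate(concurrent: int, pool_size: int, hold_s: float = 0.005) -> dict:
    stats = {"queued": 0, "in_use": 0, "peak_in_use": 0}
    pool = asyncio.Semaphore(pool_size)
    await asyncio.gather(*(agent_request(pool, hold_s, stats)
                           for _ in range(concurrent)))
    return stats
```

With 100 concurrent requests against a pool of 10, 90 requests queue behind the 10 in-flight ones while those 10 connections sit "in use" waiting on the LLM.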
3 things have to be right to survive: pool size, pre-ping before use, and proper recycling of stale connections.
What is a connection pool and how does SQLAlchemy implement it?
A connection pool is a fixed set of open database connections that your service reuses across requests. Opening a fresh Postgres connection takes 50-100 milliseconds (TCP handshake, TLS negotiation, authentication). Reusing one takes zero. The pool trades memory for speed.
SQLAlchemy's default pool is QueuePool. You configure it when you create the engine. 4 parameters do most of the work:
```python
# filename: db.py
# description: SQLAlchemy async engine with a properly sized connection pool
# for an agentic AI service.

from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.pool import AsyncAdaptedQueuePool

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@host:5432/agents",
    poolclass=AsyncAdaptedQueuePool,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,
    pool_recycle=1800,
)
```
- `pool_size=20`: 20 long-lived connections kept open at all times.
- `max_overflow=10`: up to 10 extra temporary connections during burst traffic (discarded after use).
- `pool_pre_ping=True`: test every connection with `SELECT 1` before handing it to your code. This is the single most important flag in the whole config.
- `pool_recycle=1800`: force-close connections older than 30 minutes. Prevents silent death from Postgres idle timeout.
Drop any one of these and your agent will eventually throw broken pipe at 3am.
How do you size a pool for agent workloads?
The standard formula is `per_worker_connections = (peak_concurrent_requests / num_workers) + buffer`. Pick the peak from your load tests, divide by the number of Uvicorn workers, add 20 percent headroom, then split the result between `pool_size` and `max_overflow`.
Worked example. You run 4 Uvicorn workers and expect 100 concurrent users at peak. Each worker handles 25 concurrent requests. Add 20 percent headroom: 25 * 1.2 = 30. So pool_size=20 with max_overflow=10 gives you 30 per worker, matching the load.
Do not oversize. Postgres has a hard max_connections limit (default 100). With 4 workers each holding 30 connections, you are already at 120 before any other service connects. Oversized pools hit the Postgres connection ceiling and cascade to total outage.
The real trap: every Uvicorn worker has its own pool. If you set pool_size=50 with 8 workers, you have 400 connections to Postgres from one service, which is almost certainly more than your database can handle. Always multiply by worker count when calculating.
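The sizing rule and the worker-count check can be packed into a small helper (a sketch; `size_pool` and its 2:1 split between `pool_size` and `max_overflow` are my own convention, mirroring the 20/10 config above):

```python
import math

def size_pool(peak_concurrent: int, workers: int,
              headroom: float = 0.2, pg_max_connections: int = 100) -> dict:
    per_worker = math.ceil(peak_concurrent / workers)   # e.g. 100 / 4 -> 25
    total = math.ceil(per_worker * (1 + headroom))      # e.g. 25 * 1.2 -> 30
    pool_size = math.ceil(total * 2 / 3)                # steady connections
    max_overflow = total - pool_size                    # burst headroom
    fleet_total = total * workers                       # what Postgres actually sees
    return {
        "pool_size": pool_size,
        "max_overflow": max_overflow,
        "fleet_total": fleet_total,
        # Stay under 50 percent of max_connections from a single service.
        "within_budget": fleet_total <= pg_max_connections // 2,
    }
```

For the worked example (100 concurrent users, 4 workers) this returns `pool_size=20` and `max_overflow=10`, and flags the 120-connection fleet total as over budget against the default `max_connections` of 100.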
Why is pool_pre_ping the most important flag?
Because it is the difference between a pool that handles transient network issues gracefully and one that crashes on every cloud provider maintenance window.
pool_pre_ping=True tells SQLAlchemy to run SELECT 1 on every connection it pulls from the pool, before handing it to your code. If the ping fails, the connection is discarded and a fresh one is checked out. Your code never sees the dead connection.
The cost is one round trip per request (typically under 1 millisecond on the same VPC). The benefit is surviving every flavor of silent connection death: Postgres idle timeouts, NAT table expiration, load balancer connection cycling, cloud provider maintenance.
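The checkout path that pre-ping implements can be sketched with a toy pool (`FakeConn`, `DeadSocket`, and `PrePingPool` are invented for this illustration; SQLAlchemy's real logic lives in its `Pool` and dialect classes):

```python
class DeadSocket(Exception):
    """Stand-in for a 'broken pipe' style driver error."""

class FakeConn:
    def __init__(self, alive: bool = True):
        self.alive = alive

    def execute(self, sql: str) -> str:
        if not self.alive:
            raise DeadSocket("broken pipe")
        return "ok"

class PrePingPool:
    def __init__(self, idle_conns):
        self.idle = list(idle_conns)

    def checkout(self) -> FakeConn:
        while self.idle:
            conn = self.idle.pop(0)
            try:
                conn.execute("SELECT 1")  # the pre-ping round trip
                return conn               # alive: hand it to application code
            except DeadSocket:
                continue                  # dead: discard silently, try the next
        return FakeConn()                 # none left alive: open a fresh connection
```

Application code only ever receives a connection that just answered `SELECT 1`; the dead ones are discarded before it can touch them.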
I shipped 4 agent services without pre-ping; after what the last one cost me, every service since has shipped with it. Pre-ping is not optional. It is the price of operating a long-lived Python service that talks to a database.
How do you wire the pool into FastAPI with lifespan?
Create the engine on startup, close it on shutdown, attach it to app.state. Never create the engine at import time.
```python
# filename: main.py
# description: FastAPI lifespan that creates the engine on startup and
# closes it cleanly on shutdown. One engine per worker.

from contextlib import asynccontextmanager

from fastapi import FastAPI, Request
from sqlalchemy.ext.asyncio import async_sessionmaker

from db import engine

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.engine = engine
    app.state.session_factory = async_sessionmaker(engine, expire_on_commit=False)
    yield
    await engine.dispose()

app = FastAPI(lifespan=lifespan)

async def get_session(request: Request):
    # app.state is reachable through the request object; yielding inside the
    # context manager returns the connection to the pool when the request ends.
    async with request.app.state.session_factory() as session:
        yield session
```
Notice the await engine.dispose() on shutdown. This gracefully closes every connection in the pool when you redeploy. Without it, Postgres sees a sudden drop of 30 sockets and logs an error for each one.
Each Uvicorn worker process runs the lifespan independently, so each worker has its own engine and its own pool. This is exactly what you want. If you created the engine at module import time, it could be created in the parent process before the workers fork, and a pool of open sockets shared across forked processes breaks under async workloads because the pool is not fork-safe.
For a deeper walkthrough of the lifespan pattern in production FastAPI agent services, see the FastAPI and Uvicorn for Production Agentic AI Systems post. For the broader production picture, see the System Design: Building a Production-Ready AI Chatbot walkthrough.
When should you use PgBouncer in front of the pool?
When you have more than 4 Uvicorn workers (say, a container running 8 workers or a fleet of 20 containers each running 4 workers). At that point, even with reasonable pool sizes per worker, you will saturate Postgres max_connections.
PgBouncer is a lightweight connection pooler that sits between your app and Postgres. Your SQLAlchemy pool connects to PgBouncer (fast, local), and PgBouncer multiplexes those connections onto a much smaller pool of real Postgres connections. Typical ratios are 10:1, so 1000 app-side connections might map to 100 Postgres-side connections.
The gotcha: PgBouncer's default transaction pooling mode breaks prepared statements and server-side cursors. For SQLAlchemy async, you want pool_mode = session or you need to disable prepared statement caching on the asyncpg driver. Add prepared_statement_cache_size=0 to the asyncpg connect args when running behind PgBouncer in transaction mode.
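Assuming PgBouncer in transaction mode on its conventional port 6432 (host and credentials are placeholders), the engine sketch might look like this, with the prepared statement cache disabled through the asyncpg dialect's URL option:

```python
# filename: db_pgbouncer.py (hypothetical)
# description: Engine config sketch for asyncpg behind PgBouncer in
# transaction pooling mode. Host, port, and credentials are placeholders.

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    # Connect to PgBouncer, not Postgres directly. prepared_statement_cache_size=0
    # disables the asyncpg dialect's prepared statement cache, which transaction
    # pooling would otherwise break.
    "postgresql+asyncpg://user:pass@pgbouncer-host:6432/agents"
    "?prepared_statement_cache_size=0",
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,
    pool_recycle=1800,
)
```

The app-side pool settings stay as before; PgBouncer handles the multiplexing down to real Postgres connections on the other side.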
Most teams do not need PgBouncer for the first year. Add it when your Postgres dashboard shows connection counts climbing past 50 percent of max_connections during peak.
What to do Monday morning
- Open your `create_engine` call. If it has no `pool_pre_ping`, add it right now. This is the single biggest reliability fix in this post.
- Count your workers and multiply by your pool size. If the product exceeds 50 percent of Postgres `max_connections`, shrink the pool or cut workers.
- Move engine creation into a FastAPI `lifespan` block if it is currently at module import time. This is also a prerequisite for clean shutdowns during rolling deploys.
- Add `pool_recycle=1800` to prevent silent death from Postgres idle timeouts. This catches the connections pre-ping does not.
- Load test with 2x your expected concurrent users for 30 minutes. Watch the Postgres connection count and your app's error logs. Any broken pipe means the config is still wrong.
The headline: a properly sized pool with pre-ping and recycle is the difference between an agent service that survives real traffic and one that dies on every cloud maintenance window. 4 parameters. 15 minutes to configure. Several weeks of pain if you skip them.
Frequently asked questions
Why do agentic AI services need connection pooling?
Because agent requests hold database connections for tens of seconds while waiting for LLM responses and tool calls. Without a pool, you either open a fresh connection per request (slow, adds 100 milliseconds to every call) or reuse stale connections that have been silently killed by infrastructure timeouts. A pool with pre-ping keeps a warm set of connections and tests each one before handing it to your code.
What causes "broken pipe" errors in SQLAlchemy under load?
A connection in your pool has been silently killed between requests, usually by a TCP keepalive timeout, a Postgres idle timeout, a load balancer, or a cloud maintenance window. When the next request checks out that dead connection, the first query gets a broken pipe error. The fix is pool_pre_ping=True plus pool_recycle set below the smallest idle timeout in your infrastructure.
How do you size a SQLAlchemy connection pool for production?
Use the formula pool_size = (peak_concurrent_requests / num_workers) + 20 percent. Multiply the final pool size by the number of workers to check against Postgres max_connections. Never exceed 50 percent of max_connections from a single service, because other services and maintenance tasks also need connections. Typical values for agent services are pool_size=20 and max_overflow=10 per worker.
Should I use PgBouncer in front of SQLAlchemy?
Only when your total connection count (pool size times worker count times container count) approaches Postgres max_connections. Below that, PgBouncer adds latency and a failure mode without solving a real problem. Above that, PgBouncer multiplexes many app-side connections onto a small set of Postgres-side connections and prevents max_connections exhaustion.
Does asyncpg handle connection pooling better than psycopg2?
Asyncpg is significantly faster for async workloads and is the right choice for FastAPI agent services. It has its own pool, but SQLAlchemy's async engine wraps it and gives you the full SQLAlchemy pool behavior (pre-ping, recycle, overflow). Use the SQLAlchemy pool on top of asyncpg, not asyncpg's pool directly, so your config is portable and consistent.
Key takeaways
- Agent requests hold database connections for tens of seconds, which breaks every assumption built for CRUD-style workloads. Connection pools are mandatory, not optional.
- The 4 SQLAlchemy settings that matter: `pool_size`, `max_overflow`, `pool_pre_ping=True`, and `pool_recycle=1800`. Missing any one causes broken pipe errors under load.
- Size the pool per worker, not globally. Always multiply by worker count when checking against Postgres `max_connections`.
- Create the engine inside a FastAPI `lifespan` block. Module-level engines break across worker forks and prevent clean shutdowns on deploy.
- Add PgBouncer only when total connection count approaches Postgres `max_connections`. Below that threshold it adds complexity without solving a real problem.
- To see connection pooling wired into a full production agent stack with auth, streaming, and observability, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.
For the full SQLAlchemy pool documentation and advanced configurations, see the SQLAlchemy connection pooling docs. Every setting in this post is explained in detail there, with benchmarks.
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.