Your agent took 30 seconds and your token expired

A user kicks off a long agent task. The agent does its planning, calls 3 tools, hits the LLM twice, and finally returns after 28 seconds. Somewhere in that flow, the auth token expired. The final response tries to stream back and hits a 401 middleware check. The user sees a cryptic error and loses their conversation.

Token auth for agents is subtly different from token auth for CRUD. CRUD calls finish in 50 milliseconds so the window for expiry never matters. Agent calls hold sessions open for tens of seconds or minutes, which means the tokens you issue need to either outlive those calls or be refreshable in place. Most JWT tutorials do not tell you this and you find out at 3am.

This post is the JWT pattern I use for agentic APIs: access and refresh token split, the claims that matter, rotation on every refresh, and the middleware that handles long-running agent calls without logging the user out mid-thought.

Why are JWTs the right fit for agent services?

Because agent backends are usually stateless and horizontally scaled, and stateful sessions add infrastructure cost they do not need. A JWT is self-contained: the server verifies the signature, reads the claims, and knows who the user is without hitting a session store.

3 properties that matter for agent workloads:

  1. Stateless verification. Every worker can validate a token without a Redis lookup. That means you can add workers without also scaling a session store.
  2. Short access token lifetime. You can issue 15-minute tokens because you have a refresh mechanism. Short lifetime caps the blast radius of a stolen token.
  3. Claims carry everything the agent needs. Tenant ID, user ID, role, and rate-limit plan all travel inside the token. The agent loop reads them without a database call.
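
To make "stateless verification" concrete, here is a minimal sketch of what an HS256 check involves, using only the Python standard library. The helper names (sign, verify, SECRET) are hypothetical and exist only for this illustration; real services should keep using a maintained library such as PyJWT, as the rest of this post does. The point is that verification needs nothing but the token and the shared secret.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret-not-for-production"  # hypothetical key for this sketch


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(payload: dict, secret: bytes = SECRET) -> str:
    """Build header.payload.signature with an HMAC-SHA256 signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    mac = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(mac)}"


def verify(token: str, secret: bytes = SECRET, now: Optional[float] = None) -> dict:
    """Check signature and expiry with no session store or database lookup."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    payload = json.loads(b64url_decode(payload_b64))
    if payload.get("exp", 0) <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return payload
```

Any worker holding the secret can run verify on its own, which is why adding workers does not require scaling a session store.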

The downside of JWTs is revocation. Once issued, a JWT is valid until it expires. You cannot log a user out mid-flight unless you add a revocation list. This trade-off is fine for agent workloads because access tokens are short-lived; the refresh rotation pattern handles the rest.

graph TD
    Login[POST /login] --> Issue[Issue access + refresh]
    Issue --> Client[Client stores both]
    Client -->|access in header| API[API calls]
    API -->|access expired| Refresh[POST /refresh with refresh token]
    Refresh --> NewPair[New access + refresh]
    NewPair --> Client

    style Issue fill:#dbeafe,stroke:#1e40af
    style Refresh fill:#fef3c7,stroke:#b45309
    style NewPair fill:#dcfce7,stroke:#15803d

2 tokens, 2 different lifetimes, 1 rotation step. That is the whole pattern.

Should you use access tokens and refresh tokens for agents?

Yes. A short-lived access token (15 minutes) and a longer-lived refresh token (14 days). The access token is what every API call carries. The refresh token is only used to get a new access token when the current one expires.

Why 2 tokens instead of 1 long-lived token? Because a stolen access token expires in 15 minutes, so the attacker has a small window. A stolen refresh token is still dangerous, but it lives in a more secure place (an httpOnly cookie scoped to the refresh endpoint, so it is never readable from JavaScript and never sent with ordinary API requests), which makes it harder to steal.
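
As a rough illustration of where the refresh token lives, the sketch below builds a Set-Cookie value with the standard library. The cookie name refresh_token and the helper name are assumptions for this example; the /refresh path matches the endpoint used in this post. Scoping the cookie with Path is what keeps the browser from attaching it to every other request.

```python
from http.cookies import SimpleCookie

REFRESH_TTL_SECONDS = 14 * 24 * 60 * 60  # 14 days, matching the refresh lifetime


def refresh_cookie_header(token: str, path: str = "/refresh") -> str:
    """Build the Set-Cookie value for a refresh token (hypothetical helper)."""
    cookie = SimpleCookie()
    cookie["refresh_token"] = token
    morsel = cookie["refresh_token"]
    morsel["httponly"] = True      # not readable from JavaScript
    morsel["secure"] = True        # only sent over HTTPS
    morsel["samesite"] = "Strict"  # not attached to cross-site requests
    morsel["path"] = path          # only sent to the refresh endpoint
    morsel["max-age"] = REFRESH_TTL_SECONDS
    return morsel.OutputString()
```

The returned string is what a server would place in the Set-Cookie response header after /login and after every /refresh.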

# filename: tokens.py
# description: Issue and verify access and refresh tokens with PyJWT.
# Access tokens are short-lived and carry claims; refresh tokens are opaque
# identifiers backed by a refresh-token record in the database.
import uuid
from datetime import datetime, timedelta
import jwt
from pydantic import BaseModel

JWT_SECRET = 'change-me-load-from-env'
JWT_ALG = 'HS256'
ACCESS_TTL = timedelta(minutes=15)
REFRESH_TTL = timedelta(days=14)


class Claims(BaseModel):
    sub: str         # user id
    tenant_id: int
    role: str
    plan: str
    exp: datetime
    iat: datetime


def issue_access(user_id: str, tenant_id: int, role: str, plan: str) -> str:
    now = datetime.utcnow()
    payload = {
        'sub': user_id,
        'tenant_id': tenant_id,
        'role': role,
        'plan': plan,
        'iat': now,
        'exp': now + ACCESS_TTL,
    }
    return jwt.encode(payload, JWT_SECRET, algorithm=JWT_ALG)


def verify_access(token: str) -> Claims:
    payload = jwt.decode(token, JWT_SECRET, algorithms=[JWT_ALG])
    return Claims.model_validate(payload)

The claims carry sub (user), tenant_id, role, and plan. Anything the agent loop needs for routing and authorization travels in the token, so the loop never hits the database to resolve auth on every step.

What claims should an agentic JWT carry?

Only what the agent needs for authorization and rate limiting. Not profile data, not preferences, not email. The token should be small (JWTs get sent with every request) and stable (once the Claims model requires a new claim, every token issued before the change fails validation until it is refreshed).

5 claims that earn their space:

  1. sub: the user ID. Used for session ownership and audit logging.
  2. tenant_id: the multi-tenant scope. The data layer uses this as the first filter on every query (see the User and Session Models for Multi-Tenant AI Agents post).
  3. role: owner, admin, or member. The permission layer uses this to decide what the user can see.
  4. plan: free, pro, enterprise. The rate limiter picks bucket parameters based on this.
  5. exp and iat: the standard JWT expiry and issued-at fields.

Do not put email, display name, or anything else that changes often into the token. A changed value does not reach the token until it is re-issued, so every mutable field in the token is a field that can be stale for up to one access-token lifetime. Keep the claims stable.

How do you handle refresh token rotation?

On every refresh, issue a new access token AND a new refresh token. Invalidate the old refresh token by marking its row as used in the database. Do not delete the row: reuse detection depends on being able to recognize a token that was already spent. This is called "rotation" and it is the single most important security property of the pattern.

Why rotate? Because if an attacker steals a refresh token, they and the legitimate user both hold the same secret. The moment one of them uses it, a new refresh token is issued and the old one is dead. The next time the other party tries to use the old token, they get rejected, and you can detect the race (called "reuse detection") and log the user out of every device.

# filename: refresh.py
# description: Rotate refresh tokens. Every refresh request invalidates the
# presented token and issues a new pair. Detects reuse.
import uuid
from datetime import datetime

from sqlmodel import Session as DbSession, select

from models import RefreshToken
from tokens import REFRESH_TTL, issue_access


def refresh_pair(db: DbSession, presented: str) -> tuple[str, str]:
    # Naive UTC, matching how expires_at and used_at are stored below.
    now = datetime.utcnow()
    row = db.exec(select(RefreshToken).where(RefreshToken.token == presented)).first()
    if not row:
        raise ValueError('invalid refresh token')
    if row.used_at is not None:
        # Reuse detected. Invalidate every refresh token for this user.
        user_tokens = db.exec(
            select(RefreshToken).where(RefreshToken.user_id == row.user_id)
        ).all()
        for t in user_tokens:
            t.used_at = now
            db.add(t)
        db.commit()
        raise ValueError('refresh token reused; session terminated')
    if row.expires_at <= now:
        # An unused token past its 14-day lifetime is simply rejected.
        raise ValueError('refresh token expired')

    # Spend the presented token, then issue its replacement.
    row.used_at = now
    db.add(row)

    new_refresh = RefreshToken(
        user_id=row.user_id,
        token=str(uuid.uuid4()),
        expires_at=now + REFRESH_TTL,
    )
    db.add(new_refresh)
    db.commit()

    # Assumes the RefreshToken row stores tenant_id, role, and plan next to
    # user_id; if it does not, load them from the user record here instead.
    access = issue_access(row.user_id, row.tenant_id, row.role, row.plan)
    return access, new_refresh.token

Read the reuse-detection branch. If a refresh token is presented twice, it means either the legitimate user retried (unlikely if the client code is correct) or an attacker copied it. Either way, the safest response is to invalidate every refresh token for that user. They have to log in again.
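
To see the reuse-detection branch play out without a database, here is a small in-memory simulation of the same rule. InMemoryRefreshStore and RefreshRecord are hypothetical stand-ins for the refresh-token table, not part of the post's code, and the sketch leaves out expiry and access-token issuance to keep the rotation logic visible.

```python
import secrets
from dataclasses import dataclass
from typing import Dict


@dataclass
class RefreshRecord:
    user_id: str
    used: bool = False


class InMemoryRefreshStore:
    """Hypothetical stand-in for the refresh-token table."""

    def __init__(self) -> None:
        self.records: Dict[str, RefreshRecord] = {}

    def issue(self, user_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self.records[token] = RefreshRecord(user_id)
        return token

    def rotate(self, presented: str) -> str:
        record = self.records.get(presented)
        if record is None:
            raise PermissionError("unknown refresh token")
        if record.used:
            # Reuse: this token was already spent. Mark every token for the
            # user as used so both parties have to log in again.
            for other in self.records.values():
                if other.user_id == record.user_id:
                    other.used = True
            raise PermissionError("refresh token reused; all sessions revoked")
        record.used = True
        return self.issue(record.user_id)


store = InMemoryRefreshStore()
original = store.issue("user-1")
attacker_next = store.rotate(original)  # the thief refreshes first
try:
    store.rotate(original)              # the real user presents the same token
except PermissionError as err:
    print(err)
```

After the second rotate call, attacker_next is also marked used, so the thief's freshly issued token stops working at its next refresh.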

How does the FastAPI middleware handle long agent calls?

The agent call accepts the access token at the start and verifies it once. The long LLM stream does not need to re-verify. If the token expires mid-stream, that is fine because the stream is already authorized.
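
The "verify once, then stream" idea can be sketched without any web framework. In the toy example below, authorize, agent_stream, TokenExpired, and the injected clock are all hypothetical names for illustration; in the real service the single check is the middleware, and the stream is the agent response.

```python
import asyncio
from typing import AsyncIterator, Callable, List


class TokenExpired(Exception):
    """Raised when the token is already expired at the start of the call."""


def authorize(exp: float, now: float) -> None:
    """Single gate at the start of the call."""
    if now >= exp:
        raise TokenExpired("token expired")


async def agent_stream(
    exp: float, clock: Callable[[], float], steps: List[str]
) -> AsyncIterator[str]:
    authorize(exp, clock())  # checked once, before any work starts
    for step in steps:
        await asyncio.sleep(0)  # stands in for LLM and tool latency
        yield step  # no re-check, even if clock() has passed exp by now
```

A call that starts with a valid token runs to completion even if the clock passes exp halfway through; a call that starts with an expired token fails before any work is done.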

The client is responsible for refreshing before making the next call. Most clients refresh automatically when they get a 401 on a regular CRUD call, then retry. The pattern:

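# filename: middleware.py
# description: JWT middleware for FastAPI that validates access tokens on
# every request. Long streaming responses verify once at the start.
from typing import Optional

import jwt
from fastapi import Request
from fastapi.responses import JSONResponse
from pydantic import ValidationError

from tokens import verify_access, Claims

PUBLIC_PATHS = ('/login', '/refresh', '/health')


def unauthorized(detail: str, headers: Optional[dict] = None) -> JSONResponse:
    # Middleware runs outside FastAPI's exception handlers, so an HTTPException
    # raised here would surface as a 500. Return the 401 response directly.
    return JSONResponse({'detail': detail}, status_code=401, headers=headers)


async def jwt_middleware(request: Request, call_next):
    if request.url.path in PUBLIC_PATHS:
        return await call_next(request)

    auth = request.headers.get('authorization', '')
    if not auth.startswith('Bearer '):
        return unauthorized('missing bearer token')

    try:
        claims: Claims = verify_access(auth[7:])
    except jwt.ExpiredSignatureError:
        # Must come first: ExpiredSignatureError subclasses InvalidTokenError.
        return unauthorized('token expired', headers={'x-token-expired': 'true'})
    except (jwt.InvalidTokenError, ValidationError):
        # Bad signature, malformed token, or claims that fail the Claims model.
        return unauthorized('invalid token')

    # Expose the verified claims to route handlers and the tenant guard.
    request.state.user_id = claims.sub
    request.state.tenant_id = claims.tenant_id
    request.state.role = claims.role
    request.state.plan = claims.plan
    return await call_next(request)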

Notice the x-token-expired: true header on the specific expired-token error. Clients can look for that and know to refresh without parsing error messages. It is a small but critical affordance.
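
On the client side, reading that header can be as small as the sketch below. The function name next_action and its three return values are assumptions for illustration; the only thing taken from the middleware is the x-token-expired header itself.

```python
from typing import Mapping


def next_action(status: int, headers: Mapping[str, str]) -> str:
    """Map a response to 'proceed', 'refresh', or 'logout'."""
    if status != 401:
        return "proceed"
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-token-expired", "").lower() == "true":
        return "refresh"
    return "logout"
```

The client never has to inspect the error message, so the server is free to reword it.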

For the broader auth setup with multi-tenant guards and session models, see the User and Session Models for Multi-Tenant AI Agents post. The tenant guard expects the JWT middleware to have populated request.state before data-layer calls fire.

What should the client do about long agent streams?

2 rules. First, refresh the access token before starting a long agent call if more than half of its lifetime has elapsed. Proactive refresh is cheaper than reactive. Second, treat an x-token-expired response as a signal to refresh once and retry once, not to loop forever.
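
Both rules fit in a few lines of client code. The sketch below is written in Python for consistency with the rest of the post, although a browser client would do the same thing in JavaScript. The function names, and the send and refresh callables, are hypothetical; send is assumed to return a (status, headers) pair.

```python
import base64
import json
import time
from typing import Callable, Optional, Tuple


def token_times(token: str) -> Tuple[int, int]:
    """Read iat and exp from the payload WITHOUT verifying the signature.

    That is acceptable on the client, which only uses them for scheduling.
    """
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return int(payload["iat"]), int(payload["exp"])


def should_refresh_first(token: str, now: Optional[float] = None) -> bool:
    """Rule 1: true once more than half of the token lifetime has elapsed."""
    iat, exp = token_times(token)
    current = time.time() if now is None else now
    return current - iat > (exp - iat) / 2


def call_with_one_retry(
    send: Callable[[str], Tuple[int, dict]],
    refresh: Callable[[], str],
    token: str,
) -> Tuple[int, dict]:
    """Rule 2: on an expired-token 401, refresh once and retry once."""
    status, headers = send(token)
    if status == 401 and headers.get("x-token-expired") == "true":
        status, headers = send(refresh())
    return status, headers
```

If the retry also comes back 401, call_with_one_retry returns it to the caller instead of refreshing again, which is what keeps a broken refresh flow from turning into a request loop.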

For the production stack picture that ties together auth, streaming, pools, and rate limiting, see the FastAPI and Uvicorn for Production Agentic AI Systems post and the System Design: Building a Production-Ready AI Chatbot walkthrough.

What to do Monday morning

  1. Stop using 1 long-lived token. Split into access (15 min) and refresh (14 days) if you have not already. This is the biggest security improvement in this post.
  2. Audit your JWT claims. Remove anything mutable (display name, email, plan if it changes often). Keep only what is stable and needed for routing and authorization.
  3. Add rotation. Every refresh issues a new refresh token and invalidates the old one. Detect reuse and log the user out of every session when it happens.
  4. Add the x-token-expired response header so clients can distinguish "expired, refresh me" from "invalid, log out" without parsing error strings.
  5. Move JWT_SECRET into environment variables, loaded via Pydantic Settings. See the .env.development vs .env: Environment Config for Agentic Systems post if you do not have that pattern in place yet.
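
For step 5, the sketch below shows the fail-fast idea with plain os.environ as a dependency-free stand-in; it is not the Pydantic Settings pattern that step recommends. The helper name and the 32-character minimum are assumptions for this example (32 bytes matches the 256-bit output size of the HS256 hash).

```python
import os
from typing import Mapping, Optional


def load_jwt_secret(env: Optional[Mapping[str, str]] = None, min_length: int = 32) -> str:
    """Read JWT_SECRET from the environment and refuse weak or missing values."""
    env = os.environ if env is None else env
    secret = env.get("JWT_SECRET", "")
    if not secret:
        raise RuntimeError("JWT_SECRET is not set")
    if len(secret) < min_length:
        raise RuntimeError(f"JWT_SECRET must be at least {min_length} characters")
    return secret
```

Calling this once at startup means a missing or placeholder secret stops the service from booting instead of silently signing tokens with a guessable key.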

The headline: JWT auth for agents is JWT auth for CRUD plus short access tokens, rotation, and one extra header for long calls. 200 lines total. Ship it before the first production incident.

Frequently asked questions

Why use JWTs instead of server-side sessions for agent APIs?

Because JWTs are stateless and horizontal scale is free. Every worker can verify a token without a Redis or database lookup, which keeps latency low and avoids a shared-session bottleneck. The trade-off is that revocation is harder (JWTs are valid until they expire), which is solved by keeping access tokens short and using refresh token rotation.

What claims should a JWT carry for an agentic API?

Only what the agent needs for authorization: sub (user ID), tenant_id (multi-tenant scope), role (permissions), plan (rate limit bucket), and the standard exp and iat. Nothing mutable and nothing large. Every extra claim is bytes on every request and every change requires re-issuing existing tokens.

How long should access and refresh tokens live?

Access tokens: 15 minutes. Refresh tokens: 14 days. Short access tokens cap the damage from a steal; long refresh tokens keep the user logged in across sessions without constant re-logins. Rotation on every refresh closes the replay window for the longer-lived token.

What is refresh token rotation?

Each time a client uses a refresh token to get a new access token, the server issues a new refresh token and invalidates the old one. If the old token is presented again, the server detects reuse and logs the user out of every device. This defeats the attack where a stolen refresh token is used in parallel with the legitimate user's.

How do you keep long agent calls authenticated without the token expiring mid-call?

Verify the access token once at the start of the call. The subsequent LLM streaming and tool execution run inside the authorized context and do not re-verify. If the client needs to make a new call and the token has expired, it refreshes first. Long streams never need to re-check auth because the decision was made at the start.

Key takeaways

  1. Use access and refresh tokens, never a single long-lived token. Short access tokens cap the blast radius of a steal; refresh tokens keep users logged in.
  2. Claims carry only stable, route-relevant data: sub, tenant_id, role, plan, exp, iat. Nothing mutable, nothing large.
  3. Rotate refresh tokens on every use and detect reuse. A replayed refresh token is almost always theft and should trigger a full session logout.
  4. Verify the access token once at the start of a long agent call. Streaming and tool execution run inside the authorized context without re-verification.
  5. Return x-token-expired: true on expired-token 401s so clients can distinguish "refresh and retry" from "log out" without parsing error strings.
  6. To see JWT auth wired into a full production agent stack with tenant guards, rate limits, and streaming, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.

The JWT format itself is defined in RFC 7519, and the security practices that inform this post are collected in the JWT Best Current Practices document, RFC 8725. Refresh token rotation with reuse detection is described in the OAuth 2.0 security best current practice, RFC 9700.
