Your agent retries a 400 Bad Request 5 times and your bill shows it

The user sent a malformed prompt. Your agent got back a 400 Bad Request. Your retry loop did not check the status code and retried 5 times with exponential backoff. That is 5 wasted calls, 5 log entries, and a user waiting 40 seconds for an error they could have seen in 200 ms.

The fix is retry logic that knows which errors are retryable. Rate limits, timeouts, and 5xx errors deserve a retry with backoff. 4xx errors, authentication failures, and content-policy rejections do not. The whole thing is 40 lines of Python, and it saves money on every deploy.

This post is the LLM retry pattern: the error classes to retry, the backoff math, the jitter that prevents thundering-herd on recovery, and the 3 error codes you must never retry.

Why do naive retry loops hurt more than help?

Because "retry on any exception" is a foot-gun. 3 specific failure modes:

  1. Retrying 4xx errors. Bad request, unauthorized, content policy: all deterministic. Retrying them wastes calls and delays the error response. The user waits longer for a failure that was already determined on the first call.
  2. No backoff. A tight retry loop on a rate-limited call hits the rate limit again immediately. The provider throttles you harder and your latency gets worse.
  3. No jitter. If 100 agents all retry at exactly 1 second, they all hit the API at the same moment and all get rate-limited again. Add random jitter or you will sync up and melt the provider.

A correct retry policy is shaped by which errors are transient and which are deterministic. Transient errors get retries with backoff plus jitter. Deterministic errors get raised immediately.

graph TD
    Call[LLM API call] --> Err{Error?}
    Err -->|No| Return[Return result]
    Err -->|Yes| Type{Error type}
    Type -->|429, 5xx, timeout| Retry[Backoff + jitter]
    Type -->|400, 401, 403| Raise[Raise immediately]
    Retry --> Cap{Attempts < max?}
    Cap -->|Yes| Call
    Cap -->|No| Raise

    style Retry fill:#dbeafe,stroke:#1e40af
    style Raise fill:#fee2e2,stroke:#b91c1c
    style Return fill:#dcfce7,stroke:#15803d

Which errors should you retry?

3 retryable categories.

  1. Rate limit (HTTP 429). Transient by definition. The provider is telling you to slow down. Retry with backoff.
  2. Server errors (HTTP 5xx). The provider has a problem. Usually resolves within seconds. Retry with backoff.
  3. Timeouts and connection errors. Network glitches, DNS hiccups, TLS failures. Retry once or twice with a fresh connection.

3 non-retryable categories.

  1. Bad request (HTTP 400). Your prompt is malformed or exceeds context. Retrying does nothing.
  2. Authentication (HTTP 401, 403). Your API key is invalid or lacks permission. Retrying does nothing.
  3. Content policy violations. The provider rejected the prompt. Retrying with the same prompt is useless, and retrying with a rewritten prompt is a different codepath.

How do you build the retry loop?

Tenacity handles most of the mechanics. The provider-specific exception classes live in their SDKs.

# filename: app/llm/retry.py
# description: Retry policy for LLM API calls with error classification.
from anthropic import APIConnectionError, APIStatusError, RateLimitError
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential_jitter


def _is_retryable(exc: BaseException) -> bool:
    # Connection errors and rate limits are transient: retry them.
    if isinstance(exc, (APIConnectionError, RateLimitError)):
        return True
    # Status errors are retryable only when the server is at fault (5xx).
    # 4xx errors are deterministic and raise immediately.
    if isinstance(exc, APIStatusError):
        return exc.status_code >= 500
    return False


@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=30, jitter=2),
    retry=retry_if_exception(_is_retryable),
    reraise=True,
)
def call_llm_with_retry(client, **kwargs):
    return client.messages.create(**kwargs)

4 decisions worth calling out. stop_after_attempt(4) caps at 4 total tries (1 initial + 3 retries). wait_exponential_jitter doubles the wait on each retry, capping at 30 seconds and adding up to 2 seconds of random jitter. The _is_retryable helper explicitly checks the status code so 4xx errors escape the retry loop. reraise=True means the original exception propagates on final failure, not a tenacity wrapper.
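To sanity-check the schedule those parameters produce, you can compute it by hand. This is a sketch of the same shape of formula (exponential growth capped at a maximum, plus uniform random jitter); backoff_wait is an illustrative helper, not a tenacity API.

```python
import random


def backoff_wait(retry_number: int, initial: float = 1, cap: float = 30, jitter: float = 2) -> float:
    """Wait before the Nth retry: exponential, capped, plus random jitter."""
    base = min(cap, initial * (2 ** retry_number))
    return base + random.uniform(0, jitter)


# Deterministic part of the schedule for 3 retries: 1s, 2s, 4s, jitter on top.
schedule = [round(min(30, 2 ** n)) for n in range(3)]
print(schedule)  # [1, 2, 4]
```

The cap matters: without it, a long outage pushes waits into minutes and your user gives up long before the retry fires.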

For the broader resilience pattern including circuit breakers, see the Circuit breakers for agentic AI post.

Why does jitter matter so much?

Because without jitter, failures synchronize. Imagine 200 agents all hit a rate limit at the same moment. Without jitter, they all retry at exactly 1 second, then 2 seconds, then 4 seconds. The provider sees 200 simultaneous requests every retry window. With jitter, the retries spread across 1-3 seconds, 2-4 seconds, 4-6 seconds. The provider sees a smooth load instead of a spike.

This is the difference between a system that recovers gracefully and a system that amplifies its own outages. The math is well known; the AWS Architecture Blog has a canonical post on exponential backoff with full jitter that every backend engineer should read.
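A quick simulation makes the spike visible. The numbers are illustrative, not provider data: 200 clients retrying after a simultaneous failure, with and without full jitter on a 2-second window.

```python
import random

CLIENTS = 200

# Without jitter: every client retries at exactly t=1s, a single spike.
no_jitter = [1.0] * CLIENTS

# With full jitter: retries spread uniformly across a 0-2 second window.
with_jitter = [random.uniform(0, 2.0) for _ in range(CLIENTS)]

spike_size = max(no_jitter.count(t) for t in set(no_jitter))  # 200 requests at once
spread = max(with_jitter) - min(with_jitter)                  # nearly the full window
print(spike_size, round(spread, 1))
```

The provider sees the same total request count either way; jitter only changes how that count is distributed over time, which is exactly what rate limiters care about.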

How do you handle rate limits intelligently?

Rate limits often include a retry-after header telling you exactly when to try again. Use it instead of guessing with exponential backoff.

# filename: app/llm/rate_limit.py
# description: Respect the retry-after header on 429 responses.
import random
import time

from anthropic import RateLimitError


def call_respecting_rate_limit(client, **kwargs):
    for attempt in range(4):
        try:
            return client.messages.create(**kwargs)
        except RateLimitError as e:
            if attempt == 3:
                raise  # Out of retries: propagate the original error
            # Retry-After is usually seconds; fall back to exponential backoff.
            retry_after = float(e.response.headers.get("retry-after", 2 ** attempt))
            time.sleep(retry_after + random.uniform(0, 0.5))  # small random jitter

When the provider gives you a specific retry time, use it. When it does not, fall back to exponential backoff with jitter.

For the token-bucket rate limiter on your own API side, see the Rate limiting FastAPI agents with token buckets post.

What should you log on every retry?

3 specific fields so you can debug later:

  1. Error class and status code. Tells you which error you hit.
  2. Attempt number and wait time. Tells you if backoff is working.
  3. Request ID or trace ID. Lets you correlate with your observability system.

Do not log the full request body on every retry; that is how you leak secrets into log aggregators. Log the trace ID and look up the request in your own storage if you need details.

What to do Monday morning

  1. Find your current LLM API call wrapper. Read the retry logic. If it retries on any exception, you have a bug.
  2. Add the _is_retryable helper that checks for 429, 5xx, connection errors, and nothing else.
  3. Add wait_exponential_jitter with a 30-second cap. Cap attempts at 4.
  4. Respect the retry-after header on 429 responses.
  5. Run the agent against a malformed prompt. Confirm the 400 error raises after 1 call, not 4.
  6. Add retry metrics (attempt count, final status) to your observability dashboard.
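Step 5 is easy to automate with a stub. Everything here is hypothetical stand-in code (FakeClient, FakeBadRequestError, and a minimal classified retry loop), but it demonstrates the assertion that matters: a 400 escapes after exactly one call.

```python
class FakeBadRequestError(Exception):
    """Stands in for the SDK's 400 APIStatusError."""
    status_code = 400


class FakeClient:
    """Counts calls and always fails with a 400-style error."""
    def __init__(self):
        self.calls = 0

    def create(self, **kwargs):
        self.calls += 1
        raise FakeBadRequestError("malformed prompt")


def call_with_classified_retry(client, max_attempts=4, **kwargs):
    for attempt in range(max_attempts):
        try:
            return client.create(**kwargs)
        except Exception as e:
            # Deterministic client errors (status < 500) fail fast.
            if getattr(e, "status_code", 500) < 500:
                raise
            if attempt == max_attempts - 1:
                raise


client = FakeClient()
try:
    call_with_classified_retry(client)
except FakeBadRequestError:
    pass

assert client.calls == 1  # 400 raised after one call, not four
```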

The headline: retries should be precise, bounded, and classified by error type. 40 lines of Python fixes a $100/month bug and makes your agent behave predictably in degraded conditions.

Frequently asked questions

Which LLM API errors should I retry?

Retry transient errors: 429 rate limits, 5xx server errors, and network/timeout errors. Do not retry 4xx client errors: 400 bad request, 401 unauthorized, 403 forbidden, and content-policy rejections. Retrying deterministic errors wastes money and delays the error response the user should see immediately.

Why does my retry loop cause more rate limiting?

Because you are retrying too aggressively or without jitter. Without backoff, the retry hits the rate limit again instantly. Without jitter, multiple agents retry in sync and produce a thundering herd. Fix both: exponential backoff starting at 1 second, capped at 30 seconds, with random jitter up to 2 seconds added on each retry.

How many retries are enough?

4 total attempts (1 initial + 3 retries) is the sweet spot for most LLM calls. More than that adds latency without catching many additional failures. If you need more retries, the upstream is probably broken and a circuit breaker is a better pattern than more retries.

Should I use tenacity or write my own retry logic?

Use tenacity for standard cases. It handles stop conditions, wait strategies, exception classification, and jitter with 10 lines of decorators. Write your own only when you need something tenacity cannot express, like respecting a provider's retry-after header with custom logic.

What is exponential backoff with jitter?

A retry wait strategy where each attempt waits longer than the last (doubling) plus a random offset. The exponential part prevents hammering a degraded service; the jitter prevents multiple clients from synchronizing their retries. Together they give you graceful recovery under load and no thundering herd on service restoration.

Key takeaways

  1. Retry only transient errors: 429, 5xx, and network failures. Never retry 400, 401, 403, or content-policy errors; they are deterministic.
  2. Use exponential backoff with jitter. Start at 1 second, cap at 30, add up to 2 seconds of random jitter per attempt.
  3. Respect the retry-after header on 429 responses when the provider gives you one. It is more accurate than exponential guessing.
  4. Cap total attempts at 4. More retries rarely help and always cost.
  5. Log attempt count, error class, and trace ID on every retry. Do not log full request bodies; they contain secrets.
  6. To see retries wired into a full production agent stack with circuit breakers and observability, walk through the Build your own coding agent course, or start with the AI Agents Fundamentals primer.

For the canonical reference on exponential backoff and jitter, see the AWS Architecture Blog on retries.
