Your agent goes down every time OpenAI has a bad afternoon

A provider has a bad 30 minutes. Your agent starts returning 500s to users. Your on-call pager fires. You wait for the provider's status page to go green. You restart the service and move on. This happens every few weeks and you treat it as unavoidable because "the provider is down, what can we do."

You can do a lot, actually. 2 patterns turn an LLM outage into a transparent degradation instead of a hard failure: Tenacity retries for transient errors and fallback models for sustained outages. Together they can push your agent's effective uptime above the uptime of any single provider. Clients never see the outage because you silently shifted to a backup.
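The uptime claim is just probability arithmetic. A back-of-envelope sketch, assuming hypothetical 99.5% uptimes for both providers and independent failures (optimistic, since outages can correlate):

```python
# If the fallback only fails when the primary is also down, the chain's
# downtime is the product of the individual downtimes.
primary_uptime, fallback_uptime = 0.995, 0.995   # hypothetical figures
chain_uptime = 1 - (1 - primary_uptime) * (1 - fallback_uptime)
print(round(chain_uptime, 6))  # 0.999975
```

Two providers at three and a half nines each combine into better than four nines, under the independence assumption.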

This post is the retry policy that is aggressive enough to help and not so aggressive that it makes the problem worse, the fallback chain that handles both provider outages and model-specific capacity issues, and the 60 lines of Python that wrap it all up.

Why do LLM calls fail in production?

3 main failure modes, each needing a different response.

  1. Transient errors. 5xx from the provider, network hiccups, timeouts. The call would work if retried in a second. Most outages have bursts of these.
  2. Rate limits. 429 from the provider when you exceed your per-minute token or request quota. Retrying immediately makes it worse; backing off helps.
  3. Sustained outages. The provider is down for 30 minutes. Retries do not help because there is nothing to retry against. You need a fallback.

The default behavior of the official SDKs is "throw an exception." You want "retry transient errors, back off on rate limits, fail over on sustained outages." That is 3 different policies on the same call, layered correctly.

graph TD
    Call[LLM call] --> Primary[Primary: Sonnet]
    Primary -->|transient error| Retry[Retry with backoff]
    Retry -->|success| Done[Return]
    Retry -->|still failing after 3 tries| Fallback1[Fallback 1: GPT-4o]
    Fallback1 -->|transient error| Retry1[Retry]
    Retry1 -->|still failing| Fallback2[Fallback 2: Haiku]
    Fallback2 -->|works| Done

    style Primary fill:#dbeafe,stroke:#1e40af
    style Fallback1 fill:#fef3c7,stroke:#b45309
    style Fallback2 fill:#fef3c7,stroke:#b45309
    style Done fill:#dcfce7,stroke:#15803d

The primary path covers the happy case. The retry path covers transient errors. The fallback path covers sustained outages of a specific provider.
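The three policies can be sketched as a single decision function. This is an illustrative classifier over HTTP status codes, not any SDK's API; the names are hypothetical:

```python
# Hypothetical sketch: map a provider's HTTP status to the policy layer
# that should handle the failure.
def classify(status: int) -> str:
    if status in (400, 401, 403):
        return "fail-fast"            # auth/validation: retrying cannot help
    if status == 429:
        return "retry-with-backoff"   # rate limit: back off before retrying
    if 500 <= status < 600:
        return "retry-then-fallback"  # transient error or sustained outage
    return "fail-fast"

print(classify(429))  # retry-with-backoff
print(classify(503))  # retry-then-fallback
print(classify(401))  # fail-fast
```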

How do you configure Tenacity for LLM retries?

Tenacity is the standard Python retry library. You need 3 pieces: the @retry decorator plus the stop_after_attempt and wait_exponential strategies. Wrap them around any LLM call.

# filename: retry.py
# description: Retry an LLM call with exponential backoff.
# Transient and rate-limit errors retry; auth errors do not.
from tenacity import (
    retry, stop_after_attempt, wait_exponential,
    retry_if_exception_type,
)
from anthropic import (
    Anthropic, APIConnectionError, APITimeoutError,
    InternalServerError, RateLimitError,
)

client = Anthropic()


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    # Whitelist only transient errors. The broad APIError would also match
    # 401s, because AuthenticationError is a subclass of it.
    retry=retry_if_exception_type(
        (APIConnectionError, APITimeoutError, InternalServerError, RateLimitError)
    ),
    reraise=True,
)
def call_sonnet(prompt: str) -> str:
    reply = client.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=1024,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return reply.content[0].text

Read the retry policy carefully. 3 attempts total (1 initial + 2 retries). Exponential backoff starting at 2 seconds, capped at 30. Retries only on transient and rate-limit errors, not on auth errors or schema errors. reraise=True means the final exception is re-raised for the caller to handle.

These 4 settings (attempt count, backoff multiplier, minimum wait, maximum wait) are the ones that matter. Tune them based on your latency budget and the provider's typical recovery time. The defaults above work well for most workloads.
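To see what those settings cost in latency, you can compute the backoff schedule by hand. A sketch that mirrors wait_exponential's clamped doubling (multiplier * 2^attempt, clamped to [min, max]):

```python
def backoff_schedule(attempts, multiplier=1, min_wait=2, max_wait=30):
    # Waits inserted before retry n (n starting at 1), mirroring
    # tenacity's wait_exponential: multiplier * 2**n, clamped.
    return [min(max_wait, max(min_wait, multiplier * 2 ** n))
            for n in range(1, attempts)]

print(backoff_schedule(3))       # [2, 4]: waits between the 3 attempts
print(sum(backoff_schedule(3)))  # 6 seconds of worst-case added latency
print(backoff_schedule(6))       # [2, 4, 8, 16, 30]: the cap kicks in
```

The 3-attempt policy adds at most 6 seconds of waiting before the final exception, which is the number to check against your latency budget.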

What errors should you not retry?

3 categories of errors that should fail fast, not retry.

  1. Authentication errors (401). Your API key is wrong or revoked. Retrying will not help and every retry wastes time.
  2. Validation errors (400). Your request body is malformed or the prompt triggered a content filter. Retrying sends the same bad request.
  3. Permission errors (403). The model is not available for your account or the request violates usage policy. Not a transient condition.

Tenacity's retry_if_exception_type lets you whitelist retryable errors. Anything not in the list fails fast. This is the right default; opt in to retries only for the specific transient conditions you want to recover from.
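The fail-fast behavior is easy to demonstrate without any SDK. A minimal hand-rolled sketch using stdlib stand-ins for the provider error types:

```python
# Stand-ins for SDK transient errors; the real whitelist would use
# the provider SDK's connection/timeout/rate-limit exception classes.
RETRYABLE = (ConnectionError, TimeoutError)

def call_with_retries(fn, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts:
                raise  # reraise=True equivalent: surface the last error
        # non-whitelisted exceptions propagate immediately (fail fast)

counter = {"auth": 0, "transient": 0}

def auth_failure():
    counter["auth"] += 1
    raise PermissionError("401-style error: not retryable")

def transient_failure():
    counter["transient"] += 1
    raise TimeoutError("flaky network: retryable")

for fn in (auth_failure, transient_failure):
    try:
        call_with_retries(fn)
    except Exception:
        pass

print(counter)  # {'auth': 1, 'transient': 3}
```

The auth failure is attempted exactly once; the transient failure burns the full retry budget. That is the whitelist doing its job.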

How do you build a fallback chain?

A list of models, each with its own retry policy. Try the primary. If it fails after retries, try the next. Continue until one succeeds or the chain is exhausted.

# filename: fallback.py
# description: A fallback chain of LLM providers. Try primary with retries,
# then the next model, then the next, until one succeeds.
import logging
from tenacity import retry, stop_after_attempt, wait_exponential
from anthropic import Anthropic, APIError
from openai import OpenAI, OpenAIError

anthropic = Anthropic()
openai = OpenAI()


def _call_sonnet(prompt: str) -> str:
    reply = anthropic.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=1024,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return reply.content[0].text


def _call_gpt4(prompt: str) -> str:
    reply = openai.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': prompt}],
    )
    return reply.choices[0].message.content


def _call_haiku(prompt: str) -> str:
    reply = anthropic.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=1024,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return reply.content[0].text


CHAIN = [
    ('sonnet', _call_sonnet),
    ('gpt-4o', _call_gpt4),
    ('haiku', _call_haiku),
]


def resilient_call(prompt: str) -> str:
    last_error = None
    for name, fn in CHAIN:
        # Each provider gets its own retry budget. For brevity this retries
        # on any exception; in production, whitelist transient errors with
        # retry_if_exception_type as in retry.py.
        retrying = retry(
            stop=stop_after_attempt(3),
            wait=wait_exponential(multiplier=1, min=2, max=30),
            reraise=True,
        )(fn)
        try:
            return retrying(prompt)
        except (APIError, OpenAIError) as exc:
            logging.warning('provider %s failed after retries: %s', name, exc)
            last_error = exc
    raise RuntimeError(f'all providers failed: {last_error}')

Read the chain. Primary is Claude Sonnet. Fallback 1 is GPT-4o (different provider, handles Anthropic outages). Fallback 2 is Claude Haiku (cheaper, handles Sonnet-specific capacity issues on the Anthropic side). Each call is wrapped in its own Tenacity retry.

The ordering matters. Your primary is the highest-quality option. Fallback 1 is the highest-quality cross-provider option so you survive single-provider outages. Fallback 2 is a cheaper option on the primary provider so you survive model-specific capacity issues without a cross-provider dependency.

How do you prevent fallback chain abuse?

The fallback chain is expensive and slow if it fires often. You want to detect when fallbacks are happening frequently and treat that as a signal to investigate, not as a normal operating mode.

2 observability hooks:

  1. Metrics. Increment a Prometheus counter for every fallback transition. Alert if the counter ticks more than, say, 10 times in 5 minutes. That is a sign the primary is having real problems.
  2. Logs. Log every fallback with the failing provider and the exception. Pipe to your observability stack for forensics.
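A minimal sketch of both hooks, using a plain in-process counter as a stand-in for the metric; in production you would swap in a real prometheus_client Counter with from/to labels:

```python
import logging
from collections import Counter

# In-process stand-in for a Prometheus counter keyed by the transition.
fallback_counter = Counter()

def record_fallback(from_model: str, to_model: str, exc: Exception) -> None:
    fallback_counter[(from_model, to_model)] += 1           # hook 1: metric
    logging.warning("fallback %s -> %s after retries: %s",  # hook 2: log
                    from_model, to_model, exc)

record_fallback("sonnet", "gpt-4o", TimeoutError("read timed out"))
print(fallback_counter[("sonnet", "gpt-4o")])  # 1
```

Call record_fallback from the except branch of the chain loop, right where the warning is already logged.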

For the full observability picture with Langfuse and Prometheus, see the Langfuse Integration for Agentic AI Tracing post. The metrics side pairs especially well with the circuit breaker pattern in the Circuit Breakers for LLM Calls: Preventing Cascading Failures post.

When should you skip retries entirely?

When the call is already inside a loop that will retry at a higher level. If your agent's main loop retries the whole turn on failure, adding a second retry layer inside the LLM call creates exponential retry explosions: 3 LLM retries * 3 loop retries = 9 total calls for a failure case.

The fix is to pick one retry level, not both. I default to retrying at the LLM call layer (closer to the failure, faster recovery) and failing fast at the loop layer (propagate the error up once the inner retries are exhausted). The opposite pattern works too, but it recovers from transient issues more slowly.
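The multiplication is worth making concrete. A quick sketch of the worst case, using the 2s + 4s backoff schedule from the retry policy earlier:

```python
# Nested retries multiply: each outer attempt re-runs the full inner budget.
inner_attempts = 3           # LLM-call layer
outer_attempts = 3           # agent-loop layer
total_calls = inner_attempts * outer_attempts
inner_backoff = 2 + 4        # waits between the 3 inner attempts (seconds)
worst_case_wait = outer_attempts * inner_backoff
print(total_calls, worst_case_wait)  # 9 calls, 18 seconds of pure waiting
```

Nine calls and 18 seconds of backoff for a single failing turn, before any per-call latency. A third nested layer would make it 27 calls.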

What to do Monday morning

  1. Wrap your LLM calls with @retry(stop=stop_after_attempt(3), wait=wait_exponential(...)). 5 lines of imports, 3 lines of decoration per call. The single biggest reliability improvement for minimal effort.
  2. Whitelist retryable errors with retry_if_exception_type. Include APIError, APIConnectionError, RateLimitError for Anthropic; OpenAIError subclasses for OpenAI. Exclude auth and validation errors.
  3. Add a fallback chain with at least one cross-provider option. Claude plus GPT-4o, or Claude plus Gemini. Covers single-provider outages.
  4. Instrument the fallback transitions with a Prometheus counter. Alert on high rates; they signal real primary problems that need investigation.
  5. Check that you only retry at one layer, not both. Retries inside retries produce exponential explosions under failure.

The headline: Tenacity retries plus a 3-model fallback chain turn provider outages into transparent degradation. Clients keep working; you handle the outage at the service layer. 60 lines total. Ship it before the next provider incident.

Frequently asked questions

Why do LLM services need retry logic?

Because LLM provider APIs experience transient errors (timeouts, 5xx, connection resets) regularly, and without retries every transient error becomes a user-facing failure. A simple retry policy with exponential backoff converts most transient errors into invisible recoveries. For well-run services, retries lift effective success rate by several percentage points.

How should I configure Tenacity retries for LLM calls?

3 attempts total, exponential backoff starting at 2 seconds with a 30-second cap, and a whitelist of retryable exception types (API errors, connection errors, rate limits). Use reraise=True so the caller sees the original exception if retries are exhausted. Exclude auth and validation errors from the retry list because they will never recover.

What is a fallback chain for LLM calls?

A list of models ordered by preference. You try the primary first with retries. If retries are exhausted, you try the next model in the chain, also with retries. Continue until one succeeds. This survives sustained outages of specific providers or models without requiring manual intervention or user-facing failures.

How do you choose fallback models?

Pick 1 cross-provider fallback and 1 cheaper same-provider fallback. Cross-provider (Claude plus GPT-4o) covers single-provider outages. Same-provider cheaper model (Sonnet plus Haiku) covers model-specific capacity issues without adding a cross-provider dependency. 2 fallbacks is usually enough; more than 3 adds complexity without meaningful uptime gains.

Should I retry at the LLM call layer or the agent loop layer?

Pick one, not both. Retries nested inside retries produce exponential call explosions under failure. I default to retries at the LLM call layer because they recover from transient errors faster. The agent loop layer fails fast and propagates errors up. Either pattern works; nested retries do not.

Key takeaways

  1. LLM provider APIs have regular transient failures. Without retries, every transient failure becomes a user-facing error. Retries are not optional.
  2. Use Tenacity with 3 attempts, exponential backoff, and a whitelist of retryable exceptions. Auth and validation errors should fail fast.
  3. Build a fallback chain with at least one cross-provider option. This turns single-provider outages into invisible degradations.
  4. Pick fallbacks carefully: one cross-provider, one cheaper same-provider. More than 3 fallbacks is usually complexity without benefit.
  5. Retry at one layer only. Nested retries cause exponential call explosions under sustained failure.
  6. To see retries and fallbacks wired into a full production agent stack with observability and cost control, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.

For the full Tenacity documentation, retry strategies, and advanced usage like custom retry conditions, see the Tenacity docs. The exponential backoff with jitter patterns there are worth adopting for high-concurrency workloads.
