When the LLM provider went down, your whole agent went with it

OpenAI had a 20-minute incident. Your service has retry logic with 3 attempts and exponential backoff. Every request that came in during the incident retried 3 times, waited 2+4+8 seconds on backoff, and eventually timed out. Each failed request consumed a Uvicorn worker for roughly 14 seconds. Your normal capacity of 4 workers was pinned on failing requests. Every new user got a 504 because there were no free workers. The provider outage became an outage of your own service, even for things that did not need the LLM.

This is the cascade, and it is what retry logic alone cannot prevent. Retries are the right answer for a 2-second blip. They are the wrong answer for a 20-minute outage because they amplify the problem instead of containing it. The fix is a circuit breaker: a small state machine that notices when the LLM is failing and fails fast for the duration of the outage, freeing your workers to serve other traffic.

This post covers the circuit breaker pattern adapted for LLM calls: the 3 states (closed, open, half-open), the failure window that decides when to trip, and a 50-line implementation that runs in a single Python service with no external dependencies.

Why do retries alone make outages worse?

Because retries turn a single failure into a multiplied failure. 3 concrete amplification effects:

  1. Capacity exhaustion. Every retrying request holds a worker for the duration of all its retries. With 3 retries and exponential backoff, a single failing request can hold a worker for 14+ seconds. Your worker count becomes your failure amplification factor.
  2. Thundering herd. When the provider recovers, every retry fires simultaneously because they all hit the end of their backoff at the same time. The provider immediately hits its per-second rate limit and fails again. You do not recover.
  3. Retry budget waste. You retry every call during the outage. Each retry burns real latency on your side, and depending on how the call fails (a timeout after tokens were already generated, for example), it can burn real money at the provider too.

A circuit breaker decides which of those two situations you are in and switches strategies automatically.
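The capacity math from effect 1 is worth making concrete. Using the numbers from the opening incident (3 attempts, 2s/4s/8s backoff, 4 workers):

```python
# Back-of-envelope: how long one failing request pins a worker, and how
# little failing traffic it takes to saturate the service.
backoff_waits = [2, 4, 8]          # seconds of backoff across the 3 attempts
hold_seconds = sum(backoff_waits)  # ~14s per failing request (plus call time)
workers = 4

print(hold_seconds)                      # 14
# Above workers / hold_seconds failing requests per second, every worker
# is pinned and healthy traffic starts seeing 504s.
print(round(workers / hold_seconds, 2))  # 0.29
```

At roughly 0.29 failing requests per second, a 4-worker service is fully pinned. That is why a modest trickle of traffic during an outage is enough to take the whole service down.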

graph LR
    Closed[Closed: calls pass through] -->|threshold failures| Open[Open: fail fast]
    Open -->|timeout expires| Half[Half-open: try one call]
    Half -->|success| Closed
    Half -->|failure| Open

    style Closed fill:#dcfce7,stroke:#15803d
    style Open fill:#fee2e2,stroke:#b91c1c
    style Half fill:#fef3c7,stroke:#b45309

3 states. Closed is normal. Open is "provider is down, do not call it." Half-open is "let one test call through to see if the provider recovered." The transitions are driven by observed failures and a recovery timeout.

What are the 3 states of a circuit breaker?

Closed

The normal operating state. Every call passes through to the provider. The breaker counts failures in a rolling window. If the failure count exceeds a threshold, the breaker transitions to open.

Failure threshold: I typically use "5 failures within 60 seconds." Below 5, transient issues are handled by retry logic. Above 5 in 60 seconds, something is systemically wrong and the provider should be considered down.

Open

The breaker is tripped. Every call fails immediately without touching the provider. The failure is a specific exception type that the caller can recognize and handle (usually by returning a friendly degraded response to the user).

The breaker stays open for a fixed cooldown period (typically 30 seconds). During the cooldown, your workers are not tied up waiting for the provider. They can serve other traffic instantly.

Half-open

When the cooldown expires, the breaker transitions to half-open. The next single call is allowed through. If it succeeds, the breaker transitions back to closed. If it fails, the breaker transitions back to open and the cooldown restarts.

Only one call is allowed in half-open at a time. Subsequent calls during the half-open window fail fast like in the open state. This prevents a thundering herd of test calls when the cooldown expires.

How do you implement a circuit breaker in Python?

A class with a state field, a failure counter, a threshold, and a cooldown timer. Wrap LLM calls with the breaker; the breaker decides whether to pass the call through or fail fast.

# filename: circuit_breaker.py
# description: A minimal circuit breaker for LLM calls. Tracks failures
# in a rolling window, trips open on threshold, recovers via half-open.
import time
from dataclasses import dataclass, field
from typing import Callable, Any


class CircuitOpenError(Exception):
    """Raised when the breaker is open and a call is blocked."""


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    window_seconds: float = 60.0
    cooldown_seconds: float = 30.0
    state: str = 'closed'
    failures: list[float] = field(default_factory=list)
    opened_at: float = 0.0

    def _prune(self) -> None:
        cutoff = time.time() - self.window_seconds
        self.failures = [t for t in self.failures if t > cutoff]

    def _record_failure(self) -> None:
        now = time.time()
        if self.state == 'half-open':
            # The probe failed: reopen immediately and restart the cooldown,
            # regardless of how many failures remain in the window.
            self.state = 'open'
            self.opened_at = now
            return
        self.failures.append(now)
        self._prune()
        if len(self.failures) >= self.failure_threshold:
            self.state = 'open'
            self.opened_at = now

    def _record_success(self) -> None:
        if self.state == 'half-open':
            self.state = 'closed'
            self.failures = []

    def call(self, fn: Callable[..., Any], *args, **kwargs) -> Any:
        if self.state == 'open':
            if time.time() - self.opened_at >= self.cooldown_seconds:
                self.state = 'half-open'
            else:
                raise CircuitOpenError('circuit is open')

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise

        self._record_success()
        return result

50 lines for the whole state machine, no external dependencies. The class is not thread-safe, and in the default Uvicorn setup it does not need to be: workers are separate processes that share no memory, so each worker holds its own independent breaker, and within a sync worker only one call is in flight at a time (which is what makes the half-open single-probe rule hold for free). For a shared breaker coordinated across workers, use Redis as a backend (same pattern as the token bucket in the Rate Limiting FastAPI Agents: Token Buckets in Production post).
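To watch the state machine run end to end, here is a quick simulation. It uses a condensed, self-contained copy of the breaker with a deliberately short cooldown so the full closed → open → half-open → closed cycle finishes in under a second:

```python
# filename: breaker_demo.py
# description: Trip the breaker with 5 failures, observe fail-fast while
# open, then recover through the half-open probe.
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and a call is blocked."""


class CircuitBreaker:
    """Condensed copy of the breaker above, parameterised for a fast demo."""

    def __init__(self, failure_threshold=5, window_seconds=60.0,
                 cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.state = 'closed'
        self.failures = []
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == 'open':
            if time.time() - self.opened_at >= self.cooldown_seconds:
                self.state = 'half-open'   # cooldown over: allow one probe
            else:
                raise CircuitOpenError('circuit is open')
        try:
            result = fn(*args, **kwargs)
        except Exception:
            now = time.time()
            if self.state == 'half-open':
                # probe failed: reopen and restart the cooldown
                self.state, self.opened_at = 'open', now
            else:
                cutoff = now - self.window_seconds
                self.failures = [t for t in self.failures if t > cutoff]
                self.failures.append(now)
                if len(self.failures) >= self.failure_threshold:
                    self.state, self.opened_at = 'open', now
            raise
        if self.state == 'half-open':      # probe succeeded: close
            self.state = 'closed'
            self.failures = []
        return result


breaker = CircuitBreaker(failure_threshold=5, cooldown_seconds=0.2)


def flaky(up):
    if not up:
        raise RuntimeError('provider 503')
    return 'ok'


# 5 failures inside the window trip the breaker.
for _ in range(5):
    try:
        breaker.call(flaky, False)
    except RuntimeError:
        pass
state_after_trip = breaker.state
print(state_after_trip)                 # open

# While open, calls are blocked instantly, even though the provider is back.
failed_fast = False
try:
    breaker.call(flaky, True)
except CircuitOpenError:
    failed_fast = True
print(failed_fast)                      # True

# After the cooldown, the half-open probe succeeds and the breaker closes.
time.sleep(0.25)
result = breaker.call(flaky, True)
print(result, breaker.state)            # ok closed
```

The 0.2-second cooldown is only for the demo; in production you would keep the 30-second default.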

How does the circuit breaker compose with retries?

The retry layer runs inside the breaker layer. The breaker decides whether to attempt the call at all. If the breaker is closed, the call runs and may internally retry on transient failures. If the breaker is open, the call fails immediately without retries.

# filename: composed.py
# description: Compose retry and circuit breaker. Breaker wraps retry,
# so when the breaker is open, retries do not fire at all.
from circuit_breaker import CircuitBreaker, CircuitOpenError
from tenacity import retry, stop_after_attempt, wait_exponential

breaker = CircuitBreaker()


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    reraise=True,
)
def _call_llm(prompt: str) -> str:
    # real LLM call here
    ...


def safe_llm_call(prompt: str) -> str:
    try:
        return breaker.call(_call_llm, prompt)
    except CircuitOpenError:
        return '[service temporarily unavailable, please try again shortly]'

When the breaker is closed, retries kick in and handle transient errors. When the breaker is open, calls fail instantly and return a graceful degradation string. The user sees a clear message instead of a hung request, and your workers stay free.

How do you handle the circuit-open response?

The graceful degradation matters. 3 ways to handle a circuit-open state:

  1. Return a friendly message. "The AI service is temporarily unavailable. Please try again in a moment." Better than a 500, better than a hung request.
  2. Fall back to a different provider. If you also have a fallback chain from the Resilient LLM Services with Tenacity and Fallback Models post, the breaker opening is the signal to jump to the fallback.
  3. Queue the request for later. For non-real-time workflows, store the request and process it when the breaker closes. Not appropriate for chat-style interactions but great for batch workloads.

My default is option 1 for user-facing chat and option 2 for agent pipelines that have a fallback provider. Option 3 is niche but powerful for batch processing.
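Option 2 can be sketched as a small loop over a provider chain, where each provider call is assumed to be wrapped in its own breaker. Everything here is illustrative: `call_primary` and `call_fallback` are stand-ins for real breaker-wrapped clients, and `CircuitOpenError` is the exception the breaker above raises:

```python
# filename: fallback_chain.py
# description: Sketch of option 2. A CircuitOpenError (or any hard failure)
# from one provider moves the request to the next provider in the chain.


class CircuitOpenError(Exception):
    """Same exception the breaker raises when it is open."""


def call_primary(prompt: str) -> str:
    # stand-in for breaker_primary.call(llm_primary, prompt)
    raise CircuitOpenError('primary breaker is open')


def call_fallback(prompt: str) -> str:
    # stand-in for breaker_fallback.call(llm_fallback, prompt)
    return f'fallback answer to: {prompt}'


def llm_with_fallback(prompt: str) -> str:
    for call in (call_primary, call_fallback):
        try:
            return call(prompt)
        except CircuitOpenError:
            continue   # this provider's breaker is open; skip it instantly
        except Exception:
            continue   # hard failure; try the next provider anyway
    return '[all providers unavailable, please try again shortly]'


print(llm_with_fallback('ping'))
```

Because the open breaker raises immediately, skipping a downed primary costs microseconds, not the 14 seconds a retry loop would burn.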

When should the breaker stay open longer than 30 seconds?

When the outage is known to be long. Most provider status pages post ETAs for ongoing incidents, and it is worth lengthening the cooldown to match: a posted 20-minute ETA justifies opening the breaker for 5 minutes initially, then probing more frequently as the ETA winds down.

For unmonitored outages (no status page update), stick with the default 30-second cooldown. The half-open probe will catch recovery quickly enough, and the downside of a short cooldown is just an extra test call every 30 seconds during the outage.
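A minimal sketch of that policy, assuming a hypothetical status-page poller that supplies `eta_seconds` (the `Breaker` stub stands in for the CircuitBreaker above):

```python
# description: ETA-aware cooldown. Probe a handful of times across the
# posted ETA instead of every 30 seconds; `apply_status_eta` is a
# hypothetical helper, not part of the breaker itself.
from dataclasses import dataclass


@dataclass
class Breaker:
    # stand-in for the post's CircuitBreaker; only the field we tune
    cooldown_seconds: float = 30.0


def apply_status_eta(breaker: Breaker, eta_seconds: float) -> None:
    """Aim for roughly 4 probes over the posted ETA, never below 30s."""
    breaker.cooldown_seconds = max(30.0, eta_seconds / 4)


b = Breaker()
apply_status_eta(b, 20 * 60)   # a 20-minute posted ETA
print(b.cooldown_seconds)      # 300.0 -> open for 5 minutes between probes

apply_status_eta(b, 4 * 60)    # later poll: 4 minutes left on the ETA
print(b.cooldown_seconds)      # 60.0  -> probes speed up as the ETA nears
```

If the poller re-applies the helper as the ETA counts down, probing naturally gets more frequent as recovery approaches, then falls back to the 30-second floor.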

For the broader observability picture that tells you when the breaker is tripping, see the Langfuse Integration for Agentic AI Tracing post. For the full production stack context, the Build Your Own Coding Agent course covers circuit breakers alongside retries, fallbacks, and observability.

What to do Monday morning

  1. Drop the CircuitBreaker class from this post into your service. It is 50 lines with no dependencies.
  2. Wrap your LLM calls with breaker.call(fn, ...). Start with one call site so you can measure the impact before rolling it out.
  3. Set the failure threshold to 5 in 60 seconds and the cooldown to 30 seconds. These are good defaults; tune only if you have specific needs.
  4. Add a graceful degradation handler for CircuitOpenError. Friendly message for users, fallback model for pipelines. Never let the exception escape as a 500.
  5. Add a Prometheus counter on breaker state transitions and alert on frequent open transitions. Those are the signal that a real outage is happening.

The headline: a circuit breaker is 50 lines of state machine that converts a 20-minute outage from "your service is down" into "your service shows a polite message for 20 minutes." Ship it before the next provider incident, not during.

Frequently asked questions

What is a circuit breaker in software systems?

A circuit breaker is a state machine that monitors failures against a downstream service and blocks calls when failures exceed a threshold. It has 3 states: closed (calls pass through), open (calls fail fast), and half-open (one probe call to test recovery). The pattern prevents cascading failures by freeing worker capacity during downstream outages instead of letting requests pile up on retries.

Why do LLM services need circuit breakers in addition to retries?

Because retries only help for brief transient failures. For sustained outages (minutes to hours), retries amplify the problem by holding workers in the retry loop and creating thundering herds. A circuit breaker detects sustained failures and fails fast for the duration, freeing capacity to serve other traffic and preventing the downstream outage from cascading into your own service.

How should I configure circuit breaker thresholds for LLM calls?

Start with 5 failures in a 60-second window and a 30-second cooldown. Adjust based on your traffic: high-volume services can use shorter windows and tighter thresholds; low-volume services need longer windows to accumulate enough signal. The key property is that the threshold is triggered by sustained failure, not by 1 or 2 transient errors.

How do circuit breakers compose with retry logic?

The circuit breaker wraps the retry. When the breaker is closed, retries handle transient errors normally. When the breaker is open, calls fail immediately without entering the retry loop at all. This composition prevents retries from happening during a sustained outage, which is the combination most teams miss.

Should circuit breakers be shared across workers or per-worker?

Per-worker is the default because it has no external dependencies and is simple to implement. Shared breakers using Redis are useful when you want coordinated failure detection across workers but add complexity. Most services can start with per-worker breakers and add shared state only if the per-worker version proves insufficient.

Key takeaways

  1. Retries alone cannot handle sustained provider outages. They amplify the problem by holding workers and creating thundering herds when the provider recovers.
  2. A circuit breaker is a 3-state machine: closed (normal), open (fail fast), half-open (probe for recovery). 50 lines of Python.
  3. Default thresholds: 5 failures in 60 seconds to open, 30 seconds cooldown before trying a single half-open call.
  4. Breaker wraps retry. When closed, retries handle transient errors. When open, calls fail fast and do not retry at all.
  5. Always provide a graceful degradation response for CircuitOpenError. Friendly message for users, fallback provider for pipelines.
  6. To see circuit breakers wired into a full production agent stack with retries, fallbacks, and observability, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.

For the original circuit breaker pattern and its role in the larger microservices resilience story, see Martin Fowler's Circuit Breaker article. The state machine in this post is the same one Fowler describes, adapted for LLM call semantics.
