AI Agent Resilience in Production: Retry, Fallback, and Circuit Breaker Patterns with Python

Your AI agent works in dev but crashes in production. Learn how to implement retry with exponential backoff, multi-provider fallback chains, and circuit breakers in Python to build resilient AI agents that handle LLM API failures gracefully.

Your AI agent demo works perfectly. It chains tool calls, reasons through ambiguity, and returns beautifully structured results. Then you deploy it, and the first thing that happens is OpenAI returns a 429, your fallback prompt throws a validation error, and the entire pipeline crashes at 2 AM with nobody watching.

Sound familiar? I've been there more times than I'd like to admit. Production AI agents fail in ways that traditional software rarely does. For LLM APIs, rate limits, timeouts, content policy rejections, context window overflows, and model version changes aren't edge cases; they're guaranteed failure modes that happen every single day at scale. A 2026 study of enterprise AI deployments found that agents achieve only 60% success on single runs, dropping to 25% across eight consecutive runs without resilience engineering.

This guide covers the three essential resilience patterns — retries with exponential backoff, multi-provider fallbacks, and circuit breakers — with production-ready Python code you can deploy today. We'll also dig into error classification, graceful degradation, and how to layer these patterns into a defense-in-depth architecture that keeps your agents running when everything around them is breaking.

Why LLM APIs Fail Differently Than Traditional APIs

Before diving into patterns, it's worth understanding why AI agent resilience requires different thinking than standard API error handling. Traditional REST APIs fail in predictable ways — a database is down, a service is unreachable, authentication expires. LLM APIs? They introduce a whole new category of failure modes that are fundamentally non-deterministic.

The Unique Failure Modes of LLM APIs

  • Rate limiting (HTTP 429) — Every major provider throttles requests based on tokens-per-minute and requests-per-minute quotas. Under burst traffic, you'll hit these limits constantly.
  • Context window overflow — Your agent accumulates tool call results, conversation history, and retrieved documents until the request exceeds the model's token limit. This one's a silent killer in long-running agent loops.
  • Content policy rejections — The model refuses to process certain inputs due to safety filters. This can happen unpredictably based on input phrasing and varies between providers.
  • Timeout and latency spikes — LLM inference is computationally expensive. Complex reasoning tasks can take 30-60 seconds, and provider infrastructure can add unpredictable latency during peak hours.
  • Model version drift — Providers update models without warning. A prompt that produced perfectly structured JSON yesterday may return a subtly different format after a model update.
  • Partial or malformed responses — The model may return truncated output, invalid JSON, or hallucinated field names that break your downstream parsing.

The critical insight here is that these failures are not bugs to eliminate — they're operational realities to engineer around. Resilience isn't optional for production AI agents. It's a core architectural requirement.
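Some of these modes can be engineered around before the request ever leaves your process. As one example, here's a rough sketch of trimming conversation history to dodge context window overflow; the 4-characters-per-token ratio is a crude heuristic of my own (use your provider's tokenizer, e.g. tiktoken, for real counts):

```python
# Rough guard against context window overflow: drop the oldest non-system
# messages until the estimated token count fits the model's budget.
# ~4 characters per token is a crude heuristic, not a real tokenizer.

def estimate_tokens(messages: list[dict]) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return sum(len(m["content"]) // 4 for m in messages)

def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep system messages, drop the oldest turns until the estimate fits."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and estimate_tokens(system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest turn first
    return system + rest
```

Running this before every call in a long agent loop keeps accumulated tool results and history from silently blowing past the model's limit.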

Pattern 1: Retries with Exponential Backoff

Retries are your first line of defense against transient failures. The key principle is simple: not all errors deserve a retry, and retries without backoff will make things worse.

Error Classification: Retry or Fail Fast?

Before implementing retries, you need to classify errors correctly. Retrying a permanent failure wastes time and money. Failing fast on a transient error loses a request that would've succeeded on the second try.

from enum import Enum
from openai import (
    APIError,
    APIConnectionError,
    RateLimitError,
    APITimeoutError,
    AuthenticationError,
    BadRequestError,
    ContentFilterFinishReasonError
)

class ErrorSeverity(Enum):
    TRANSIENT = "transient"      # Retry with backoff
    PERMANENT = "permanent"      # Fail immediately
    DEGRADED = "degraded"        # Switch to fallback

def classify_llm_error(error: Exception) -> ErrorSeverity:
    """Classify an LLM API error to determine the correct recovery strategy."""
    if isinstance(error, (RateLimitError, APITimeoutError, APIConnectionError)):
        return ErrorSeverity.TRANSIENT
    if isinstance(error, AuthenticationError):
        return ErrorSeverity.PERMANENT
    if isinstance(error, BadRequestError):
        if "context_length_exceeded" in str(error):
            return ErrorSeverity.DEGRADED
        return ErrorSeverity.PERMANENT
    if isinstance(error, ContentFilterFinishReasonError):
        return ErrorSeverity.DEGRADED
    # status_code only exists on APIStatusError subclasses, hence getattr
    if isinstance(error, APIError) and getattr(error, "status_code", None) in (500, 502, 503):
        return ErrorSeverity.TRANSIENT
    return ErrorSeverity.PERMANENT

This classifier drives all downstream decisions. Transient errors get retried. Permanent errors fail fast. Degraded errors trigger a fallback to an alternative model or strategy. Honestly, getting this classification right is probably the single most impactful thing you can do for your agent's reliability.

Production Retries with Tenacity

The Tenacity library (v9.1.4, February 2026) is the standard for retry logic in Python. Here's a production-ready retry wrapper for LLM API calls that handles the specific failure modes we classified above.

from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
    before_sleep_log,
    after_log,
)
import logging
import openai

logger = logging.getLogger(__name__)

@retry(
    retry=retry_if_exception_type((
        openai.RateLimitError,
        openai.APITimeoutError,
        openai.APIConnectionError,
    )),
    wait=wait_exponential_jitter(
        initial=1,       # Start at 1 second
        max=60,           # Cap at 60 seconds
        jitter=5,         # Add up to 5 seconds of random jitter
    ),
    stop=stop_after_attempt(5),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    after=after_log(logger, logging.INFO),
    reraise=True,
)
def call_llm_with_retry(
    client: openai.OpenAI,
    model: str,
    messages: list[dict],
    **kwargs
) -> openai.types.chat.ChatCompletion:
    """Call an LLM API with automatic retry on transient failures."""
    return client.chat.completions.create(
        model=model,
        messages=messages,
        timeout=30,
        **kwargs,
    )

A few things to note about this implementation:

  • Jittered exponential backoff — The wait_exponential_jitter strategy prevents the thundering herd problem. When multiple agent instances hit a rate limit simultaneously, pure exponential backoff causes them all to retry at the exact same intervals, creating repeated collision spikes. Adding random jitter spreads retries across time.
  • Selective retry — We only retry on RateLimitError, APITimeoutError, and APIConnectionError. Authentication failures, bad requests, and content policy violations aren't retried because they'll never succeed on a second attempt.
  • Structured logging — The before_sleep_log and after_log callbacks give you observability into retry behavior. In production, this data feeds your monitoring dashboards to identify systemic issues versus transient blips.
  • Hard timeout per request — The timeout=30 prevents any single LLM call from hanging indefinitely, which is critical in agent loops where a stuck call blocks the entire pipeline.
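To make the jitter point concrete, here's a small sketch of the delay schedule, approximating tenacity's wait_exponential_jitter formula (initial * 2**attempt, capped at max, plus uniform random jitter):

```python
import random

def backoff_delays(attempts: int, initial: float = 1.0,
                   cap: float = 60.0, jitter: float = 5.0) -> list[float]:
    """Exponential backoff delays with additive random jitter,
    roughly mirroring tenacity's wait_exponential_jitter."""
    return [
        min(initial * 2 ** n, cap) + random.uniform(0, jitter)
        for n in range(attempts)
    ]
```

Without the jitter term, every client that hit the rate limit at the same moment retries at exactly 1s, 2s, 4s, 8s, producing repeated collision spikes; with it, their schedules diverge and the load spreads out.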

Async Retry for High-Throughput Agents

If your agent processes multiple requests concurrently, you'll want the async variant to avoid blocking the event loop during backoff waits.

from tenacity import (
    AsyncRetrying,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)
import openai

async def call_llm_async(
    client: openai.AsyncOpenAI,
    model: str,
    messages: list[dict],
    **kwargs,
) -> openai.types.chat.ChatCompletion:
    """Async LLM call with retry — non-blocking during backoff."""
    async for attempt in AsyncRetrying(
        retry=retry_if_exception_type((
            openai.RateLimitError,
            openai.APITimeoutError,
            openai.APIConnectionError,
        )),
        wait=wait_exponential_jitter(initial=1, max=60, jitter=5),
        stop=stop_after_attempt(5),
        reraise=True,
    ):
        with attempt:
            return await client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,
                **kwargs,
            )

Pattern 2: Multi-Provider Fallback Chains

Retries handle transient failures within a single provider. But what happens when the entire provider goes down, or when a content policy rejection is provider-specific, or when you need a model with a larger context window? That's where fallback chains come in — they route requests to alternative providers automatically.

Manual Fallback Implementation

Here's a clean, dependency-free fallback chain that you can customize for any combination of providers and models.

import openai
import anthropic
from dataclasses import dataclass

@dataclass
class ModelConfig:
    provider: str
    model: str
    max_tokens: int
    client: object

class LLMFallbackChain:
    """Routes LLM requests through a prioritized chain of providers."""

    def __init__(self, models: list[ModelConfig]):
        self.models = models

    def call(self, messages: list[dict], **kwargs) -> dict:
        errors = []
        for config in self.models:
            try:
                if config.provider == "openai":
                    return self._call_openai(config, messages, **kwargs)
                elif config.provider == "anthropic":
                    return self._call_anthropic(config, messages, **kwargs)
            except Exception as e:
                severity = classify_llm_error(e)
                errors.append({"model": config.model, "error": str(e)})
                if severity == ErrorSeverity.PERMANENT:
                    raise  # Don't fallback on auth errors, etc.
                continue  # Try next provider

        raise RuntimeError(
            f"All {len(self.models)} providers failed: {errors}"
        )

    def _call_openai(self, config, messages, **kwargs):
        response = config.client.chat.completions.create(
            model=config.model,
            messages=messages,
            max_tokens=config.max_tokens,
            timeout=30,
            **kwargs,
        )
        return {
            "content": response.choices[0].message.content,
            "model": config.model,
            "provider": "openai",
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            },
        }

    def _call_anthropic(self, config, messages, **kwargs):
        # Convert OpenAI message format to Anthropic format
        system_msg = next(
            (m["content"] for m in messages if m["role"] == "system"), None
        )
        user_msgs = [m for m in messages if m["role"] != "system"]

        response = config.client.messages.create(
            model=config.model,
            max_tokens=config.max_tokens,
            system=system_msg or "",
            messages=user_msgs,
            timeout=30,
        )
        return {
            "content": response.content[0].text,
            "model": config.model,
            "provider": "anthropic",
            "usage": {
                "prompt_tokens": response.usage.input_tokens,
                "completion_tokens": response.usage.output_tokens,
            },
        }

# Usage
chain = LLMFallbackChain([
    ModelConfig("openai", "gpt-4.1", 4096, openai.OpenAI()),
    ModelConfig("anthropic", "claude-sonnet-4-20250514", 4096, anthropic.Anthropic()),
    ModelConfig("openai", "gpt-4.1-mini", 4096, openai.OpenAI()),
])

This gives you full control over provider routing, error classification per hop, and response normalization. The fallback order is explicit: try the primary model, fall back to a different provider, then fall back to a cheaper model as a last resort.

LiteLLM: Fallbacks with Less Code

If you don't need custom routing logic, LiteLLM provides multi-provider fallbacks out of the box with a unified OpenAI-compatible interface. It's honestly the fastest way to get multi-provider resilience up and running.

import litellm

# Simple fallback — if GPT-4.1 fails, try Claude automatically
response = litellm.completion(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Explain circuit breakers in distributed systems"}],
    fallbacks=["anthropic/claude-sonnet-4-20250514", "gpt-4.1-mini"],
    num_retries=2,
)

print(f"Response from: {response.model}")
print(response.choices[0].message.content)

LiteLLM also supports context window fallbacks — automatically routing to a model with a larger context when the primary model's limit is exceeded — and content policy fallbacks for provider-specific safety filter differences.

from litellm import Router

router = Router(
    model_list=[
        {"model_name": "primary", "litellm_params": {"model": "gpt-4.1", "order": 1}},
        {"model_name": "primary", "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514", "order": 2}},
    ],
    context_window_fallbacks=[
        {"primary": ["anthropic/claude-sonnet-4-20250514"]}
    ],
    enable_pre_call_checks=True,
    num_retries=2,
)

response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "Analyze this document..."}],
)

Pattern 3: Circuit Breakers

Retries and fallbacks handle individual request failures. Circuit breakers solve a different problem entirely: what happens when a provider is down for 10 minutes and every single request in your system spends 30 seconds retrying before failing?

Without a circuit breaker, you're burning compute resources, your users are waiting unnecessarily, and the downstream provider can't recover because it's being hammered with retry traffic. A circuit breaker monitors failure rates over time and "opens the circuit" when failures exceed a threshold — immediately rejecting requests instead of attempting them.

Circuit Breaker States

The pattern operates in three states (analogous to an electrical circuit breaker):

  1. CLOSED (normal operation) — Requests pass through normally. The breaker tracks consecutive failures.
  2. OPEN (fast failure) — After hitting the failure threshold, all requests immediately fail with a CircuitBreakerError without contacting the provider. This gives the provider time to recover.
  3. HALF-OPEN (testing recovery) — After a timeout period, the breaker allows a single test request through. If it succeeds, the circuit closes. If it fails, the circuit reopens.
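Before reaching for a library, it helps to see that the state machine itself is small. Here's a minimal, illustrative implementation of the three states (no thread safety, single-probe half-open; the class and exception names are my own, not from any library):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected fast."""

class SimpleCircuitBreaker:
    """Minimal three-state circuit breaker: CLOSED -> OPEN -> HALF-OPEN.
    Illustrative only: no thread safety, no listeners, one probe request."""

    def __init__(self, fail_max: int = 5, reset_timeout: float = 60.0):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # allow one test request through
            else:
                raise CircuitOpenError("circuit is open, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.fail_max:
                self.state = "open"        # trip (or re-trip) the circuit
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"          # success closes the circuit
            return result
```

Production libraries add the pieces this sketch omits (thread safety, listeners, shared state), which is why the examples below use one.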

Implementation with PyBreaker

The PyBreaker library provides a clean circuit breaker implementation for Python. Here's how to wrap your LLM calls with circuit protection.

import pybreaker
import openai
import logging

logger = logging.getLogger(__name__)

# Listener for observability — push circuit state changes to your metrics
class LLMBreakerListener(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        logger.warning(
            f"Circuit breaker '{cb.name}' state change: {old_state.name} -> {new_state.name}"
        )
        # Push to your metrics system (Prometheus, Datadog, etc.)

    def failure(self, cb, exc):
        logger.error(f"Circuit breaker '{cb.name}' recorded failure: {exc}")

    def success(self, cb):
        logger.info(f"Circuit breaker '{cb.name}' recorded success")


# Create a breaker per provider — do NOT share breakers across providers
openai_breaker = pybreaker.CircuitBreaker(
    fail_max=5,             # Open after 5 consecutive failures
    reset_timeout=60,       # Try again after 60 seconds
    success_threshold=2,    # Require 2 successes before fully closing
    name="openai-gpt4",
    listeners=[LLMBreakerListener()],
)

anthropic_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=60,
    success_threshold=2,
    name="anthropic-claude",
    listeners=[LLMBreakerListener()],
)

@openai_breaker
def call_openai(client: openai.OpenAI, messages: list[dict], **kwargs):
    return client.chat.completions.create(
        model="gpt-4.1",
        messages=messages,
        timeout=30,
        **kwargs,
    )

@anthropic_breaker
def call_anthropic(client, messages: list[dict], **kwargs):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        messages=messages,
        max_tokens=4096,
        timeout=30,
    )

A few critical implementation details worth calling out:

  • One breaker per provider — Never share a single circuit breaker across multiple providers. If OpenAI is down, you don't want the breaker to block Anthropic calls too.
  • Success threshold — Requiring 2 successful requests before fully closing the circuit prevents a single lucky request from restoring traffic to an unstable provider.
  • Listeners for observability — The LLMBreakerListener pushes state changes to your logging and metrics systems. In production, you absolutely need to know when circuits are opening and closing.

Async Circuit Breakers with aiobreaker

For async agents, use aiobreaker, which provides native asyncio support.

from aiobreaker import CircuitBreaker
from datetime import timedelta

openai_breaker = CircuitBreaker(
    fail_max=5,
    timeout_duration=timedelta(seconds=60),  # aiobreaker's name for reset_timeout
    name="openai-async",
)

@openai_breaker
async def call_openai_async(client, messages, **kwargs):
    return await client.chat.completions.create(
        model="gpt-4.1",
        messages=messages,
        timeout=30,
        **kwargs,
    )

Layering Patterns: Defense-in-Depth Architecture

The real power comes from combining all three patterns into a layered defense. Here's the order they should execute, from outermost to innermost.

┌─────────────────────────────────────────────┐
│         Your Agent / Business Logic          │
├─────────────────────────────────────────────┤
│   Response Validation (Pydantic / Zod)       │  ← Catches malformed output
├─────────────────────────────────────────────┤
│   Observability (logging, metrics, traces)   │  ← Records everything
├─────────────────────────────────────────────┤
│   Circuit Breaker (per provider)             │  ← Fast-fails during outages
├─────────────────────────────────────────────┤
│   Fallback Chain (multi-provider routing)    │  ← Switches providers on failure
├─────────────────────────────────────────────┤
│   Retry with Exponential Backoff + Jitter    │  ← Handles transient errors
├─────────────────────────────────────────────┤
│         LLM Provider API                     │
└─────────────────────────────────────────────┘

Let's wire all three patterns together into a complete implementation.

import pybreaker
import openai
import anthropic
import logging
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)
from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

# --- Layer 1: Response validation ---
class AgentResponse(BaseModel):
    content: str
    model: str
    provider: str

# --- Layer 2: Circuit breakers per provider ---
openai_breaker = pybreaker.CircuitBreaker(
    fail_max=5, reset_timeout=60, name="openai"
)
anthropic_breaker = pybreaker.CircuitBreaker(
    fail_max=5, reset_timeout=60, name="anthropic"
)

# --- Layer 3: Retry-wrapped provider calls ---
@retry(
    retry=retry_if_exception_type((
        openai.RateLimitError,
        openai.APITimeoutError,
    )),
    wait=wait_exponential_jitter(initial=1, max=30, jitter=3),
    stop=stop_after_attempt(3),
    reraise=True,
)
@openai_breaker
def _call_openai(client, messages, **kwargs):
    resp = client.chat.completions.create(
        model="gpt-4.1", messages=messages, timeout=30, **kwargs
    )
    return AgentResponse(
        content=resp.choices[0].message.content,
        model="gpt-4.1",
        provider="openai",
    )

@retry(
    # Retry only transient Anthropic errors. A bare Exception here would
    # also retry the CircuitBreakerError raised by the breaker below it.
    retry=retry_if_exception_type((
        anthropic.RateLimitError,
        anthropic.APITimeoutError,
        anthropic.APIConnectionError,
    )),
    wait=wait_exponential_jitter(initial=1, max=30, jitter=3),
    stop=stop_after_attempt(3),
    reraise=True,
)
@anthropic_breaker
def _call_anthropic(client, messages, **kwargs):
    system_msg = next(
        (m["content"] for m in messages if m["role"] == "system"), ""
    )
    user_msgs = [m for m in messages if m["role"] != "system"]
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        system=system_msg,
        messages=user_msgs,
        max_tokens=4096,
        timeout=30,
    )
    return AgentResponse(
        content=resp.content[0].text,
        model="claude-sonnet-4-20250514",
        provider="anthropic",
    )

# --- Layer 4: Fallback chain ---
def resilient_llm_call(messages: list[dict], **kwargs) -> AgentResponse:
    """Complete resilient LLM call with retry + circuit breaker + fallback."""
    providers = [
        ("openai", _call_openai, openai.OpenAI()),
        ("anthropic", _call_anthropic, anthropic.Anthropic()),
    ]
    errors = []

    for name, call_fn, client in providers:
        try:
            response = call_fn(client, messages, **kwargs)
            logger.info(f"Success via {name}: {response.model}")
            return response
        except pybreaker.CircuitBreakerError:
            logger.warning(f"Circuit open for {name}, skipping")
            errors.append(f"{name}: circuit open")
        except Exception as e:
            logger.error(f"Failed {name}: {e}")
            errors.append(f"{name}: {e}")

    raise RuntimeError(f"All providers failed: {errors}")

Notice the ordering: the @retry decorator wraps the circuit breaker decorator, which wraps the actual API call. Every retry attempt therefore passes through the breaker, and every failed attempt counts against fail_max. Once the threshold is crossed, the breaker raises CircuitBreakerError immediately instead of calling the provider; the retry predicate doesn't match that exception, so it propagates straight to the fallback chain and the provider is skipped entirely, with no retries and no waiting. It's elegant once you see it in action.

Graceful Degradation Strategies

Sometimes even your fallback chain can't produce a full-quality response. Maybe all premium models are rate-limited, or the task requires a capability only one provider supports. Graceful degradation means your agent delivers something useful rather than crashing.

Tiered Degradation

class DegradationTier:
    FULL = "full"           # Primary model, full capabilities
    REDUCED = "reduced"     # Fallback model (handled inside resilient_llm_call)
    CACHED = "cached"       # Return cached response from similar query
    STATIC = "static"       # Return pre-written template response

def degraded_response(
    messages: list[dict],
    cache: dict,
    **kwargs,
) -> tuple[AgentResponse, str]:
    """Try each degradation tier in order."""
    # Tier 1: Full quality
    try:
        return resilient_llm_call(messages, **kwargs), DegradationTier.FULL
    except RuntimeError:
        pass

    # Tier 2: Cached response (a production system would use semantic
    # similarity search; this exact-prefix key is a simplification)
    cache_key = messages[-1]["content"][:100]
    if cache_key in cache:
        logger.warning("Serving cached response")
        return cache[cache_key], DegradationTier.CACHED

    # Tier 3: Static fallback
    logger.error("All tiers exhausted — returning static response")
    return AgentResponse(
        content="I'm experiencing temporary difficulties. Please try again in a few minutes.",
        model="static",
        provider="none",
    ), DegradationTier.STATIC

Here's the key principle: users tolerate reduced capability far more than they tolerate crashes or hung requests. A slightly less sophisticated answer is infinitely better than no answer at all.

Agent-Specific Resilience: Tool Call Recovery

AI agents face an additional resilience challenge that simple LLM API wrappers don't: tool call failures. When an agent invokes an external tool — a database query, an API call, a code execution environment — that tool can fail independently of the LLM provider. And in my experience, tool failures actually cause more production incidents than LLM API failures do.

import requests  # used by the example tools below
from functools import wraps

def resilient_tool(max_retries: int = 2, fallback_value=None):
    """Decorator that adds retry and fallback logic to agent tools."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(max_retries + 1):
                try:
                    result = func(*args, **kwargs)
                    return result
                except Exception as e:
                    last_error = e
                    if attempt < max_retries:
                        logger.warning(
                            f"Tool '{func.__name__}' attempt {attempt + 1} failed: {e}"
                        )
                        continue

            # All retries exhausted
            if fallback_value is not None:
                logger.error(
                    f"Tool '{func.__name__}' failed after {max_retries + 1} attempts. "
                    f"Returning fallback: {fallback_value}"
                )
                return fallback_value

            raise last_error
        return wrapper
    return decorator

# Usage — tool returns empty list instead of crashing the agent
@resilient_tool(max_retries=2, fallback_value=[])
def search_knowledge_base(query: str) -> list[dict]:
    """Search the vector database for relevant documents."""
    results = vector_db.similarity_search(query, k=5)  # your vector store client
    return [{"content": r.page_content, "score": r.score} for r in results]

@resilient_tool(max_retries=1, fallback_value={"error": "Service unavailable"})
def call_external_api(endpoint: str, params: dict) -> dict:
    """Call an external API with resilience."""
    response = requests.get(endpoint, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

Tool resilience is especially important for agentic RAG pipelines, where a vector database failure shouldn't crash the entire agent. The agent can still attempt to answer from its parametric knowledge or inform the user that retrieval is temporarily unavailable.
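As a sketch of that idea, here's a hypothetical build_prompt_context helper (the name and prompt wording are my own) that degrades the prompt instead of crashing when retrieval comes back with the empty fallback value:

```python
def build_prompt_context(results: list[dict], query: str) -> str:
    """Build the context section of an agent prompt, degrading gracefully
    when retrieval returned nothing (e.g. the vector DB was down and the
    resilient tool fell back to an empty list)."""
    if not results:
        return (
            "NOTE: document retrieval is temporarily unavailable. "
            "Answer from general knowledge and say so explicitly if "
            f"you are unsure about: {query}"
        )
    docs = "\n\n".join(f"[{i + 1}] {r['content']}" for i, r in enumerate(results))
    return f"Relevant documents:\n{docs}\n\nQuestion: {query}"
```

The agent still answers, and the user learns that retrieval was degraded rather than getting a silent failure or a crash.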

Monitoring and Observability for Resilience

Resilience patterns are only as good as your ability to observe them in production. You need to track specific metrics to know whether your retry, fallback, and circuit breaker configurations are actually working — or silently failing in ways you won't notice until a major outage hits.

Key Metrics to Track

  • Retry rate per provider — If retries spike from 5% to 40%, something's wrong upstream. Set alerts at 20%.
  • Fallback activation rate — How often is your primary provider failing? If fallbacks trigger more than 10% of the time, reconsider your provider choice or tier.
  • Circuit breaker state changes — Every OPEN/CLOSE transition should trigger an alert. Frequent cycling indicates an unstable provider.
  • P95 latency with vs. without retries — Retries add latency. Track the impact so your backoff configuration doesn't degrade user experience.
  • Cost per successful request — Fallbacks to more expensive models inflate costs. Track cost-per-success to catch runaway spending before it becomes a budget problem.
  • Degradation tier distribution — What percentage of responses are served from cache or static fallbacks? This tells you the actual quality your users are experiencing.
A lightweight in-process tracker for these metrics might look like this:

import time
from dataclasses import dataclass, field

@dataclass
class ResilienceMetrics:
    """Lightweight metrics tracker for resilience patterns."""
    total_requests: int = 0
    retries: int = 0
    fallback_activations: int = 0
    circuit_opens: int = 0
    latencies: list[float] = field(default_factory=list)

    @property
    def retry_rate(self) -> float:
        return self.retries / max(self.total_requests, 1)

    @property
    def fallback_rate(self) -> float:
        return self.fallback_activations / max(self.total_requests, 1)

    @property
    def p95_latency(self) -> float:
        if not self.latencies:
            return 0.0
        sorted_lat = sorted(self.latencies)
        idx = int(len(sorted_lat) * 0.95)
        return sorted_lat[idx]

    def report(self) -> dict:
        return {
            "total_requests": self.total_requests,
            "retry_rate": f"{self.retry_rate:.1%}",
            "fallback_rate": f"{self.fallback_rate:.1%}",
            "circuit_opens": self.circuit_opens,
            "p95_latency_ms": f"{self.p95_latency * 1000:.0f}",
        }

Production Checklist

Before shipping your resilient AI agent, run through every item on this checklist. I keep this pinned in my project management tool for every agent deployment.

  1. Error classification — Every LLM and tool error is classified as transient, permanent, or degraded. No unclassified exceptions leak through.
  2. Retries configured per provider — Each provider has its own retry policy with exponential backoff and jitter. No global retry-everything policies.
  3. Circuit breakers per provider — Separate circuit breakers for each provider with appropriate fail_max and reset_timeout values tuned to your traffic patterns.
  4. Fallback chain tested — Manually trigger failures in each provider to verify fallback routing works end-to-end. Test monthly.
  5. Graceful degradation path — When all providers fail, the agent returns a useful response (cached or static) rather than crashing.
  6. Tool resilience — All external tool calls have retry logic and fallback values. A tool failure doesn't crash the agent loop.
  7. Observability in place — Retry rates, fallback activations, circuit state changes, and latency distributions are tracked and alerted on.
  8. Timeout on every external call — No LLM call or tool invocation runs without an explicit timeout. Stuck calls are the silent killer of production agents.
  9. Load testing with failure injection — Simulate provider outages, rate limits, and slow responses under realistic traffic to validate your resilience stack.
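For item 9, you don't need real outages to get started: a pair of stub providers lets you verify fallback routing deterministically. Everything below (ProviderDown, make_flaky, fallback_chain) is illustrative test scaffolding I've made up for the sketch, not a library API:

```python
# Failure-injection sketch: fake providers verify fallback routing
# without touching a real API.

class ProviderDown(Exception):
    """Injected failure standing in for a provider outage."""

def make_flaky(fail_times: int, payload: str):
    """Return a provider stub that fails `fail_times` calls, then succeeds."""
    calls = {"n": 0}
    def provider():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise ProviderDown("injected failure")
        return payload
    return provider

def fallback_chain(providers):
    """Try each (name, provider) pair in order; raise if every one fails."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider()
        except ProviderDown as e:
            errors.append((name, str(e)))
    raise RuntimeError(f"all providers failed: {errors}")
```

The same stubs plug into unit tests for your real chain: inject a permanent outage in the primary and assert the response is served by the backup.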

Frequently Asked Questions

How many retry attempts should I configure for LLM API calls?

Three to five attempts is the standard range for production LLM applications. Start with 3 retries using exponential backoff with jitter (1s, 2s, 4s base delays). If your use case is latency-sensitive (like a chatbot), use 2-3 retries with shorter maximums. For batch processing or background agents, you can go up to 5 retries with longer backoff windows. The key is always pairing retries with a hard timeout — you don't want a request to hang for 5 minutes through 5 retry cycles.

What is the difference between a retry, a fallback, and a circuit breaker?

Retries re-attempt the same request to the same provider after a transient failure, usually with increasing wait times (exponential backoff). Fallbacks route the request to a different provider or model when the primary one fails — they expand where you send requests. Circuit breakers monitor failure rates over time and stop sending requests entirely when a provider is unhealthy, preventing your system from wasting resources on a known-broken service. They work best together: retries handle transient blips, fallbacks handle provider-specific outages, and circuit breakers prevent cascading failures during prolonged outages.

Should I use the same fallback model for all types of failures?

No — and this is a mistake I see a lot of teams make. Different failure types benefit from different fallback strategies. For rate limit errors, fall back to the same model at a different provider or region. For context window errors, fall back to a model with a larger context (e.g., Claude with 200K tokens). For content policy rejections, fall back to a provider with different safety filters. For latency issues, fall back to a smaller, faster model. LiteLLM supports these distinctions natively with context_window_fallbacks and content_policy_fallbacks configuration.

How do circuit breakers work in distributed or multi-instance deployments?

By default, circuit breakers like PyBreaker maintain state in-process, meaning each instance of your agent has its own independent circuit state. In a distributed deployment, this means one instance might have an open circuit while another is still sending requests. For coordinated circuit breaking, use PyBreaker's Redis-backed state storage (CircuitRedisStorage) so all instances share the same circuit state. This matters most for high-traffic systems where you want all instances to stop hitting a failing provider at the same time.

How do I test resilience patterns before deploying to production?

Start with unit tests that mock provider failures (rate limits, timeouts, connection errors) and verify that retries, fallbacks, and circuit breakers activate correctly. Then use chaos engineering in staging: inject failures using tools like LiteLLM's mock_testing_fallbacks=True flag, or use a proxy that randomly drops or delays requests. Finally, run load tests with failure injection to validate behavior under realistic concurrent traffic. The goal is to verify not just that each pattern works in isolation, but that the layered defense behaves correctly when multiple failures compound simultaneously.

Our team of expert writers and editors.