10.14 Retry-Fallback

A primary agent attempts the request; on failure or timeout it retries up to a configured limit, then escalates to a fallback agent backed by a lower-priority model or a cached response. All outcomes are normalized through a single Response Merger before returning to the caller.


Motivating Scenario

A B2B SaaS platform routes 40,000 AI completions per day across a model fleet. The primary model (GPT-4) times out or returns a 5xx error on roughly 2% of requests — 800 failures per day — due to provider rate limits and occasional outages. Without a fallback, each of those 800 requests surfaces as a hard error to the customer. At 0.8% visible error rate, enterprise customers begin filing SLA breach reports.

After deploying a Retry-Fallback pattern with the chain GPT-4 → Claude Sonnet → Llama-3-70B (self-hosted), the customer-visible error rate drops to 0.04%: 16 requests per day reach the degraded response path, and most of those are during full provider outages affecting all API models simultaneously. The fallback chain consumes roughly 3% of total compute cost while eliminating 98% of customer-visible failures. The self-hosted Llama tier additionally decouples the platform from simultaneous multi-provider outages.

Structure

[Structure diagram — concrete example: B2B SaaS multi-model completion routing]

Key Metrics

| Metric | Signal |
| --- | --- |
| Customer-visible error rate | Primary SLA metric — percentage of requests that return a hard error or degraded response to the caller. Target varies by SLA tier; typical enterprise target is below 0.1%. |
| Fallback invocation rate | Percentage of requests that reach the Fallback Agent. A rising trend signals primary model degradation before it becomes visible to customers. Sudden spikes indicate provider outage. |
| Retry efficiency ratio | Percentage of retries that succeed (primary recovers on second or third attempt) vs. total retry attempts. A low ratio means retries are consuming cost without recovering requests — reduce the retry limit and route to fallback sooner. |
| Fallback quality delta | Downstream error or correction rate for fallback-sourced responses vs. primary-sourced responses. A non-zero delta quantifies the quality cost of fallback invocations and informs whether the fallback chain is adequate. |
| Degraded response rate | Percentage of requests resolved by the Degraded Response path. Should be near zero in normal operation. A sustained non-zero rate indicates multi-provider outage or a systematic issue with fallback provisioning. |
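All of these metrics reduce to ratios over request counters. A hypothetical sketch — the counter names are mine, and the fallback quality delta is omitted because it requires downstream error labels segmented by source model:

```python
def key_metrics(counters):
    """Compute four of the five key metrics from raw request counters.

    `counters` maps illustrative counter names to totals for a window:
    'requests', 'visible_errors' (hard errors plus degraded responses),
    'fallback_calls', 'retry_successes', 'retry_attempts',
    'degraded_responses'.
    """
    total = counters["requests"]
    return {
        "customer_visible_error_rate": counters["visible_errors"] / total,
        "fallback_invocation_rate": counters["fallback_calls"] / total,
        # Guard against division by zero when no retries occurred.
        "retry_efficiency_ratio":
            counters["retry_successes"] / max(counters["retry_attempts"], 1),
        "degraded_response_rate": counters["degraded_responses"] / total,
    }
```

Using the motivating scenario's numbers (40,000 requests, 16 visible failures per day), this yields the 0.04% customer-visible error rate quoted above.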
| Node | What it does | What it receives | What it produces |
| --- | --- | --- | --- |
| Primary Agent | Attempts completion with the highest-quality model in the fleet; records attempt count and any error code on failure | Request payload + attempt count | Completion response or error signal with error type |
| Success Gateway | Evaluates whether the Primary Agent returned a successful response; branches to Response Merger on success, Retry Gate on failure | Agent response or error signal | Routing decision: success path or retry path |
| Retry Gate | Checks whether the attempt count is below the configured retry limit (default: 2); routes back to Primary Agent if retries remain, to Fallback Agent if exhausted | Error signal + attempt count | Routing decision: retry primary or escalate to fallback |
| Fallback Agent | Attempts completion with the next model in the fallback chain (Claude Sonnet or Llama-3-70B); applies the same request payload with any model-specific prompt adjustments | Original request payload + fallback model config | Completion response or failure signal |
| Degraded Response | Returns a cached response for the nearest matching prior request, or a structured graceful-degradation message indicating which capability is unavailable | Original request + response cache | Cached or minimal response with degradation flag |
| Response Merger | Normalizes the response schema regardless of source (primary, fallback, or degraded); attaches metadata indicating which path was taken and model used | Response from any path + source metadata | Normalized response with source tag and latency metrics |
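The routing these nodes describe can be sketched end to end. A minimal Python sketch — `run_with_fallback`, `FallbackError`, and `merge` are illustrative names of mine, not from the source:

```python
class FallbackError(Exception):
    """Raised by a model callable when the attempt fails."""

def run_with_fallback(request, primary, fallbacks, cache, retry_limit=2):
    """Retry-Fallback control flow: primary with retries, then the
    fallback chain, then the degraded-response path.

    `primary` and each entry in `fallbacks` are callables that take the
    request dict and either return a response dict or raise FallbackError.
    `cache` maps prompts to previously returned response dicts.
    """
    # Primary Agent + Success Gateway + Retry Gate: initial try plus retries.
    for attempt in range(1 + retry_limit):
        try:
            return merge(primary(request), source="primary", attempt=attempt)
        except FallbackError:
            continue  # retries remain -> back to Primary Agent

    # Retries exhausted: escalate down the fallback chain.
    for model in fallbacks:
        try:
            return merge(model(request),
                         source=getattr(model, "__name__", "fallback"))
        except FallbackError:
            continue

    # Degraded Response: nearest cached answer, or a graceful-degradation stub.
    cached = cache.get(request.get("prompt"))
    return merge(cached or {"text": "capability temporarily unavailable"},
                 source="degraded", degraded=True)

def merge(response, source, **meta):
    """Response Merger: normalize the schema and attach source metadata."""
    return {"text": response.get("text", ""), "source": source, "meta": meta}
```

Because every path exits through `merge`, callers see one schema regardless of which tier answered — the property the Response Merger exists to guarantee.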

When to Use

Use when
Avoid when

Value Profile

| Origin of Value | Where it appears | How it is captured |
| --- | --- | --- |
| Future Cashflow | SLA preservation | Customer-visible error rate is a direct churn driver in B2B SaaS. Reducing it from 0.8% to 0.04% is not a 95% improvement — it is the difference between SLA compliance and SLA breach. The Retry-Fallback pattern monetizes reliability without requiring the primary model to be more reliable. |
| Risk Exposure | Primary model failure rate | Dependency on a single model provider is a concentration risk. The fallback chain is the hedge. Its value is inverse to primary reliability: when the primary is healthy, fallback consumes near-zero resources; when the primary degrades, fallback absorbs the entire load. |
| Conditional Action | Retry loop compute | Each retry attempt consumes the same compute as the original request. Two retries before fallback can triple the cost of a failed primary request. At 2% failure rate with 2 retries, total cost increase is 4% — acceptable. At 20% failure rate, retries become a significant cost center. |
| Governance | Fallback selection policy | The choice of fallback model encodes a quality floor policy. Using a self-hosted open model as the final tier removes dependence on any external provider and establishes an absolute availability guarantee. This governance decision has security and data-residency implications that must be evaluated at design time. |
VCM analog: Redundant Work Token. The fallback agent earns its position only during primary failure. In steady state it consumes no compute and produces no value — but its existence is what makes the primary's SLA contractually defensible. Value is latent reliability, not throughput. Like a standby node in a distributed system, the fallback's cost is denominated in optionality, not utilization.
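The 4% figure above is a worst-case bound: it assumes every failing request burns its full retry budget, with each attempt costing the same as the original request. A one-line sketch (the function name is mine):

```python
def retry_cost_overhead(failure_rate, retry_limit):
    """Worst-case extra compute as a fraction of baseline spend:
    every failing request is assumed to consume all `retry_limit`
    additional attempts at full per-request cost."""
    return failure_rate * retry_limit
```

At a 2% failure rate with 2 retries this gives 0.04 — the 4% in the table. At a 20% failure rate it gives 0.40, which is the point where retries stop being noise and become a cost center.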

Dynamics and Failure Modes

Thundering herd on model outage

A primary model provider experiences a 15-minute partial outage. Failure rate spikes from 2% to 60%. All 24,000 affected requests immediately retry twice and then hit the fallback agent simultaneously. The fallback agent (Claude Sonnet) was provisioned for 800 requests per hour at steady state — not 24,000 requests in 15 minutes. Fallback latency spikes from 800ms to 45 seconds; many requests time out at the fallback tier. The thundering herd turns a partial primary outage into a near-total service outage. Fix: implement exponential backoff with jitter on retries; use a Circuit Breaker to stop sending to the primary once error rate exceeds a threshold, routing directly to fallback rather than attempting primary at all.
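The first half of the fix — exponential backoff with jitter — is small enough to sketch. The helper name is mine; the full-jitter scheme follows the widely used AWS formulation:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff: sleep a random duration in
    [0, min(cap, base * 2**attempt)] seconds before retry `attempt`.

    The randomness desynchronizes clients, so a provider outage does not
    release every retry at the same instant onto the fallback tier."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A caller would `time.sleep(backoff_delay(attempt))` before each retry. The cap keeps late retries bounded; the jitter — not the exponent — is what prevents the synchronized stampede described above.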

Fallback quality cliff (silent degradation)

The Retry-Fallback chain escalates to Llama-3-70B as the tertiary model. For simple completions, Llama-3-70B performs comparably to GPT-4. For complex multi-step reasoning tasks that constitute 15% of the request mix, it produces subtly wrong outputs that pass structural validation but contain reasoning errors. These errors are invisible to the Response Merger, which only normalizes format. Customers notice incorrect outputs hours later when reviewing AI-generated reports. Fix: log the source model tag in every response. Monitor downstream error rates segmented by source model. If fallback-sourced responses have higher downstream error rates, implement task-type routing: route complex tasks to Fallback Agent only if a simpler fallback model can handle them; otherwise surface a partial failure rather than a confidently wrong answer.

Retry amplification on non-transient errors

A request with a malformed system prompt that triggers a content policy rejection receives a 400 error. The Retry Gate does not distinguish between transient errors (5xx, timeout) and non-transient errors (4xx). It retries the identical request twice, receiving the same 400 error each time, then escalates to the Fallback Agent with the same malformed prompt. The Fallback Agent also rejects it. All three model calls are wasted compute, and the Degraded Response path returns a cached answer that is unrelated to the actual request. Fix: classify errors before the Retry Gate. Non-transient errors (4xx, content policy, token limit exceeded) should bypass the retry loop and route directly to a validation error handler, not to fallback.
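The classification step can be sketched as follows. The helper name is mine; the status-code classes follow standard HTTP semantics, including treating 429 (rate limited) as retryable even though it is a 4xx:

```python
def classify_error(status_code, error_type=None):
    """Route errors *before* the Retry Gate: only transient failures earn
    a retry; deterministic failures go straight to a validation handler
    instead of wasting identical calls on retries and fallback."""
    if error_type in {"timeout", "connection_reset"}:
        return "retry"                    # transient network failure
    if status_code == 429:
        return "retry"                    # rate limited: retry with backoff
    if status_code is not None and 500 <= status_code < 600:
        return "retry"                    # provider-side transient error
    if error_type in {"content_policy", "token_limit_exceeded"}:
        return "validation_error"         # will fail identically on retry
    if status_code is not None and 400 <= status_code < 500:
        return "validation_error"         # malformed request: do not retry
    return "retry"                        # unknown: default to the retry path
```

With this gate in place, the malformed-prompt scenario above costs one model call and one validation error, not three wasted calls and an unrelated cached answer.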

Variants

| Variant | Modification | When to use |
| --- | --- | --- |
| Circuit Breaker | Monitors primary error rate over a rolling window; when it exceeds a threshold, "opens the circuit" — all requests route directly to fallback without attempting primary, until a probe request succeeds and the circuit closes | Provider outage scenarios where retrying primary amplifies load rather than recovering requests; prevents thundering herd at the cost of latency optimization during recovery |
| Fallback Chain | Extends the single fallback to a prioritized list of three or more models (e.g., GPT-4 → Claude Sonnet → Llama-3-70B → cached response); each tier is attempted before escalating to the next | Maximum availability requirements across simultaneous multi-provider outages; self-hosted model as final tier removes all external dependencies |
| Speculative Execution | Fires primary and fallback simultaneously from the start; uses the first successful response and cancels the other; discards the retry logic entirely | Latency is more critical than cost; acceptable to pay 2x compute on every request to eliminate the latency penalty of sequential failure detection and fallback escalation |
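The Speculative Execution variant can be sketched as a thread-pool race. This is an illustrative sketch (model callables and the function name are mine; Python 3.9+ for `cancel_futures`); note that already-running threads cannot truly be cancelled, only abandoned:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def speculative(request, models, timeout=10.0):
    """Fire every model at once; return the first successful result.

    `models` is a list of callables taking the request and returning a
    response, or raising on failure. Returns None if all fail or the
    timeout elapses with nothing finished.
    """
    pool = ThreadPoolExecutor(max_workers=len(models))
    try:
        pending = {pool.submit(m, request) for m in models}
        while pending:
            done, pending = wait(pending, timeout=timeout,
                                 return_when=FIRST_COMPLETED)
            if not done:
                return None  # timed out with nothing finished
            for f in done:
                if f.exception() is None:
                    return f.result()  # first success wins
        return None  # every model raised
    finally:
        # Do not wait for the losers; unstarted work is cancelled,
        # in-flight threads are abandoned.
        pool.shutdown(wait=False, cancel_futures=True)
```

The 2x compute cost is visible here: both calls are paid for on every request, which is exactly the trade the variant accepts in exchange for eliminating sequential failure-detection latency.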

Related Patterns

| Pattern | Relationship |
| --- | --- |
| 10.11 Pipeline | Retry-Fallback is a single-node resilience pattern; pipeline-level retry is more complex because failure at stage N requires re-running from stage N, not from the beginning |
| 10.15 Evaluator-Optimizer | Quality-driven retry (Evaluator rejects output and triggers regeneration) is architecturally similar but semantically distinct — availability-driven retry uses error codes; quality-driven retry uses output scoring |
| 20.23 Orchestrator-Workers | Speculative Execution variant converges with competitive parallel execution — run primary and fallback simultaneously and use the first valid response, which is the core mechanic of competitive evaluation in multi-worker patterns |

Investment Signal

Retry-Fallback is infrastructure, not product. A firm that has built a multi-model fallback chain with self-hosted tertiary models has made a capital investment in availability that competitors cannot easily replicate. The self-hosted tier is particularly valuable: it converts external API risk into internal operational risk, which is controllable.

The Response Merger is the integration point. If the merger produces a normalized schema regardless of source model, downstream consumers are shielded from model-specific output variations. This schema stability is what makes the fallback chain invisible to the application layer — and invisible infrastructure is durable infrastructure.

Due diligence question: can the firm demonstrate the fallback invocation rate and fallback quality delta over the last 90 days? If they cannot segment response quality by source model, they do not know whether their fallback chain is trustworthy or merely available — and those are not the same property.