10.14 Retry-Fallback

A primary agent attempts the request; on failure or timeout it retries up to a configured limit, then escalates to a fallback agent backed by a lower-priority model or a cached response. All outcomes are normalized through a single Response Merger before returning to the caller.


Motivating Scenario

A B2B SaaS platform routes 40,000 AI completions per day across a model fleet. The primary model (GPT-4) times out or returns a 5xx error on roughly 2% of requests — 800 failures per day — due to provider rate limits and occasional outages. Without a fallback, each of those 800 requests surfaces as a hard error to the customer. At 0.8% visible error rate, enterprise customers begin filing SLA breach reports.

After deploying a Retry-Fallback pattern with the chain GPT-4 → Claude Sonnet → Llama-3-70B (self-hosted), the customer-visible error rate drops to 0.04%: 16 requests per day reach the degraded response path, and most of those are during full provider outages affecting all API models simultaneously. The fallback chain consumes roughly 3% of total compute cost while eliminating 98% of customer-visible failures. The self-hosted Llama tier additionally decouples the platform from simultaneous multi-provider outages.

Structure

[Structure diagram — concrete example: B2B SaaS multi-model completion routing]

Key Metrics

| Metric | Signal |
| --- | --- |
| Customer-visible error rate | Primary SLA metric — percentage of requests that return a hard error or degraded response to the caller. Target varies by SLA tier; typical enterprise target is below 0.1%. |
| Fallback invocation rate | Percentage of requests that reach the Fallback Agent. A rising trend signals primary model degradation before it becomes visible to customers. Sudden spikes indicate provider outage. |
| Retry efficiency ratio | Percentage of retries that succeed (primary recovers on second or third attempt) vs. total retry attempts. A low ratio means retries are consuming cost without recovering requests — reduce the retry limit and route to fallback sooner. |
| Fallback quality delta | Downstream error or correction rate for fallback-sourced responses vs. primary-sourced responses. A non-zero delta quantifies the quality cost of fallback invocations and informs whether the fallback chain is adequate. |
| Degraded response rate | Percentage of requests resolved by the Degraded Response path. Should be near zero in normal operation. A sustained non-zero rate indicates multi-provider outage or a systematic issue with fallback provisioning. |
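All of these metrics reduce to ratios over request counters. A hypothetical sketch — the counter names are mine, and the fallback quality delta is omitted because it requires downstream error labels segmented by source model:

```python
def key_metrics(counters):
    """Compute four of the five key metrics from raw request counters.

    `counters` maps illustrative counter names to totals for a window:
    'requests', 'visible_errors' (hard errors plus degraded responses),
    'fallback_calls', 'retry_successes', 'retry_attempts',
    'degraded_responses'.
    """
    total = counters["requests"]
    return {
        "customer_visible_error_rate": counters["visible_errors"] / total,
        "fallback_invocation_rate": counters["fallback_calls"] / total,
        # Guard against division by zero when no retries occurred.
        "retry_efficiency_ratio":
            counters["retry_successes"] / max(counters["retry_attempts"], 1),
        "degraded_response_rate": counters["degraded_responses"] / total,
    }
```

Using the motivating scenario's numbers (40,000 requests, 16 visible failures per day), this yields the 0.04% customer-visible error rate quoted above.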
| Node | What it does | What it receives | What it produces |
| --- | --- | --- | --- |
| Primary Agent | Attempts completion with the highest-quality model in the fleet; records attempt count and any error code on failure | Request payload + attempt count | Completion response or error signal with error type |
| Success Gateway | Evaluates whether the Primary Agent returned a successful response; branches to Response Merger on success, Retry Gate on failure | Agent response or error signal | Routing decision: success path or retry path |
| Retry Gate | Checks whether the attempt count is below the configured retry limit (default: 2); routes back to Primary Agent if retries remain, to Fallback Agent if exhausted | Error signal + attempt count | Routing decision: retry primary or escalate to fallback |
| Fallback Agent | Attempts completion with the next model in the fallback chain (Claude Sonnet or Llama-3-70B); applies the same request payload with any model-specific prompt adjustments | Original request payload + fallback model config | Completion response or failure signal |
| Degraded Response | Returns a cached response for the nearest matching prior request, or a structured graceful-degradation message indicating which capability is unavailable | Original request + response cache | Cached or minimal response with degradation flag |
| Response Merger | Normalizes the response schema regardless of source (primary, fallback, or degraded); attaches metadata indicating which path was taken and model used | Response from any path + source metadata | Normalized response with source tag and latency metrics |
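The routing these nodes describe can be sketched end to end. A minimal Python sketch — `run_with_fallback`, `FallbackError`, and `merge` are illustrative names of mine, not from the source:

```python
class FallbackError(Exception):
    """Raised by a model callable when the attempt fails."""

def run_with_fallback(request, primary, fallbacks, cache, retry_limit=2):
    """Retry-Fallback control flow: primary with retries, then the
    fallback chain, then the degraded-response path.

    `primary` and each entry in `fallbacks` are callables that take the
    request dict and either return a response dict or raise FallbackError.
    `cache` maps prompts to previously returned response dicts.
    """
    # Primary Agent + Success Gateway + Retry Gate: initial try plus retries.
    for attempt in range(1 + retry_limit):
        try:
            return merge(primary(request), source="primary", attempt=attempt)
        except FallbackError:
            continue  # retries remain -> back to Primary Agent

    # Retries exhausted: escalate down the fallback chain.
    for model in fallbacks:
        try:
            return merge(model(request),
                         source=getattr(model, "__name__", "fallback"))
        except FallbackError:
            continue

    # Degraded Response: nearest cached answer, or a graceful-degradation stub.
    cached = cache.get(request.get("prompt"))
    return merge(cached or {"text": "capability temporarily unavailable"},
                 source="degraded", degraded=True)

def merge(response, source, **meta):
    """Response Merger: normalize the schema and attach source metadata."""
    return {"text": response.get("text", ""), "source": source, "meta": meta}
```

Because every path exits through `merge`, callers see one schema regardless of which tier answered — the property the Response Merger exists to guarantee.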

When to Use

Use when
Avoid when

Value Profile

| Origin of Value | Where it appears | How it is captured |
| --- | --- | --- |
| Future Cashflow | SLA preservation | Customer-visible error rate is a direct churn driver in B2B SaaS. Reducing it from 0.8% to 0.04% is not a 95% improvement — it is the difference between SLA compliance and SLA breach. The Retry-Fallback pattern monetizes reliability without requiring the primary model to be more reliable. |
| Risk Exposure | Primary model failure rate | Dependency on a single model provider is a concentration risk. The fallback chain is the hedge. Its value is inverse to primary reliability: when the primary is healthy, fallback consumes near-zero resources; when the primary degrades, fallback absorbs the entire load. |
| Conditional Action | Retry loop compute | Each retry attempt consumes the same compute as the original request. Two retries before fallback can triple the cost of a failed primary request. At 2% failure rate with 2 retries, total cost increase is 4% — acceptable. At 20% failure rate, retries become a significant cost center. |
| Governance | Fallback selection policy | The choice of fallback model encodes a quality floor policy. Using a self-hosted open model as the final tier removes dependence on any external provider and establishes an absolute availability guarantee. This governance decision has security and data-residency implications that must be evaluated at design time. |
VCM analog: Redundant Work Token. The fallback agent earns its position only during primary failure. In steady state it consumes no compute and produces no value — but its existence is what makes the primary's SLA contractually defensible. Value is latent reliability, not throughput. Like a standby node in a distributed system, the fallback's cost is denominated in optionality, not utilization.
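The 4% figure above is a worst-case bound: it assumes every failing request burns its full retry budget, with each attempt costing the same as the original request. A one-line sketch (the function name is mine):

```python
def retry_cost_overhead(failure_rate, retry_limit):
    """Worst-case extra compute as a fraction of baseline spend:
    every failing request is assumed to consume all `retry_limit`
    additional attempts at full per-request cost."""
    return failure_rate * retry_limit
```

At a 2% failure rate with 2 retries this gives 0.04 — the 4% in the table. At a 20% failure rate it gives 0.40, which is the point where retries stop being noise and become a cost center.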

Dynamics and Failure Modes

Thundering herd on model outage

A primary model provider experiences a 15-minute partial outage. Failure rate spikes from 2% to 60%. All 24,000 affected requests immediately retry twice and then hit the fallback agent simultaneously. The fallback agent (Claude Sonnet) was provisioned for 800 requests per hour at steady state — not 24,000 requests in 15 minutes. Fallback latency spikes from 800ms to 45 seconds; many requests time out at the fallback tier. The thundering herd turns a partial primary outage into a near-total service outage. Fix: implement exponential backoff with jitter on retries; use a Circuit Breaker to stop sending to the primary once error rate exceeds a threshold, routing directly to fallback rather than attempting primary at all.
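The first half of the fix — exponential backoff with jitter — is small enough to sketch. The helper name is mine; the full-jitter scheme follows the widely used AWS formulation:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff: sleep a random duration in
    [0, min(cap, base * 2**attempt)] seconds before retry `attempt`.

    The randomness desynchronizes clients, so a provider outage does not
    release every retry at the same instant onto the fallback tier."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A caller would `time.sleep(backoff_delay(attempt))` before each retry. The cap keeps late retries bounded; the jitter — not the exponent — is what prevents the synchronized stampede described above.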

Fallback quality cliff (silent degradation)

The Retry-Fallback chain escalates to Llama-3-70B as the tertiary model. For simple completions, Llama-3-70B performs comparably to GPT-4. For complex multi-step reasoning tasks that constitute 15% of the request mix, it produces subtly wrong outputs that pass structural validation but contain reasoning errors. These errors are invisible to the Response Merger, which only normalizes format. Customers notice incorrect outputs hours later when reviewing AI-generated reports. Fix: log the source model tag in every response. Monitor downstream error rates segmented by source model. If fallback-sourced responses have higher downstream error rates, implement task-type routing: route complex tasks to Fallback Agent only if a simpler fallback model can handle them; otherwise surface a partial failure rather than a confidently wrong answer.

Retry amplification on non-transient errors

A request with a malformed system prompt that triggers a content policy rejection receives a 400 error. The Retry Gate does not distinguish between transient errors (5xx, timeout) and non-transient errors (4xx). It retries the identical request twice, receiving the same 400 error each time, then escalates to the Fallback Agent with the same malformed prompt. The Fallback Agent also rejects it. All three model calls are wasted compute, and the Degraded Response path returns a cached answer that is unrelated to the actual request. Fix: classify errors before the Retry Gate. Non-transient errors (4xx, content policy, token limit exceeded) should bypass the retry loop and route directly to a validation error handler, not to fallback.
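The classification step can be sketched as follows. The helper name is mine; the status-code classes follow standard HTTP semantics, including treating 429 (rate limited) as retryable even though it is a 4xx:

```python
def classify_error(status_code, error_type=None):
    """Route errors *before* the Retry Gate: only transient failures earn
    a retry; deterministic failures go straight to a validation handler
    instead of wasting identical calls on retries and fallback."""
    if error_type in {"timeout", "connection_reset"}:
        return "retry"                    # transient network failure
    if status_code == 429:
        return "retry"                    # rate limited: retry with backoff
    if status_code is not None and 500 <= status_code < 600:
        return "retry"                    # provider-side transient error
    if error_type in {"content_policy", "token_limit_exceeded"}:
        return "validation_error"         # will fail identically on retry
    if status_code is not None and 400 <= status_code < 500:
        return "validation_error"         # malformed request: do not retry
    return "retry"                        # unknown: default to the retry path
```

With this gate in place, the malformed-prompt scenario above costs one model call and one validation error, not three wasted calls and an unrelated cached answer.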

Variants

| Variant | Modification | When to use |
| --- | --- | --- |
| Circuit Breaker | Monitors primary error rate over a rolling window; when it exceeds a threshold, "opens the circuit" — all requests route directly to fallback without attempting primary, until a probe request succeeds and the circuit closes | Provider outage scenarios where retrying primary amplifies load rather than recovering requests; prevents thundering herd at the cost of latency optimization during recovery |
| Fallback Chain | Extends the single fallback to a prioritized list of three or more models (e.g., GPT-4 → Claude Sonnet → Llama-3-70B → cached response); each tier is attempted before escalating to the next | Maximum availability requirements across simultaneous multi-provider outages; self-hosted model as final tier removes all external dependencies |
| Speculative Execution | Fires primary and fallback simultaneously from the start; uses the first successful response and cancels the other; discards the retry logic entirely | Latency is more critical than cost; acceptable to pay 2x compute on every request to eliminate the latency penalty of sequential failure detection and fallback escalation |
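The Speculative Execution variant can be sketched as a thread-pool race. This is an illustrative sketch (model callables and the function name are mine; Python 3.9+ for `cancel_futures`); note that already-running threads cannot truly be cancelled, only abandoned:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def speculative(request, models, timeout=10.0):
    """Fire every model at once; return the first successful result.

    `models` is a list of callables taking the request and returning a
    response, or raising on failure. Returns None if all fail or the
    timeout elapses with nothing finished.
    """
    pool = ThreadPoolExecutor(max_workers=len(models))
    try:
        pending = {pool.submit(m, request) for m in models}
        while pending:
            done, pending = wait(pending, timeout=timeout,
                                 return_when=FIRST_COMPLETED)
            if not done:
                return None  # timed out with nothing finished
            for f in done:
                if f.exception() is None:
                    return f.result()  # first success wins
        return None  # every model raised
    finally:
        # Do not wait for the losers; unstarted work is cancelled,
        # in-flight threads are abandoned.
        pool.shutdown(wait=False, cancel_futures=True)
```

The 2x compute cost is visible here: both calls are paid for on every request, which is exactly the trade the variant accepts in exchange for eliminating sequential failure-detection latency.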

Related Patterns

| Pattern | Relationship |
| --- | --- |
| 10.11 Pipeline | Retry-Fallback is a single-node resilience pattern; pipeline-level retry is more complex because failure at stage N requires re-running from stage N, not from the beginning |
| 10.15 Evaluator-Optimizer | Quality-driven retry (Evaluator rejects output and triggers regeneration) is architecturally similar but semantically distinct — availability-driven retry uses error codes; quality-driven retry uses output scoring |
| 20.23 Orchestrator-Workers | Speculative Execution variant converges with competitive parallel execution — run primary and fallback simultaneously and use the first valid response, which is the core mechanic of competitive evaluation in multi-worker patterns |

Investment Signal

Retry-Fallback is infrastructure, not product. A firm that has built a multi-model fallback chain with self-hosted tertiary models has made a capital investment in availability that competitors cannot easily replicate. The self-hosted tier is particularly valuable: it converts external API risk into internal operational risk, which is controllable.

The Response Merger is the integration point. If the merger produces a normalized schema regardless of source model, downstream consumers are shielded from model-specific output variations. This schema stability is what makes the fallback chain invisible to the application layer — and invisible infrastructure is durable infrastructure.

Due diligence question: can the firm demonstrate the fallback invocation rate and fallback quality delta over the last 90 days? If they cannot segment response quality by source model, they do not know whether their fallback chain is trustworthy or merely available — and those are not the same property.