20.25 Cancelling Discriminator

Fires on the first completing branch, then cancels all remaining active branches immediately. More resource-efficient than 20.24 — no waiting for stragglers — but discards all in-flight work on the losing branches.


Motivating Scenario

A fast-inference routing system dispatches identical prompts to GPT-4, Claude, and Gemini simultaneously via three parallel HTTP connections. The first model to return a complete response wins. The moment that response is received, the other two HTTP connections are immediately cancelled — the sockets are closed, any streaming tokens are discarded, and the API calls are aborted. The router delivers the winning response with no further delay.

The key insight: the losing models' partial work has zero value. Unlike a redundant database write (which cannot be "uncancelled" once committed), an in-flight LLM inference can be aborted at any point — the HTTP connection is stateless from the application's perspective. Cancellation is free, and the latency benefit of not waiting for stragglers is significant. In a routing system where the three models differ by 200..2000ms in response time, the cancelling discriminator captures the full latency advantage of hedged requests with no resource leakage penalty.
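The first-wins-then-cancel flow described above maps naturally onto cooperative task cancellation. A minimal sketch in Python's asyncio, with a stand-in coroutine (hypothetical latencies, no real HTTP) in place of the three provider calls:

```python
import asyncio

async def call_model(name: str, latency: float) -> str:
    """Stand-in for a streaming HTTP inference call (hypothetical endpoint)."""
    await asyncio.sleep(latency)  # simulates time-to-complete-response
    return f"{name}: response"

async def cancelling_discriminator(branches: dict[str, float]) -> str:
    """Dispatch all branches, deliver the first result, cancel the rest."""
    tasks = [asyncio.create_task(call_model(n, lat)) for n, lat in branches.items()]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:               # losers: connections closed, tokens discarded
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)  # reap the cancellations
    return done.pop().result()

# Claude is fastest in this run, so it wins and the other two are cancelled.
winner = asyncio.run(cancelling_discriminator(
    {"gpt-4": 0.30, "claude": 0.05, "gemini": 0.15}))
```

The `gather(..., return_exceptions=True)` step matters: it waits for the cancellations to be acknowledged instead of leaking half-cancelled tasks past delivery.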

Structure

Concrete example: multi-LLM hedged inference router

Key Metrics

| Metric | Signal |
| --- | --- |
| P50 / P95 delivery latency | Primary signal — should track below the fastest individual endpoint's P50 due to the hedging benefit |
| Win rate by model | Which endpoint wins most often — reveals systematic latency asymmetry; if one model wins 90%+, the others add cost without benefit |
| Cancelled token waste ratio | Tokens generated by losing branches as a fraction of total tokens billed — measures the cost efficiency of hedging |
| Cancellation failure rate | Fraction of cancel signals that failed to stop branch execution — non-zero values indicate resource leakage |
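The cancelled token waste ratio can be computed directly from per-branch token counts; a small sketch (the counts are illustrative, not real billing data):

```python
def cancelled_token_waste_ratio(branch_tokens: dict[str, int], winner: str) -> float:
    """Tokens generated by losing branches as a fraction of total tokens billed."""
    total = sum(branch_tokens.values())
    wasted = total - branch_tokens[winner]
    return wasted / total

# Claude won; GPT-4 and Gemini were cancelled after generating partial output.
ratio = cancelled_token_waste_ratio(
    {"gpt-4": 120, "claude": 400, "gemini": 60}, winner="claude")
```

Here the losers generated 180 of 580 billed tokens, a waste ratio of about 0.31.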
| Node | What it does | What it receives | What it produces |
| --- | --- | --- | --- |
| Dispatch Prompt | Fans the prompt to all three model endpoints simultaneously via AND-split. Opens parallel HTTP connections. | Single user prompt | Three concurrent inference requests |
| GPT-4 / Claude / Gemini | Each model executes the inference independently. May be cancelled mid-stream by the discriminator. | Prompt + model-specific parameters | Full response token (if not cancelled) |
| First Response Wins (OR-join) | Fires on the first arriving complete response. Immediately signals cancellation to the remaining two branches. Does not wait. | First arriving model response | Winner response + cancellation signal to losers |
| Deliver Output | Returns the winning response to the caller. Executes without waiting for cancellation to confirm. | Winner response from OR-join | Response to user |

When to Use

| Use when | Avoid when |
| --- | --- |
| Losing branches' partial work has zero value and can be aborted cleanly (e.g., stateless HTTP inference calls) | Branches have side effects that cannot be safely cancelled (use 40.44 Blocking Discriminator) |
| Latency is the primary objective and endpoint response times vary widely | Every branch's output has value worth comparing (use 20.24 Competitive Evaluation) |
| Cancellation is reliable and effectively free under the provider's pricing model | Billing charges for cancelled work make the hedging premium unacceptable (use 10.14 Retry Fallback) |

Value Profile

| Origin of Value | Where it appears | How it is captured |
| --- | --- | --- |
| Future Cashflow | User-facing latency | The discriminator's P50 latency tracks at or below the fastest endpoint's P50. For multi-model routing this is typically 30..60% faster than a single-model P50. |
| Conditional Action | Cancelled branches | Cancelled branches consume partial compute. If providers charge per token on partial completions, the "free cancellation" assumption breaks. Validate the pricing model before deploying. |
| Risk Exposure | OR-join fire condition | If the first arriving response is malformed or truncated, it is delivered as the winner. No quality check runs before delivery. Fix: add a lightweight response validator inside each branch before the response token is emitted to the OR-join. |
| Governance | Cancellation mechanism | The reliability of cancellation is a system-level governance assumption. If cancellation silently fails (the branch keeps running), the pattern degrades to 20.24 without its cleanup tracking. |
Hedged requests at scale. At 1000 requests/minute, a cancelling discriminator over three endpoints triples the dispatched inference load in exchange for latency reduction: each request samples from three models simultaneously. The cost premium is bounded by the fraction of cancelled tokens, which decreases as model latency variance decreases.

Dynamics and Failure Modes

Cancellation confirmation lag

The OR-join fires and signals cancellation, but the losing branches continue processing for 50..200ms before the cancellation signal propagates. During this window, the losing models may complete and attempt to write their results. Fix: the OR-join should hold a "fired" flag that causes any late-arriving branch completions to be silently discarded rather than re-triggering delivery.
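One way to implement the "fired" flag is a join object that accepts exactly one completion and silently discards everything that arrives during the cancellation lag. A minimal thread-safe sketch (class and field names are illustrative):

```python
import threading

class ORJoin:
    """First-wins join: completions after the first are silently discarded."""

    def __init__(self):
        self._lock = threading.Lock()
        self._fired = False
        self.winner = None
        self.discarded = []   # late arrivals, kept only for observability

    def complete(self, branch: str, result: str) -> bool:
        """Return True if this branch won; False if the join had already fired."""
        with self._lock:
            if self._fired:
                self.discarded.append(branch)  # cancellation lagged: drop the result
                return False
            self._fired = True
            self.winner = (branch, result)
            return True

join = ORJoin()
first = join.complete("claude", "fast answer")     # fires the join
late = join.complete("gpt-4", "straggler answer")  # arrives during the lag window
```

Because the flag flips under the lock, a late completion can never re-trigger delivery, only land in `discarded`.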

All branches return errors

All three models return error responses (rate limit, context length exceeded, service unavailable). The OR-join fires on the first error and delivers it as the "winner". Fix: add error classification inside each branch — only success tokens are emitted to the OR-join. Error tokens route to a fallback handler. If all three fail, the fallback handler takes over.
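A sketch of the error-classification fix: errored branches are suppressed rather than fired, and the fallback handler runs only when every branch has failed. The stand-in coroutines and names here are illustrative, not a real provider API:

```python
import asyncio

async def first_success(branch_calls, fallback):
    """Fire the OR-join on the first successful branch; suppress error tokens.
    If every branch errors, hand off to the fallback handler."""
    tasks = {asyncio.create_task(c) for c in branch_calls}
    while tasks:
        done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:          # success token: fire, cancel losers
                for t in tasks:
                    t.cancel()
                await asyncio.gather(*tasks, return_exceptions=True)
                return task.result()
            # error token: suppressed; keep waiting on the remaining branches
    return await fallback()

async def fail(msg: str):
    await asyncio.sleep(0.01)
    raise RuntimeError(msg)

async def ok(answer: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return answer

async def fallback() -> str:
    return "fallback response"

# Two branches error fast; the slower success still wins.
result = asyncio.run(first_success(
    [fail("rate limit"), ok("claude answer", 0.05), fail("503")], fallback))
# All branches error; the fallback handler takes over.
all_failed = asyncio.run(first_success([fail("a"), fail("b")], fallback))
```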

Cost overrun on partial tokens

Provider billing charges per input token even for cancelled requests. With three parallel dispatches, every request is billed 3x on input tokens regardless of which model wins. Fix: route low-latency-expectation queries to a single fast model by default, and only invoke the hedged discriminator when the routing policy flags a high-value or latency-sensitive request.
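The routing-policy fix can be as small as a predicate over request metadata. A sketch where `HIGH_VALUE` and the field names are assumptions for illustration, not a real API:

```python
HIGH_VALUE = 100  # hypothetical threshold in expected value units per request

def route(request: dict) -> list[str]:
    """Hedge only when the request justifies paying input tokens to all
    three providers; otherwise default to a single fast model."""
    if request.get("latency_sensitive") or request.get("value", 0) >= HIGH_VALUE:
        return ["gpt-4", "claude", "gemini"]  # invoke the cancelling discriminator
    return ["claude"]                         # single-model default: 1x input billing

hedged = route({"latency_sensitive": True})
cheap = route({"value": 5})
```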

Winner quality below single-model baseline

The fastest model is consistently the lowest-quality one (e.g., a quantized local model wins 80% of the time). The discriminator optimizes for speed and inadvertently degrades response quality. Fix: add a minimum quality filter before the OR-join fire, or weight the selection function to discount branches from models with historically lower quality scores.
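A sketch of the minimum quality filter: arrivals are considered in arrival order, and the first response that passes a lightweight validator wins; invalid responses are suppressed exactly like errors. The validator here is a placeholder, not a real quality model:

```python
def first_valid(arrivals, validate):
    """Fire on the first arrival that passes the validator; suppress the rest.
    `arrivals` is the stream of (model, response) pairs in arrival order."""
    for model, response in arrivals:
        if validate(response):
            return model, response
    return None  # everything suppressed: fall through to a fallback handler

def validate(response: str) -> bool:
    """Hypothetical lightweight check: reject empty or truncated output."""
    return bool(response) and not response.endswith("…")

# The quantized local model arrives first but is truncated, so it cannot win.
winner = first_valid(
    [("fast-local", "The answer is proba…"),
     ("claude", "The answer is 42.")],
    validate)
```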

Variants

| Variant | Modification | When to use |
| --- | --- | --- |
| Quality-Filtered Discriminator | Branch result passes a lightweight validator before being eligible to win; invalid responses are suppressed | First-arrival alone is not a sufficient selection criterion — minimum quality must be enforced |
| Tiered Hedging | Primary model is dispatched first; backup models are dispatched only after a timeout threshold (e.g., 500ms with no response) | Hedging cost must be minimized; backups dispatch only when the primary is slow |
| Cancelling Discriminator with Audit | Winning response and the identities of cancelled branches are logged before delivery | Model comparison research requires tracking which model would have won each request |
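The Tiered Hedging variant can be sketched with asyncio: shield the primary task through a timeout, and only if it is still running afterwards dispatch the backups and race everything first-wins. Names and delays are illustrative:

```python
import asyncio

async def tiered_hedge(primary, backups, hedge_after: float):
    """Dispatch the primary alone; open the hedge only if it is still
    running after `hedge_after` seconds, then race all branches first-wins."""
    p = asyncio.create_task(primary)
    try:
        # shield() keeps the timeout from cancelling the primary itself
        return await asyncio.wait_for(asyncio.shield(p), timeout=hedge_after)
    except asyncio.TimeoutError:
        pass                                       # primary is slow: hedge now
    tasks = [p] + [asyncio.create_task(b) for b in backups]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()

async def model(name: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return name

# Primary is slow (0.5s), so the backup dispatched at 0.1s wins at ~0.15s.
result = asyncio.run(tiered_hedge(
    model("primary", 0.5), [model("backup", 0.05)], hedge_after=0.1))
```

When the primary answers inside the threshold, the backups are never dispatched at all, so the common case pays no hedging premium.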

Related Patterns

| Pattern | Relationship |
| --- | --- |
| 40.44 Blocking Discriminator | The wait-for-all variant — use when branches have side effects that cannot be safely cancelled |
| 20.24 Competitive Evaluation | Higher-level pattern: runs all branches to completion and selects the best output — use when all responses have value |
| 10.14 Retry Fallback | Sequential fallback — slower but avoids duplicate compute cost entirely; use when the hedging cost is not justified |