Fires on the first completing branch, then cancels all remaining active branches immediately. More resource-efficient than 20.24 — no waiting for stragglers — but discards all in-flight work on the losing branches.
A fast-inference routing system dispatches identical prompts to GPT-4, Claude, and Gemini simultaneously via three parallel HTTP connections. The first model to return a complete response wins. The moment that response is received, the other two HTTP connections are immediately cancelled — the sockets are closed, any streaming tokens are discarded, and the API calls are aborted. The router delivers the winning response with no further delay.
The key insight: the losing models' partial work has zero value. Unlike a redundant database write (which cannot be "uncancelled" once committed), an in-flight LLM inference can be aborted at any point — the HTTP connection is stateless from the application's perspective. Cancellation is free, and the latency benefit of not waiting for stragglers is significant. In a routing system where the three models differ by 200..2000ms in response time, the cancelling discriminator captures the full latency advantage of hedged requests with no resource leakage penalty.
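The race-and-cancel mechanism can be sketched with `asyncio`: an AND-split dispatches all branches as tasks, `asyncio.wait(..., return_when=FIRST_COMPLETED)` acts as the OR-join, and the losers are cancelled the moment a winner appears. `call_model` here is a hypothetical stand-in for the real provider SDK calls; the sleep simulates inference latency.

```python
import asyncio

async def call_model(name: str, prompt: str, delay: float) -> str:
    # Hypothetical stand-in for a provider SDK call; the sleep simulates
    # inference latency. A real branch would hold an open HTTP connection.
    await asyncio.sleep(delay)
    return f"{name}: response to {prompt!r}"

async def cancelling_discriminator(prompt: str, endpoints) -> str:
    # AND-split: dispatch all branches concurrently.
    tasks = [asyncio.create_task(call_model(name, prompt, d)) for name, d in endpoints]
    # OR-join: fire on the first branch to complete.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancel the losers immediately; their in-flight work is discarded.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    return done.pop().result()

winner = asyncio.run(cancelling_discriminator(
    "hello", [("gpt-4", 0.30), ("claude", 0.05), ("gemini", 0.60)]))
# the fastest branch ("claude" in this toy setup) wins; the others are cancelled mid-flight
```

The `gather(..., return_exceptions=True)` after cancelling is optional but makes the losers' teardown deterministic before the winner is delivered.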
| Metric | Signal |
|---|---|
| P50 / P95 delivery latency | Primary signal — should track below the fastest individual endpoint's P50 due to hedging benefit |
| Win rate by model | Which endpoint wins most often — reveals systematic latency asymmetry; if one model wins 90%+, the others add cost without benefit |
| Cancelled token waste ratio | Tokens generated by losing branches as fraction of total tokens billed — measures cost efficiency of hedging |
| Cancellation failure rate | Fraction of cancel signals that failed to stop branch execution — non-zero values indicate resource leakage |
| Node | What it does | What it receives | What it produces |
|---|---|---|---|
| Dispatch Prompt | Fans the prompt to all three model endpoints simultaneously via AND-split. Opens parallel HTTP connections. | Single user prompt | Three concurrent inference requests |
| GPT-4 / Claude / Gemini | Each model executes the inference independently. May be cancelled mid-stream by the discriminator. | Prompt + model-specific parameters | Full response token (if not cancelled) |
| First Response Wins (OR-join) | Fires on the first arriving complete response. Immediately signals cancellation to the remaining two branches. Does not wait. | First arriving model response | Winner response + cancellation signal to losers |
| Deliver Output | Returns the winning response to the caller. Executes without waiting for cancellation to confirm. | Winner response from OR-join | Response to user |
| Origin of Value | Where it appears | How it is captured |
|---|---|---|
| Future Cashflow | User-facing latency | The discriminator's latency is the per-request minimum across endpoints, so its P50 sits at or below the fastest single endpoint's P50. For multi-model routing, this is typically 30..60% faster than single-model P50. |
| Conditional Action | Cancelled branches | Cancelled branches consume partial compute. If providers charge per-token on partial completions, the "free cancellation" assumption breaks. Validate pricing model before deploying. |
| Risk Exposure | OR-join fire condition | If the first arriving response is malformed or truncated, it is delivered as the winner. No quality check runs before delivery. Fix: add a lightweight response validator inside each branch before the response token is emitted to the OR-join. |
| Governance | Cancellation mechanism | The reliability of cancellation is a system-level governance assumption. If cancellation silently fails (a branch keeps running), the pattern degrades to an untracked 20.24: every branch consumes full compute with none of that pattern's cleanup accounting. |
Hedged requests at scale. At 1000 requests/minute, a cancelling discriminator over three endpoints effectively triples the inference capacity applied to latency reduction — each request "samples" from three models simultaneously. The cost premium is bounded by the fraction of cancelled tokens; it shrinks as latency variance grows, because a winner that finishes well ahead cancels the losers before they have generated many tokens, and it approaches its worst case when all three models run neck and neck.
The OR-join fires and signals cancellation, but the losing branches continue processing for 50..200ms before the cancellation signal propagates. During this window, the losing models may complete and attempt to write their results. Fix: the OR-join should hold a "fired" flag that causes any late-arriving branch completions to be silently discarded rather than re-triggering delivery.
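The "fired" flag can be a small guard object in front of delivery. This is a minimal sketch (class and callback names are illustrative, not from the source): the first completion fires the join and delivers; anything that lands during the cancellation-propagation window is discarded without re-triggering delivery.

```python
import threading

class FiredOnceORJoin:
    """OR-join guard: fires for exactly one branch; completions arriving
    during the cancellation-propagation window are silently discarded."""

    def __init__(self, deliver):
        self._fired = False
        self._lock = threading.Lock()
        self._deliver = deliver  # downstream delivery callback

    def branch_completed(self, response) -> bool:
        with self._lock:
            if self._fired:
                return False  # late arrival: discard, do not re-trigger delivery
            self._fired = True
        self._deliver(response)
        return True

delivered = []
join = FiredOnceORJoin(delivered.append)
join.branch_completed("winner")      # fires and delivers
join.branch_completed("straggler")   # arrives 50..200ms later: discarded
```

Taking the lock only for the flag check (not for the delivery call itself) keeps the critical section tiny while still guaranteeing exactly-once delivery.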
All three models return error responses (rate limit, context length exceeded, service unavailable). The OR-join fires on the first error and delivers it as the "winner". Fix: add error classification inside each branch — only success tokens are emitted to the OR-join. Error tokens route to a fallback handler. If all three fail, the fallback handler takes over.
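One way to sketch that error classification, under the same `asyncio` assumptions as above (the `branch` helper and exception name are hypothetical): failed branches are collected rather than allowed to win, the race continues with whoever is left, and only if every branch fails does control pass to the fallback path.

```python
import asyncio

class AllBranchesFailed(Exception):
    """Raised when every model branch errors; a fallback handler
    (not shown) would take over here."""

async def branch(delay: float, result=None, error=None):
    # Hypothetical branch: sleeps to simulate latency, then succeeds or fails.
    await asyncio.sleep(delay)
    if error is not None:
        raise error
    return result

async def discriminator_success_only(tasks):
    # Only success tokens are eligible to win; error tokens are collected
    # and the race continues with the remaining branches.
    pending, errors = set(tasks), []
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:
                for t in pending:
                    t.cancel()  # a success fired the OR-join: cancel the rest
                return task.result()
            errors.append(task.exception())
    raise AllBranchesFailed(errors)

async def main():
    tasks = [
        asyncio.create_task(branch(0.01, error=RuntimeError("rate limit"))),
        asyncio.create_task(branch(0.05, result="ok")),
        asyncio.create_task(branch(0.30, result="slow")),
    ]
    return await discriminator_success_only(tasks)

print(asyncio.run(main()))  # the first *successful* branch wins: "ok"
```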
Provider billing charges per input token even for cancelled requests. With three parallel dispatches, every request is billed 3x on input tokens regardless of which model wins. Fix: route low-latency-expectation queries to a single fast model by default, and only invoke the hedged discriminator when the routing policy flags a high-value or latency-sensitive request.
The fastest model is consistently the lowest-quality one (e.g., a quantized local model wins 80% of the time). The discriminator optimizes for speed and inadvertently degrades response quality. Fix: add a minimum quality filter before the OR-join fire, or weight the selection function to discount branches from models with historically lower quality scores.
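A minimum quality filter can live in a thin wrapper around each branch, assuming some scoring function is available (the wrapper, the toy model, and the length-based score below are all illustrative): a response below the floor is surfaced as a failure, so a fast-but-weak branch can never win on speed alone.

```python
import asyncio

async def quality_gated(call, prompt, score, floor):
    # Hypothetical wrapper: run the branch, then apply a minimum-quality
    # filter before the result becomes eligible to win the OR-join.
    response = await call(prompt)
    if score(response) < floor:
        # Suppress: surface as a failure so this branch cannot win on speed alone.
        raise ValueError(f"response below quality floor {floor}")
    return response

async def cheap_model(prompt):
    return "meh"  # fast but low quality (hypothetical)

score = lambda r: len(r)  # toy quality score: longer is better

try:
    asyncio.run(quality_gated(cheap_model, "q", score, floor=10))
except ValueError:
    print("suppressed")  # the fast-but-weak branch is not allowed to win
```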
| Variant | Modification | When to use |
|---|---|---|
| Quality-Filtered Discriminator | Branch result passes a lightweight validator before being eligible to win; invalid responses are suppressed | First-arrival alone is not a sufficient selection criterion — minimum quality must be enforced |
| Tiered Hedging | Primary model is dispatched first; backup models are dispatched only after a timeout threshold (e.g., 500ms with no response) | Hedging cost must be minimized; backup dispatch only when primary is slow |
| Cancelling Discriminator with Audit | Winning response and the identities of cancelled branches are logged before delivery | Model comparison research requires tracking which model would have won on each request |
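Of the variants above, Tiered Hedging changes the dispatch shape the most, so here is a sketch under the same `asyncio` assumptions (`make_model` and the endpoint names are hypothetical): the primary runs alone until the hedge threshold, and only then are the backups dispatched and raced against it.

```python
import asyncio

def make_model(name: str, delay: float):
    # Hypothetical model endpoint: sleeps to simulate latency, then answers.
    async def call(prompt: str) -> str:
        await asyncio.sleep(delay)
        return name
    return call

async def tiered_hedge(prompt, primary, backups, hedge_after=0.5):
    # Dispatch the primary alone; hedge only if it is silent past the threshold.
    tasks = [asyncio.create_task(primary(prompt))]
    done, _ = await asyncio.wait(tasks, timeout=hedge_after)
    if done:
        return done.pop().result()  # primary answered in time: zero hedging cost
    # Primary is slow: dispatch the backups and race all branches.
    tasks += [asyncio.create_task(b(prompt)) for b in backups]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()  # losers (possibly including the primary) are discarded
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()

fast, slow = make_model("fast", 0.01), make_model("slow", 1.0)
backup = make_model("backup", 0.02)
print(asyncio.run(tiered_hedge("q", fast, [backup], hedge_after=0.2)))   # fast
print(asyncio.run(tiered_hedge("q", slow, [backup], hedge_after=0.05)))  # backup
```

Note that the still-pending primary task stays in the race after the timeout: the backups are added alongside it, not substituted for it.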
| Pattern | Relationship |
|---|---|
| 40.44 Blocking Discriminator | The wait-for-all variant — use when branches have side effects that cannot be safely cancelled |
| 20.24 Competitive Evaluation | Higher-level pattern: runs all branches to completion and selects best output — use when all responses have value |
| 10.14 Retry Fallback | Sequential fallback — slower but avoids duplicate compute cost entirely; use when hedging cost is not justified |