Fires on the first completing branch, then cancels all remaining active branches immediately. More resource-efficient than 20.24 — no waiting for stragglers — but discards all in-flight work on the losing branches.
A fast-inference routing system dispatches identical prompts to GPT-4, Claude, and Gemini simultaneously via three parallel HTTP connections. The first model to return a complete response wins. The moment that response is received, the other two HTTP connections are immediately cancelled — the sockets are closed, any streaming tokens are discarded, and the API calls are aborted. The router delivers the winning response with no further delay.
The key insight: the losing models' partial work has zero value. Unlike a redundant database write (which cannot be "uncancelled" once committed), an in-flight LLM inference can be aborted at any point — the HTTP connection is stateless from the application's perspective. Cancellation is free, and the latency benefit of not waiting for stragglers is significant. In a routing system where the three models differ by 200..2000ms in response time, the cancelling discriminator captures the full latency advantage of hedged requests with no resource leakage penalty.
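The race-and-cancel mechanism can be sketched with `asyncio`: an AND-split dispatches all branches as tasks, `asyncio.wait(..., return_when=FIRST_COMPLETED)` acts as the OR-join, and the losers are cancelled the moment a winner appears. `call_model` here is a hypothetical stand-in for the real provider SDK calls; the sleep simulates inference latency.

```python
import asyncio

async def call_model(name: str, prompt: str, delay: float) -> str:
    # Hypothetical stand-in for a provider SDK call; the sleep simulates
    # inference latency. A real branch would hold an open HTTP connection.
    await asyncio.sleep(delay)
    return f"{name}: response to {prompt!r}"

async def cancelling_discriminator(prompt: str, endpoints) -> str:
    # AND-split: dispatch all branches concurrently.
    tasks = [asyncio.create_task(call_model(name, prompt, d)) for name, d in endpoints]
    # OR-join: fire on the first branch to complete.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancel the losers immediately; their in-flight work is discarded.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    return done.pop().result()

winner = asyncio.run(cancelling_discriminator(
    "hello", [("gpt-4", 0.30), ("claude", 0.05), ("gemini", 0.60)]))
# the fastest branch ("claude" in this toy setup) wins; the others are cancelled mid-flight
```

The `gather(..., return_exceptions=True)` after cancelling is optional but makes the losers' teardown deterministic before the winner is delivered.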
| Metric | Signal |
|---|---|
| P50 / P95 delivery latency | Primary signal — should track below the fastest individual endpoint's P50 due to hedging benefit |
| Win rate by model | Which endpoint wins most often — reveals systematic latency asymmetry; if one model wins 90%+, the others add cost without benefit |
| Cancelled token waste ratio | Tokens generated by losing branches as fraction of total tokens billed — measures cost efficiency of hedging |
| Cancellation failure rate | Fraction of cancel signals that failed to stop branch execution — non-zero values indicate resource leakage |
| Node | What it does | What it receives | What it produces |
|---|---|---|---|
| Dispatch Prompt | Fans the prompt to all three model endpoints simultaneously via AND-split. Opens parallel HTTP connections. | Single user prompt | Three concurrent inference requests |
| GPT-4 / Claude / Gemini | Each model executes the inference independently. May be cancelled mid-stream by the discriminator. | Prompt + model-specific parameters | Full response token (if not cancelled) |
| First Response Wins (OR-join) | Fires on the first arriving complete response. Immediately signals cancellation to the remaining two branches. Does not wait. | First arriving model response | Winner response + cancellation signal to losers |
| Deliver Output | Returns the winning response to the caller. Executes without waiting for cancellation to confirm. | Winner response from OR-join | Response to user |
| Origin of Value | Where it appears | How it is captured |
|---|---|---|
| Future Cashflow | User-facing latency | The discriminator's latency is the per-request minimum across endpoints, so its P50 sits at or below the fastest single endpoint's P50. For multi-model routing, this is typically 30..60% faster than single-model P50. |
| Conditional Action | Cancelled branches | Cancelled branches consume partial compute. If providers charge per-token on partial completions, the "free cancellation" assumption breaks. Validate pricing model before deploying. |
| Risk Exposure | OR-join fire condition | If the first arriving response is malformed or truncated, it is delivered as the winner. No quality check runs before delivery. Fix: add a lightweight response validator inside each branch before the response token is emitted to the OR-join. |
| Governance | Cancellation mechanism | The reliability of cancellation is a system-level governance assumption. If cancellation silently fails (a branch keeps running), the pattern degrades to an untracked 20.24: every branch consumes full compute with none of that pattern's cleanup accounting. |
Hedged requests at scale. At 1000 requests/minute, a cancelling discriminator over three endpoints effectively triples the inference capacity applied to latency reduction — each request "samples" from three models simultaneously. The cost premium is bounded by the fraction of cancelled tokens; it shrinks as latency variance grows, because a winner that finishes well ahead cancels the losers before they have generated many tokens, and it approaches its worst case when all three models run neck and neck.
The OR-join fires and signals cancellation, but the losing branches continue processing for 50..200ms before the cancellation signal propagates. During this window, the losing models may complete and attempt to write their results. Fix: the OR-join should hold a "fired" flag that causes any late-arriving branch completions to be silently discarded rather than re-triggering delivery.
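The "fired" flag can be a small guard object in front of delivery. This is a minimal sketch (class and callback names are illustrative, not from the source): the first completion fires the join and delivers; anything that lands during the cancellation-propagation window is discarded without re-triggering delivery.

```python
import threading

class FiredOnceORJoin:
    """OR-join guard: fires for exactly one branch; completions arriving
    during the cancellation-propagation window are silently discarded."""

    def __init__(self, deliver):
        self._fired = False
        self._lock = threading.Lock()
        self._deliver = deliver  # downstream delivery callback

    def branch_completed(self, response) -> bool:
        with self._lock:
            if self._fired:
                return False  # late arrival: discard, do not re-trigger delivery
            self._fired = True
        self._deliver(response)
        return True

delivered = []
join = FiredOnceORJoin(delivered.append)
join.branch_completed("winner")      # fires and delivers
join.branch_completed("straggler")   # arrives 50..200ms later: discarded
```

Taking the lock only for the flag check (not for the delivery call itself) keeps the critical section tiny while still guaranteeing exactly-once delivery.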
All three models return error responses (rate limit, context length exceeded, service unavailable). The OR-join fires on the first error and delivers it as the "winner". Fix: add error classification inside each branch — only success tokens are emitted to the OR-join. Error tokens route to a fallback handler. If all three fail, the fallback handler takes over.
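One way to sketch that error classification, under the same `asyncio` assumptions as above (the `branch` helper and exception name are hypothetical): failed branches are collected rather than allowed to win, the race continues with whoever is left, and only if every branch fails does control pass to the fallback path.

```python
import asyncio

class AllBranchesFailed(Exception):
    """Raised when every model branch errors; a fallback handler
    (not shown) would take over here."""

async def branch(delay: float, result=None, error=None):
    # Hypothetical branch: sleeps to simulate latency, then succeeds or fails.
    await asyncio.sleep(delay)
    if error is not None:
        raise error
    return result

async def discriminator_success_only(tasks):
    # Only success tokens are eligible to win; error tokens are collected
    # and the race continues with the remaining branches.
    pending, errors = set(tasks), []
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:
                for t in pending:
                    t.cancel()  # a success fired the OR-join: cancel the rest
                return task.result()
            errors.append(task.exception())
    raise AllBranchesFailed(errors)

async def main():
    tasks = [
        asyncio.create_task(branch(0.01, error=RuntimeError("rate limit"))),
        asyncio.create_task(branch(0.05, result="ok")),
        asyncio.create_task(branch(0.30, result="slow")),
    ]
    return await discriminator_success_only(tasks)

print(asyncio.run(main()))  # the first *successful* branch wins: "ok"
```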
Provider billing charges per input token even for cancelled requests. With three parallel dispatches, every request is billed 3x on input tokens regardless of which model wins. Fix: route low-latency-expectation queries to a single fast model by default, and only invoke the hedged discriminator when the routing policy flags a high-value or latency-sensitive request.
The fastest model is consistently the lowest-quality one (e.g., a quantized local model wins 80% of the time). The discriminator optimizes for speed and inadvertently degrades response quality. Fix: add a minimum quality filter before the OR-join fire, or weight the selection function to discount branches from models with historically lower quality scores.
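A minimum quality filter can live in a thin wrapper around each branch, assuming some scoring function is available (the wrapper, the toy model, and the length-based score below are all illustrative): a response below the floor is surfaced as a failure, so a fast-but-weak branch can never win on speed alone.

```python
import asyncio

async def quality_gated(call, prompt, score, floor):
    # Hypothetical wrapper: run the branch, then apply a minimum-quality
    # filter before the result becomes eligible to win the OR-join.
    response = await call(prompt)
    if score(response) < floor:
        # Suppress: surface as a failure so this branch cannot win on speed alone.
        raise ValueError(f"response below quality floor {floor}")
    return response

async def cheap_model(prompt):
    return "meh"  # fast but low quality (hypothetical)

score = lambda r: len(r)  # toy quality score: longer is better

try:
    asyncio.run(quality_gated(cheap_model, "q", score, floor=10))
except ValueError:
    print("suppressed")  # the fast-but-weak branch is not allowed to win
```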
| Variant | Modification | When to use |
|---|---|---|
| Quality-Filtered Discriminator | Branch result passes a lightweight validator before being eligible to win; invalid responses are suppressed | First-arrival alone is not a sufficient selection criterion — minimum quality must be enforced |
| Tiered Hedging | Primary model is dispatched first; backup models are dispatched only after a timeout threshold (e.g., 500ms with no response) | Hedging cost must be minimized; backup dispatch only when primary is slow |
| Cancelling Discriminator with Audit | Winning response and the identities of cancelled branches are logged before delivery | Model comparison research requires tracking which model would have won on each request |
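Of the variants above, Tiered Hedging changes the dispatch shape the most, so here is a sketch under the same `asyncio` assumptions (`make_model` and the endpoint names are hypothetical): the primary runs alone until the hedge threshold, and only then are the backups dispatched and raced against it.

```python
import asyncio

def make_model(name: str, delay: float):
    # Hypothetical model endpoint: sleeps to simulate latency, then answers.
    async def call(prompt: str) -> str:
        await asyncio.sleep(delay)
        return name
    return call

async def tiered_hedge(prompt, primary, backups, hedge_after=0.5):
    # Dispatch the primary alone; hedge only if it is silent past the threshold.
    tasks = [asyncio.create_task(primary(prompt))]
    done, _ = await asyncio.wait(tasks, timeout=hedge_after)
    if done:
        return done.pop().result()  # primary answered in time: zero hedging cost
    # Primary is slow: dispatch the backups and race all branches.
    tasks += [asyncio.create_task(b(prompt)) for b in backups]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()  # losers (possibly including the primary) are discarded
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()

fast, slow = make_model("fast", 0.01), make_model("slow", 1.0)
backup = make_model("backup", 0.02)
print(asyncio.run(tiered_hedge("q", fast, [backup], hedge_after=0.2)))   # fast
print(asyncio.run(tiered_hedge("q", slow, [backup], hedge_after=0.05)))  # backup
```

Note that the still-pending primary task stays in the race after the timeout: the backups are added alongside it, not substituted for it.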
| Pattern | Relationship |
|---|---|
| 40.44 Blocking Discriminator | The wait-for-all variant — use when branches have side effects that cannot be safely cancelled |
| 20.24 Competitive Evaluation | Higher-level pattern: runs all branches to completion and selects best output — use when all responses have value |
| 10.14 Retry Fallback | Sequential fallback — slower but avoids duplicate compute cost entirely; use when hedging cost is not justified |