A multiple-instance (MI) task is forcibly completed: remaining unfinished instances are withdrawn and control passes to subsequent tasks. Trigger: a threshold is met (enough instances have completed) or an external signal declares the task complete. Completed instances are accepted; partial data from the withdrawn instances is not delivered. The MI task as a whole transitions to "done".
An A/B test evaluation system runs 100 parallel user simulation agents — each simulates a single user session under test or control conditions. The evaluation criterion is statistical significance, not full sample completion. After 60 simulations complete, a significance calculator determines that the observed effect size exceeds the pre-registered threshold: the test has enough power to draw a conclusion. The remaining 40 agents are force-completed: their in-progress sessions are terminated and their partial data is not included. Analysis proceeds on the 60 completed simulations.
The key insight: the system does not need all 100 to answer the question. Waiting for the remaining 40 is pure waste — additional data beyond significance does not improve the decision and may introduce recency bias if conditions change. The force-complete trigger is not a failure event; it is a success event. The MI task has achieved its purpose (sufficient statistical power) and should close. Withdrawing the remaining instances is the correct behavior, not a degraded fallback.
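The force-complete semantics described above can be sketched as a small state machine. `MITask` and its fields are illustrative names, not a prescribed API; the 100/60 counts mirror the A/B example:

```python
from dataclasses import dataclass, field

@dataclass
class MITask:
    """Hypothetical MI task with force-complete semantics (sketch)."""
    total: int                                  # instances spawned
    completed: list = field(default_factory=list)
    done: bool = False
    withdrawn: int = 0

    def complete_instance(self, result) -> None:
        if self.done:
            # task already closed: this instance was withdrawn
            raise RuntimeError("MI task is done; instance withdrawn")
        self.completed.append(result)

    def force_complete(self) -> list:
        """Close the task: accept completed instances, withdraw the rest."""
        self.done = True
        self.withdrawn = self.total - len(self.completed)
        # partial data from withdrawn instances is not delivered
        return self.completed

task = MITask(total=100)
for i in range(60):
    task.complete_instance({"session": i})
accepted = task.force_complete()   # 60 accepted, 40 withdrawn, task "done"
```

The key property is that `force_complete` is a success transition: it closes the whole MI task, not just the unfinished instances.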
| Metric | Signal |
|---|---|
| Early stopping rate | Fraction of test runs that force-complete before max instances. High rate means the threshold is well-calibrated for the expected effect size. |
| Instance completion count at force-complete | Average N at threshold trigger. Tracks cost efficiency — lower N means earlier stopping and lower cost per test. |
| Decision reversal rate | Fraction of early-stopped decisions later contradicted by full-N follow-up tests. Non-zero rate quantifies the bias cost of early stopping. |
| Withdrawn instance side-effect rate | Fraction of withdrawn instances that committed data to shared stores before withdrawal. Should be zero — any non-zero count indicates a transactional isolation failure. |
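The first three metrics above can be computed from per-run records. A minimal sketch, assuming a hypothetical record shape of `(completions_at_close, max_instances, early_stopped, reversed_later)`:

```python
# Hypothetical test-run records: (n_at_close, max_n, early_stopped, reversed_later)
runs = [
    (60, 100, True, False),
    (72, 100, True, True),
    (100, 100, False, False),
    (65, 100, True, False),
]

early = [r for r in runs if r[2]]  # runs that force-completed before max_n

# Fraction of runs stopping before max instances
early_stopping_rate = len(early) / len(runs)

# Average N at the threshold trigger (cost-efficiency signal)
mean_n_at_force_complete = sum(r[0] for r in early) / len(early)

# Fraction of early-stopped decisions contradicted by full-N follow-ups
decision_reversal_rate = sum(r[3] for r in early) / len(early)
```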
| Node | What it does | What it receives | What it produces |
|---|---|---|---|
| Launch Simulations | AND-split: spawns 100 User Simulation instances in parallel. Each instance receives a unique user profile and test variant assignment. All 100 are active within the MI task boundary. | Test configuration: variants, user profiles, session parameters | 100 parallel simulation instance tokens |
| User Simulation (×100) | MI task: each instance simulates one user session, recording interaction events, conversion signals, and behavioral metrics. Instances complete asynchronously. Minimum threshold: 60 completions. Maximum: 100. Can be force-completed before max is reached. | User profile, variant assignment, session parameters | Session record: events, conversion outcome, behavioral trace |
| Force Complete | Evaluates whether the significance threshold has been met after each new completion. XOR-split: routes to Statistical Analysis if significance is confirmed (force-close the MI task, withdrawing the remaining instances); routes to A/B Report if all 100 complete naturally (nothing remains to withdraw). | Running completion count + significance test result | Force-complete signal to MI task; completion dataset for analysis or reporting |
| Statistical Analysis | Runs the pre-registered significance test on the completed simulation records. Produces effect size, confidence intervals, and a go/no-go recommendation for the tested variant. | Completed session records (60..99 instances) | Analysis report: effect size, p-value, confidence interval, recommendation |
| A/B Report | XOR-join: receives either the analysis report (early-stopped path) or the natural-completion data (all 100 done path). Formats the final test report with methodology, results, and recommendation. Records whether the test reached natural or early completion. | Analysis report or full simulation dataset | Final A/B test report with decision recommendation |
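The Force Complete node's XOR-split logic in the table above can be sketched as a routing function. The node names and `is_significant` callback are illustrative; the 60/100 bounds come from the example:

```python
def force_complete_node(completed: list, min_n: int, max_n: int,
                        is_significant) -> str:
    """Sketch of the Force Complete XOR-split routing decision."""
    n = len(completed)
    if n >= min_n and is_significant(completed):
        # Early stop: force-close the MI task, withdraw remaining instances
        return "statistical_analysis"
    if n == max_n:
        # Natural completion: all instances done, no withdrawal needed
        return "ab_report"
    # Below the minimum, or not yet significant: keep the MI task open
    return "wait"
```

The minimum-instance check deliberately comes before the significance evaluation, so a significance result can never fire the force-complete below `min_n`.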
| Origin of Value | Where it appears | How it is captured |
|---|---|---|
| Future Cashflow | Statistical Analysis node | Value is the test decision — does this variant ship? The decision quality depends on statistical power, not sample size beyond the significance threshold. Force-completing at 60 produces the same decision quality as completing all 100, while saving the cost of the 40 withdrawn simulations. The cost savings are captured immediately on the threshold trigger. |
| Governance | Force Complete decision node | The stopping threshold is the governance artifact. It encodes the statistical methodology — the significance level, power level, and minimum effect size. This threshold must be registered before the test begins (pre-registration), not determined adaptively from results. Post-hoc threshold setting introduces p-hacking risk — the governance value depends entirely on threshold integrity. |
| Conditional Action | User Simulation instances | Each running instance is compute spend. Early stopping via force-complete reduces the expected cost from 100 simulations to the expected completion count at the significance threshold — often well below 100 for large effect sizes. Cost reduction is the primary economic motivation for the pattern. |
| Risk Exposure | Force Complete (threshold evaluation) | The threshold trigger is evaluated on noisy data. An early significant result may be a false positive — the test effect at 60 completions may regress by completion 100. Force-completing on a false positive produces a wrong decision with high confidence. Pre-registering the threshold is the risk mitigation mechanism, not a formality. |
VCM analog: Early redemption of MI tokens. The 100 simulation instances each hold a work token. When the significance threshold fires, 60 tokens are redeemed for value (completed simulations). The remaining 40 tokens are withdrawn without redemption — their potential value is forfeited in exchange for cost savings and speed. The early redemption rate is the expected fraction of tokens that produce value before the threshold fires.
The Force Complete node evaluates significance after each new completion — 61st, 62nd, 63rd, and so on. Each evaluation is a hypothesis test. Running multiple tests on accumulating data inflates the type-I error rate: with a nominal p < 0.05 threshold, running 40 interim checks yields an effective false positive rate much higher than 5%. The test reports significance when none exists. Fix: use a sequential testing procedure (e.g., SPRT, group sequential design) rather than a fixed-threshold test applied repeatedly. The threshold must account for the number of planned interim analyses.
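The fix named above (SPRT) can be sketched as a single interim-check function using Wald's sequential probability ratio test for a Bernoulli conversion rate. The rates `p0`/`p1` and the `alpha`/`beta` values are illustrative parameters, not prescribed by the source:

```python
import math

def sprt_step(successes: int, n: int, p0: float, p1: float,
              alpha: float = 0.05, beta: float = 0.2) -> str:
    """One interim check of Wald's SPRT for a Bernoulli rate (sketch).

    Unlike a fixed p-value threshold applied repeatedly, the SPRT
    boundaries are valid under continuous monitoring.
    """
    # Log-likelihood ratio of H1 (rate p1) vs H0 (rate p0)
    llr = (successes * math.log(p1 / p0)
           + (n - successes) * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)   # cross: accept H1, effect present
    lower = math.log(beta / (1 - alpha))   # cross: accept H0, futility stop
    if llr >= upper:
        return "force_complete"        # significance confirmed, stop early
    if llr <= lower:
        return "force_complete_null"   # no effect, stop spending
    return "continue"                  # keep collecting completions
```

Calling `sprt_step` after each new completion is safe by construction, because the stopping boundaries already account for the repeated looks at the data.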
A system misconfiguration sets the minimum instance threshold to 1 instead of 60. The first simulation to complete triggers the significance evaluator, which produces a "significant" result with N=1. The MI task is force-completed with a single data point. Fix: minimum instance count must be a hard constraint — the force-complete trigger cannot fire until the minimum has been met, regardless of any significance evaluation result.
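The hard-constraint fix can be expressed as a guard that is evaluated before any significance result is even consulted. `may_force_complete` is a hypothetical name for this check:

```python
def may_force_complete(n_completed: int, min_instances: int,
                       significant: bool) -> bool:
    """Hard constraint: significance alone can never trigger force-complete."""
    if n_completed < min_instances:
        # Guard fires first, regardless of what the evaluator reported
        return False
    return significant
```

Under this guard, the N=1 misconfiguration scenario is blocked: `may_force_complete(1, 60, True)` returns `False` even though the evaluator reported significance.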
Each simulation instance writes events to a shared analytics database as it runs, before the instance completes. The Force Complete signal withdraws 40 instances that have already written partial event streams. The analytics database contains partial session records for 40 users. The Statistical Analysis runs on the 60 completed sessions, but the database has 100 partial or complete session records. Downstream reporting queries find an inconsistency. Fix: simulation instances must not commit to shared stores until the instance completes and the commit is explicitly confirmed. Pre-completion writes must be written to an instance-private staging area.
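The staging-area fix can be sketched as a commit-on-completion discipline. `SimulationInstance` and its methods are illustrative; the shared store is modeled as a plain list standing in for the analytics database:

```python
class SimulationInstance:
    """Sketch: buffer events privately; publish to the shared store only on commit."""

    def __init__(self, shared_store: list):
        self._staging: list = []       # instance-private staging area
        self._shared = shared_store    # visible to downstream analytics queries

    def record_event(self, event: dict) -> None:
        # Pre-completion writes land only in the staging area
        self._staging.append(event)

    def commit(self) -> None:
        """Called only when the instance completes and is accepted."""
        self._shared.extend(self._staging)
        self._staging.clear()

    def withdraw(self) -> None:
        # Withdrawn instances discard partial data; nothing reaches the store
        self._staging.clear()
```

With this discipline, the shared store contains only complete session records, so the withdrawn-instance side-effect rate metric above stays at zero.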
At 59 completions, significance is evaluated but not triggered. Instance 60 completes at the same moment as instances 61..65 — a burst of completions. Significance is triggered on 60, but the force-complete signal arrives at the MI task while instances 61..65 have just committed. The Force Complete node has 60 in scope but the collector receives 65. Fix: force-complete must atomically snapshot the completion count at threshold time. Instances completing after the snapshot are either included (if they arrive before the withdrawal signal is processed) or excluded (withdrawn). The snapshot count must be the authoritative input to the Statistical Analysis, regardless of how many instances ultimately committed.
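The atomic-snapshot fix can be sketched with a lock serializing completions against the snapshot. This sketch implements the exclude policy (completions arriving after the snapshot are withdrawn); the class and method names are illustrative:

```python
import threading

class CompletionCollector:
    """Sketch: snapshot the completion set atomically at threshold time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._completed: list = []
        self._snapshot: list | None = None

    def on_complete(self, record) -> bool:
        with self._lock:
            if self._snapshot is not None:
                return False             # arrived after the snapshot: withdrawn
            self._completed.append(record)
            return True

    def force_complete(self) -> list:
        with self._lock:
            # Atomic snapshot: the authoritative input to Statistical Analysis
            self._snapshot = list(self._completed)
        return self._snapshot
```

Because both `on_complete` and `force_complete` take the same lock, a burst of completions cannot straddle the snapshot: each one is either counted before it or rejected after it.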
| Variant | Modification | When to use |
|---|---|---|
| Majority-Vote Force Complete | Threshold is a majority fraction (e.g., >50%) rather than a statistical significance test. The MI task force-completes when more than half the instances return the same result. | Classification tasks where consensus defines correctness — no statistical analysis is needed; the mode of completed instance results is the output |
| Time-Bounded Force Complete | The MI task force-completes after a wall-clock deadline, accepting however many instances have completed by that time. | SLA-bound tasks where time is the primary constraint — the test must conclude by a deadline regardless of completion count or significance |
| External Signal Force Complete | An external event (not an internal threshold) triggers the force-complete. Example: a competitor launches a product, making the A/B test obsolete — force-complete and use whatever data exists. | Tests that can be made irrelevant by external events. The force-complete is a contingency mechanism for environmental changes, not a normal stopping rule. |
| Cascading Force Complete | Force-completing one MI task triggers force-complete evaluation in a downstream MI task. The stopping rule propagates through the process hierarchy. | Multi-stage experiments where an early result in stage 1 makes stage 2 redundant. Stage 1 force-complete fires stage 2 force-complete automatically. |
| Pattern | Relationship |
|---|---|
| 60.64 Cancel MI Activity | User-initiated cancel of MI instances while preserving completed outputs. 60.65 is threshold-triggered and declares the whole MI task "done"; 60.64 stops further spending but does not close the MI task as complete. |
| 40.43 Structured Discriminator | First-wins semantics for regular (non-MI) parallel splits. Force-complete is the MI-specific analog: threshold-wins, not first-wins. |
| 20.24 Competitive Evaluation | Multiple agents producing competing outputs — the best result wins and the rest are discarded. Combine with 60.65 to force-complete after a quality threshold is met rather than waiting for all agents to finish. |
| 10.15 Evaluator-Optimizer | Closed-loop quality improvement. 60.65 force-completes one MI execution pass; Evaluator-Optimizer loops the force-completed result back for refinement if quality is insufficient. |
Force-complete MI activity is the formal mechanism for early stopping in AI evaluation systems. Sequential testing — running evaluations until a sufficient signal is obtained rather than to a fixed sample — is the standard methodology in modern A/B testing, model evaluation, and reinforcement learning benchmarking. Organizations that implement force-complete correctly can run twice as many experiments with the same compute budget by terminating large-effect tests early and reallocating to new tests. This is a compounding advantage in product iteration speed.
The intellectual property is in the threshold calibration. The stopping rule — what significance level, what effect size, what correction for multiple comparisons — is the algorithmic asset. Organizations that have accumulated data on expected effect distributions for their product category can calibrate thresholds more tightly, stopping earlier on average without increasing false positive rates. This is a proprietary statistical advantage that compounds with experimentation volume.
Red flag: a system that evaluates significance repeatedly without a sequential correction is running biased experiments. Any A/B testing infrastructure that uses a fixed p-value threshold applied after each new data point — without a valid sequential testing methodology — is producing systematically inflated positive rates. This is a common failure mode in ad-hoc agentic evaluation systems that implement "stop when significant" without understanding the statistical implications. Due diligence on evaluation infrastructure should verify the stopping rule methodology before trusting reported significance rates.