A multiple-instance (MI) task is forcibly completed: remaining unfinished instances are withdrawn and control passes to subsequent tasks. Trigger: a threshold is met (enough instances have completed) or an external signal declares the task complete. Completed instances are accepted; partial data from the withdrawn instances is not delivered. The MI task as a whole transitions to "done".
An A/B test evaluation system runs 100 parallel user simulation agents — each simulates a single user session under test or control conditions. The evaluation criterion is statistical significance, not full sample completion. After 60 simulations complete, a significance calculator determines that the observed effect size exceeds the pre-registered threshold: the test has enough power to draw a conclusion. The remaining 40 agents are force-completed: their in-progress sessions are terminated and their partial data is not included. Analysis proceeds on the 60 completed simulations.
The key insight: the system does not need all 100 to answer the question. Waiting for the remaining 40 is pure waste — additional data beyond significance does not improve the decision and may introduce recency bias if conditions change. The force-complete trigger is not a failure event; it is a success event. The MI task has achieved its purpose (sufficient statistical power) and should close. Withdrawing the remaining instances is the correct behavior, not a degraded fallback.
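The force-complete semantics described above can be sketched as a small state machine. `MITask` and its fields are illustrative names, not a prescribed API; the 100/60 counts mirror the A/B example:

```python
from dataclasses import dataclass, field

@dataclass
class MITask:
    """Hypothetical MI task with force-complete semantics (sketch)."""
    total: int                                  # instances spawned
    completed: list = field(default_factory=list)
    done: bool = False
    withdrawn: int = 0

    def complete_instance(self, result) -> None:
        if self.done:
            # task already closed: this instance was withdrawn
            raise RuntimeError("MI task is done; instance withdrawn")
        self.completed.append(result)

    def force_complete(self) -> list:
        """Close the task: accept completed instances, withdraw the rest."""
        self.done = True
        self.withdrawn = self.total - len(self.completed)
        # partial data from withdrawn instances is not delivered
        return self.completed

task = MITask(total=100)
for i in range(60):
    task.complete_instance({"session": i})
accepted = task.force_complete()   # 60 accepted, 40 withdrawn, task "done"
```

The key property is that `force_complete` is a success transition: it closes the whole MI task, not just the unfinished instances.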
| Metric | Signal |
|---|---|
| Early stopping rate | Fraction of test runs that force-complete before max instances. High rate means the threshold is well-calibrated for the expected effect size. |
| Instance completion count at force-complete | Average N at threshold trigger. Tracks cost efficiency — lower N means earlier stopping and lower cost per test. |
| Decision reversal rate | Fraction of early-stopped decisions later contradicted by full-N follow-up tests. Non-zero rate quantifies the bias cost of early stopping. |
| Withdrawn instance side-effect rate | Fraction of withdrawn instances that committed data to shared stores before withdrawal. Should be zero — any non-zero count indicates a transactional isolation failure. |
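The first three metrics above can be computed from per-run records. A minimal sketch, assuming a hypothetical record shape of `(completions_at_close, max_instances, early_stopped, reversed_later)`:

```python
# Hypothetical test-run records: (n_at_close, max_n, early_stopped, reversed_later)
runs = [
    (60, 100, True, False),
    (72, 100, True, True),
    (100, 100, False, False),
    (65, 100, True, False),
]

early = [r for r in runs if r[2]]  # runs that force-completed before max_n

# Fraction of runs stopping before max instances
early_stopping_rate = len(early) / len(runs)

# Average N at the threshold trigger (cost-efficiency signal)
mean_n_at_force_complete = sum(r[0] for r in early) / len(early)

# Fraction of early-stopped decisions contradicted by full-N follow-ups
decision_reversal_rate = sum(r[3] for r in early) / len(early)
```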
| Node | What it does | What it receives | What it produces |
|---|---|---|---|
| Launch Simulations | AND-split: spawns 100 User Simulation instances in parallel. Each instance receives a unique user profile and test variant assignment. All 100 are active within the MI task boundary. | Test configuration: variants, user profiles, session parameters | 100 parallel simulation instance tokens |
| User Simulation (×100) | MI task: each instance simulates one user session, recording interaction events, conversion signals, and behavioral metrics. Instances complete asynchronously. Minimum threshold: 60 completions. Maximum: 100. Can be force-completed before max is reached. | User profile, variant assignment, session parameters | Session record: events, conversion outcome, behavioral trace |
| Force Complete | Evaluates whether the significance threshold has been met after each new completion. XOR-split: routes to Statistical Analysis if significance is confirmed (force-close the MI task, withdrawing the remaining instances); routes to A/B Report if all 100 complete naturally (nothing remains to withdraw). | Running completion count + significance test result | Force-complete signal to MI task; completion dataset for analysis or reporting |
| Statistical Analysis | Runs the pre-registered significance test on the completed simulation records. Produces effect size, confidence intervals, and a go/no-go recommendation for the tested variant. | Completed session records (60..99 instances) | Analysis report: effect size, p-value, confidence interval, recommendation |
| A/B Report | XOR-join: receives either the analysis report (early-stopped path) or the natural-completion data (all 100 done path). Formats the final test report with methodology, results, and recommendation. Records whether the test reached natural or early completion. | Analysis report or full simulation dataset | Final A/B test report with decision recommendation |
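The Force Complete node's XOR-split logic in the table above can be sketched as a routing function. The node names and `is_significant` callback are illustrative; the 60/100 bounds come from the example:

```python
def force_complete_node(completed: list, min_n: int, max_n: int,
                        is_significant) -> str:
    """Sketch of the Force Complete XOR-split routing decision."""
    n = len(completed)
    if n >= min_n and is_significant(completed):
        # Early stop: force-close the MI task, withdraw remaining instances
        return "statistical_analysis"
    if n == max_n:
        # Natural completion: all instances done, no withdrawal needed
        return "ab_report"
    # Below the minimum, or not yet significant: keep the MI task open
    return "wait"
```

The minimum-instance check deliberately comes before the significance evaluation, so a significance result can never fire the force-complete below `min_n`.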
| Origin of Value | Where it appears | How it is captured |
|---|---|---|
| Future Cashflow | Statistical Analysis node | Value is the test decision — does this variant ship? The decision quality depends on statistical power, not sample size beyond the significance threshold. Force-completing at 60 produces the same decision quality as completing all 100, while saving the cost of the 40 withdrawn simulations. The cost savings are captured immediately on the threshold trigger. |
| Governance | Force Complete decision node | The stopping threshold is the governance artifact. It encodes the statistical methodology — the significance level, power level, and minimum effect size. This threshold must be registered before the test begins (pre-registration), not determined adaptively from results. Post-hoc threshold setting introduces p-hacking risk — the governance value depends entirely on threshold integrity. |
| Conditional Action | User Simulation instances | Each running instance is compute spend. Early stopping via force-complete reduces the expected cost from 100 simulations to the expected completion count at the significance threshold — often well below 100 for large effect sizes. Cost reduction is the primary economic motivation for the pattern. |
| Risk Exposure | Force Complete (threshold evaluation) | The threshold trigger is evaluated on noisy data. An early significant result may be a false positive — the test effect at 60 completions may regress by completion 100. Force-completing on a false positive produces a wrong decision with high confidence. Pre-registering the threshold is the risk mitigation mechanism, not a formality. |
VCM analog: Early redemption of MI tokens. The 100 simulation instances each hold a work token. When the significance threshold fires, 60 tokens are redeemed for value (completed simulations). The remaining 40 tokens are withdrawn without redemption — their potential value is forfeited in exchange for cost savings and speed. The early redemption rate is the expected fraction of tokens that produce value before the threshold fires.
The Force Complete node evaluates significance after each new completion — 61st, 62nd, 63rd, and so on. Each evaluation is a hypothesis test. Running multiple tests on accumulating data inflates the type-I error rate: with a nominal p < 0.05 threshold, running 40 interim checks yields an effective false positive rate much higher than 5%. The test reports significance when none exists. Fix: use a sequential testing procedure (e.g., SPRT, group sequential design) rather than a fixed-threshold test applied repeatedly. The threshold must account for the number of planned interim analyses.
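The fix named above (SPRT) can be sketched as a single interim-check function using Wald's sequential probability ratio test for a Bernoulli conversion rate. The rates `p0`/`p1` and the `alpha`/`beta` values are illustrative parameters, not prescribed by the source:

```python
import math

def sprt_step(successes: int, n: int, p0: float, p1: float,
              alpha: float = 0.05, beta: float = 0.2) -> str:
    """One interim check of Wald's SPRT for a Bernoulli rate (sketch).

    Unlike a fixed p-value threshold applied repeatedly, the SPRT
    boundaries are valid under continuous monitoring.
    """
    # Log-likelihood ratio of H1 (rate p1) vs H0 (rate p0)
    llr = (successes * math.log(p1 / p0)
           + (n - successes) * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)   # cross: accept H1, effect present
    lower = math.log(beta / (1 - alpha))   # cross: accept H0, futility stop
    if llr >= upper:
        return "force_complete"        # significance confirmed, stop early
    if llr <= lower:
        return "force_complete_null"   # no effect, stop spending
    return "continue"                  # keep collecting completions
```

Calling `sprt_step` after each new completion is safe by construction, because the stopping boundaries already account for the repeated looks at the data.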
A system misconfiguration sets the minimum instance threshold to 1 instead of 60. The first simulation to complete triggers the significance evaluator, which produces a "significant" result with N=1. The MI task is force-completed with a single data point. Fix: minimum instance count must be a hard constraint — the force-complete trigger cannot fire until the minimum has been met, regardless of any significance evaluation result.
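The hard-constraint fix can be expressed as a guard that is evaluated before any significance result is even consulted. `may_force_complete` is a hypothetical name for this check:

```python
def may_force_complete(n_completed: int, min_instances: int,
                       significant: bool) -> bool:
    """Hard constraint: significance alone can never trigger force-complete."""
    if n_completed < min_instances:
        # Guard fires first, regardless of what the evaluator reported
        return False
    return significant
```

Under this guard, the N=1 misconfiguration scenario is blocked: `may_force_complete(1, 60, True)` returns `False` even though the evaluator reported significance.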
Each simulation instance writes events to a shared analytics database as it runs, before the instance completes. The Force Complete signal withdraws 40 instances that have already written partial event streams. The analytics database contains partial session records for 40 users. The Statistical Analysis runs on the 60 completed sessions, but the database has 100 partial or complete session records. Downstream reporting queries find an inconsistency. Fix: simulation instances must not commit to shared stores until the instance completes and the commit is explicitly confirmed. Pre-completion writes must be written to an instance-private staging area.
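The staging-area fix can be sketched as a commit-on-completion discipline. `SimulationInstance` and its methods are illustrative; the shared store is modeled as a plain list standing in for the analytics database:

```python
class SimulationInstance:
    """Sketch: buffer events privately; publish to the shared store only on commit."""

    def __init__(self, shared_store: list):
        self._staging: list = []       # instance-private staging area
        self._shared = shared_store    # visible to downstream analytics queries

    def record_event(self, event: dict) -> None:
        # Pre-completion writes land only in the staging area
        self._staging.append(event)

    def commit(self) -> None:
        """Called only when the instance completes and is accepted."""
        self._shared.extend(self._staging)
        self._staging.clear()

    def withdraw(self) -> None:
        # Withdrawn instances discard partial data; nothing reaches the store
        self._staging.clear()
```

With this discipline, the shared store contains only complete session records, so the withdrawn-instance side-effect rate metric above stays at zero.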
At 59 completions, significance is evaluated but not triggered. Instance 60 completes at the same moment as instances 61..65 — a burst of completions. Significance is triggered on 60, but the force-complete signal arrives at the MI task while instances 61..65 have just committed. The Force Complete node has 60 in scope but the collector receives 65. Fix: force-complete must atomically snapshot the completion count at threshold time. Instances completing after the snapshot are either included (if they arrive before the withdrawal signal is processed) or excluded (withdrawn). The snapshot count must be the authoritative input to the Statistical Analysis, regardless of how many instances ultimately committed.
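The atomic-snapshot fix can be sketched with a lock serializing completions against the snapshot. This sketch implements the exclude policy (completions arriving after the snapshot are withdrawn); the class and method names are illustrative:

```python
import threading

class CompletionCollector:
    """Sketch: snapshot the completion set atomically at threshold time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._completed: list = []
        self._snapshot: list | None = None

    def on_complete(self, record) -> bool:
        with self._lock:
            if self._snapshot is not None:
                return False             # arrived after the snapshot: withdrawn
            self._completed.append(record)
            return True

    def force_complete(self) -> list:
        with self._lock:
            # Atomic snapshot: the authoritative input to Statistical Analysis
            self._snapshot = list(self._completed)
        return self._snapshot
```

Because both `on_complete` and `force_complete` take the same lock, a burst of completions cannot straddle the snapshot: each one is either counted before it or rejected after it.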
| Variant | Modification | When to use |
|---|---|---|
| Majority-Vote Force Complete | Threshold is a majority fraction (e.g., >50%) rather than a statistical significance test. The MI task force-completes when more than half the instances return the same result. | Classification tasks where consensus defines correctness — no statistical analysis is needed; the mode of completed instance results is the output |
| Time-Bounded Force Complete | The MI task force-completes after a wall-clock deadline, accepting however many instances have completed by that time. | SLA-bound tasks where time is the primary constraint — the test must conclude by a deadline regardless of completion count or significance |
| External Signal Force Complete | An external event (not an internal threshold) triggers the force-complete. Example: a competitor launches a product, making the A/B test obsolete — force-complete and use whatever data exists. | Tests that can be made irrelevant by external events. The force-complete is a contingency mechanism for environmental changes, not a normal stopping rule. |
| Cascading Force Complete | Force-completing one MI task triggers force-complete evaluation in a downstream MI task. The stopping rule propagates through the process hierarchy. | Multi-stage experiments where an early result in stage 1 makes stage 2 redundant. Stage 1 force-complete fires stage 2 force-complete automatically. |
| Pattern | Relationship |
|---|---|
| 60.64 Cancel MI Activity | User-initiated cancel of MI instances while preserving completed outputs. 60.65 is threshold-triggered and declares the whole MI task "done"; 60.64 stops further spending but does not close the MI task as complete. |
| 40.43 Structured Discriminator | First-wins semantics for regular (non-MI) parallel splits. Force-complete is the MI-specific analog: threshold-wins, not first-wins. |
| 20.24 Competitive Evaluation | Multiple agents producing competing outputs — the best result wins and the rest are discarded. Combine with 60.65 to force-complete after a quality threshold is met rather than waiting for all agents to finish. |
| 10.15 Evaluator-Optimizer | Closed-loop quality improvement. 60.65 force-completes one MI execution pass; Evaluator-Optimizer loops the force-completed result back for refinement if quality is insufficient. |
Force-complete MI activity is the formal mechanism for early stopping in AI evaluation systems. Sequential testing — running evaluations until a sufficient signal is obtained rather than to a fixed sample — is the standard methodology in modern A/B testing, model evaluation, and reinforcement learning benchmarking. Organizations that implement force-complete correctly can run twice as many experiments with the same compute budget by terminating large-effect tests early and reallocating to new tests. This is a compounding advantage in product iteration speed.
The intellectual property is in the threshold calibration. The stopping rule — what significance level, what effect size, what correction for multiple comparisons — is the algorithmic asset. Organizations that have accumulated data on expected effect distributions for their product category can calibrate thresholds more tightly, stopping earlier on average without increasing false positive rates. This is a proprietary statistical advantage that compounds with experimentation volume.
Red flag: a system that evaluates significance repeatedly without a sequential correction is running biased experiments. Any A/B testing infrastructure that uses a fixed p-value threshold applied after each new data point — without a valid sequential testing methodology — is producing systematically inflated positive rates. This is a common failure mode in ad-hoc agentic evaluation systems that implement "stop when significant" without understanding the statistical implications. Due diligence on evaluation infrastructure should verify the stopping rule methodology before trusting reported significance rates.