N generator agents produce candidate outputs in parallel; a Judge agent evaluates all candidates and selects the best. Remaining candidates are discarded. Quality improves with N at the cost of N× compute.
A law firm generates first drafts of non-disclosure agreements. First drafts by junior associates take 2 hours each and vary widely in quality: the senior partner acceptance rate is 61%. Three parallel drafters - specialized in IP, employment, and commercial terms respectively - produce drafts simultaneously in 4 minutes. A Judge agent scores each draft against 38 firm-specific criteria. Senior partner acceptance rate for the winning draft: 89%. Total compute cost: 3× that of a single draft. Time saved: 1 hour 56 minutes per NDA.
The key insight: the drafters are not redundant - they represent different coverage strategies. The IP specialist surfaces clause gaps a commercial drafter would miss. The employment specialist flags confidentiality scope issues irrelevant to a pure IP lens. The Judge does not average them; it selects the strongest output against a fixed scoring rubric. Quality gain is not from redundancy but from coverage diversity.
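In code, the core of the pattern reduces to a max over Judge scores. A minimal sketch, assuming a hypothetical interface - `generators` as (name, draft-function) pairs and `judge` as a rubric-scoring callable; none of these names come from a specific framework:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str          # which specialist lane produced the draft
    text: str
    score: float = 0.0   # filled in by the Judge

def competitive_evaluation(brief, generators, judge):
    """Produce N candidate drafts, score each against the rubric,
    and return only the winner; losing candidates are discarded."""
    candidates = [Candidate(name, draft_fn(brief)) for name, draft_fn in generators]
    for c in candidates:
        c.score = judge(c.text)
    return max(candidates, key=lambda c: c.score)
```

Note that the Judge sees only finished candidates; in the base pattern it never feeds back into generation.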
| Metric | Signal |
|---|---|
| Winner acceptance rate vs baseline | Primary quality signal - does competitive evaluation beat single-agent output at the downstream decision point? |
| Quality score distribution | Spread of Judge scores across candidates - measures generator diversity. Tight distribution signals convergence failure. |
| Judge consistency | Agreement rate across identical candidate sets on repeated runs - measures scoring stability, not candidate quality. |
| Cost-per-quality-point | Marginal compute cost of each additional generator divided by the marginal quality gain it delivers - determines optimal N. |
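Both the distribution and cost signals are cheap to compute from run logs. A sketch, with the spread threshold as an illustrative assumption:

```python
import statistics

def diversity_collapsed(judge_scores, min_spread=0.5):
    """Convergence check: if Judge scores cluster tightly, the candidates
    are near-duplicates and the run is paying N x compute for
    single-agent quality. min_spread is an assumed threshold."""
    return statistics.pstdev(judge_scores) < min_spread

def cost_per_quality_point(marginal_cost, marginal_gain_points):
    """Compute cost of one additional generator per acceptance-rate
    point it adds; compare against the value of a point to size N."""
    return marginal_cost / marginal_gain_points
```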
| Node | What it does | What it receives | What it produces |
|---|---|---|---|
| Spawn Drafters | AND-splits the brief to all parallel drafters simultaneously | Draft brief | Brief copy dispatched to each specialist lane |
| IP Specialist | Drafts NDA from IP ownership and patent disclosure angle | Draft brief | IP-focused NDA draft |
| Employment Specialist | Drafts NDA from employee obligations and non-compete angle | Draft brief | Employment-focused NDA draft |
| Commercial Specialist | Drafts NDA from commercial terms and liability angle | Draft brief | Commercial-focused NDA draft |
| Collect Drafts | AND-join: assembles all candidate drafts into a single evaluation set | 3 NDA drafts | Candidate set for judging |
| Judge | Scores each draft against 38 firm-specific criteria, selects the winner | Candidate set + scoring rubric | Winning draft + per-criterion scores + improvement notes |
| Winner Refinement | Refines the winning draft using the Judge's improvement notes | Winning draft + judge feedback | Polished NDA draft ready for partner review |
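Spawn Drafters and Collect Drafts form a plain fan-out / AND-join. A thread-based sketch, with the specialist functions as stand-ins for the agent calls:

```python
from concurrent.futures import ThreadPoolExecutor

def spawn_and_collect(brief, specialists):
    """AND-split the brief to every specialist lane, then AND-join:
    nothing reaches the Judge until all lanes have returned a draft.

    specialists: dict mapping lane name -> draft function (brief -> text).
    """
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        futures = {name: pool.submit(fn, brief) for name, fn in specialists.items()}
        return {name: fut.result() for name, fut in futures.items()}
```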
| Origin of Value | Where it appears | How it is captured |
|---|---|---|
| Future Cashflow | Winner quality vs single-agent baseline | The winning draft's acceptance rate is the primary value signal. Competitive evaluation is only justified when the gap between winner and baseline exceeds the N× compute cost. Measure acceptance rate delta, not average draft quality. |
| Governance | Judge node | The Judge encodes the firm's quality criteria as a scoring rubric. The 38 criteria are the governance layer - they define what "correct" means. The rubric is an organizational asset, not a technical one. Rubric drift is a governance failure, not a model failure. |
| Risk Exposure | Generator diversity | If all generators produce similar drafts, the competitive advantage collapses. Generator diversity failure is the primary risk - the pattern degrades to single-agent performance at 3× cost. Monitor quality score distribution across candidates to detect convergence. |
| Conditional Action | N× parallel compute | Each generator invocation is a cost event. Unlike Pipeline stages, losing candidates are pure cost with no reuse value. Cost is proportional to N regardless of winner margin. Track cost-per-quality-point to determine optimal N. |
VCM analog: Work Token with quality auction. Each generator competes for the single output slot. Only the winning Work Token is realized as value - all others are sunk cost. The Judge is the auctioneer. Increasing N raises the auction field quality but also raises sunk cost linearly. The equilibrium N is where marginal quality gain per additional generator equals marginal compute cost.
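That equilibrium can be found by walking N upward until the marginal trade flips. A sketch, where the quality curve is an assumed input (in practice, measured acceptance rates at each N):

```python
def equilibrium_n(quality_at_n, value_per_point, cost_per_generator, n_max=10):
    """Increase N while the expected value of the marginal quality gain
    still covers the marginal compute cost of one more generator.
    Assumes quality_at_n has diminishing returns in n."""
    n = 1
    while n < n_max:
        marginal_gain = quality_at_n(n + 1) - quality_at_n(n)
        if marginal_gain * value_per_point < cost_per_generator:
            break
        n += 1
    return n
```

With an illustrative curve that saturates toward a ceiling, the walk stops exactly where the next generator would run at a loss.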
The Judge is trained or prompted with examples drawn from the same distribution as the generators. Selection stops tracking quality - the Judge scores drafts highly when they match its own generation style, not when they best satisfy client criteria. The firm observes 89% acceptance rate in testing but 62% in production, where partner preferences differ from the Judge's training set. Fix: ground the Judge's rubric in explicit, externally validated criteria (e.g., partner review history, precedent clause libraries). Never use generator outputs to calibrate the Judge.
Under compute-budget pressure, the system is reconfigured to run a single generator before the Judge. The Judge still runs - but it now selects the only candidate, making the pattern functionally identical to a plain pipeline with an extra scoring step. Quality reverts to single-agent baseline, but the Judge overhead remains. Fix: if compute must be cut, remove the Judge entirely rather than reducing N to 1. A Judge evaluating one candidate adds cost without competitive gain.
N× cost growth is not tracked against quality improvement. The firm adds a fourth specialist generator to improve coverage, observing a 2-point acceptance rate lift - but the marginal compute cost of the fourth generator exceeds the value of 2 additional accepted NDAs per month. The pattern is running at a loss per draft but no one has measured it. Fix: track cost-per-quality-point as a live operational metric. Each additional generator must clear a cost-benefit threshold before being added to the pool.
The Judge selects a different winning draft on identical inputs across runs due to non-deterministic scoring (temperature > 0). On re-run, the IP specialist wins; on the original run, the commercial specialist won. The winning draft accepted by the partner cannot be reproduced from the same brief. Fix: run the Judge at temperature 0 for production scoring. If scoring is inherently subjective and non-deterministic, implement a majority-vote Judge (run the Judge k times on the same candidate set and select the most frequently chosen winner, keeping k distinct from the generator count N).
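The majority-vote fix is a few lines: re-run the Judge on the same frozen candidate set and take the modal winner. The value of k and the tie-break rule are assumptions here:

```python
from collections import Counter

def majority_vote_judge(candidates, judge, k=5):
    """Stabilize a non-deterministic Judge: score the identical candidate
    set k times and return the candidate selected most often. Use an odd
    k to reduce ties; Counter breaks remaining ties by first appearance
    in the vote sequence."""
    votes = Counter(judge(candidates) for _ in range(k))
    return votes.most_common(1)[0][0]
```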
| Variant | Modification | When to use |
|---|---|---|
| Knockout Tournament | 8 generators produce candidates; 4 judges pair-compare to yield 4 finalists; 2 judges yield 2; final judge selects winner | Large candidate pools where scoring all N against each other is cost-prohibitive; pairwise comparison is cheaper than absolute scoring |
| Diverse Sampling | Same model runs N times with different temperatures or system prompts rather than different specialist models | No specialist models available; diversity is achieved through stochastic variation rather than architectural specialization |
| Best-of-N with Threshold | Judge runs after each generator; if any candidate exceeds the quality threshold, stop and skip remaining generators | Early stopping is acceptable; first-qualifying candidate is sufficient; reduces expected compute cost when quality threshold is reachable early |
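The threshold variant trades the AND-join for sequential early exit. A sketch, with the threshold and the fall-back-to-best behavior as assumptions:

```python
def best_of_n_with_threshold(brief, generators, judge, threshold):
    """Run generators one at a time; return the first draft whose Judge
    score clears the threshold, skipping the rest. If none qualifies,
    fall back to the best draft seen - expected compute drops whenever
    the threshold is reachable early."""
    best = None
    for draft_fn in generators:
        draft = draft_fn(brief)
        score = judge(draft)
        if score >= threshold:
            return draft, score   # early stop: remaining generators never run
        if best is None or score > best[1]:
            best = (draft, score)
    return best
```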
| Pattern | Relationship |
|---|---|
| 10.15 Evaluator-Optimizer | Iterative improvement vs competitive selection - Evaluator-Optimizer refines one candidate through critique cycles; Competitive Evaluation selects the best from N independent attempts |
| 20.23 Orchestrator-Workers | The Orchestrator dispatches workers to independent tasks; in Competitive Evaluation the Judge is a specialized Orchestrator whose only task is selection among parallel outputs |
| 20.25 Consensus | Agreement required vs winner selection - Consensus accepts output when M validators agree on it; Competitive Evaluation selects the best candidate regardless of inter-generator agreement |
The Judge's rubric is the organizational IP. Generators are commodity compute - replaceable with any model that produces the right output format. The rubric encodes the firm's quality standards: what constitutes an acceptable NDA, what clauses are non-negotiable, what weightings reflect client risk tolerance. This rubric is built from years of partner review decisions and is not transferable to a competitor without the underlying case history.
Acquirers should evaluate the rubric quality, not the generator quality. A firm running Competitive Evaluation with a weak rubric is paying 3× compute for random selection. A firm with a well-grounded rubric is paying 3× compute for a measurable quality lift with an auditable decision mechanism.
Red flag: a competitive evaluation system with no rubric versioning history cannot demonstrate that quality improvements over time are attributable to rubric refinement vs. model upgrades vs. base rate changes in incoming work. Without separating these signals, quality attribution is impossible and the system's improvement trajectory cannot be priced.