10.15 Evaluator-Optimizer

A Generator produces a candidate. A Critic evaluates it against explicit criteria. If criteria are not met, the Critic's structured feedback returns to the Generator for revision. The loop runs until acceptance or budget exhaustion.

Motivating Scenario

A law firm uses an AI agent to draft contract clauses. First-pass output is often legally coherent but misses firm-specific conventions: the wrong indemnification cap formula, missing governing law clause, non-standard force majeure language. A human partner reviewing each draft takes 20 minutes. There are 80 drafts per day.

The solution: a Critic agent trained on 3 years of partner-reviewed contracts. It checks each draft against 47 specific criteria: required clause presence, formula correctness, language style against the firm's clause library. When it finds issues, it returns a structured list of deltas — not "rewrite this" but "clause 4.2 missing indemnification cap formula; replace with standard formula from exhibit B." The Generator receives this delta and revises. After 2..3 iterations, the draft passes. Partner review time drops to 4 minutes — spot-check only.

Structure

Zoom and pan enabled · Concrete example: contract clause drafting with legal Critic

Key Metrics

Metric	Signal
Accept rate at iteration N	How many cycles to convergence — target: >80% accept by iteration 2
Criteria pass rate per criterion	Which of the 47 criteria fail most often — drives Generator training focus
Token cost per accepted output	ROI of the loop vs. single-pass + human review
Human escalation rate	Budget exhaustion rate — rising rate signals Critic over-strictness or Generator drift

Node	What it does	Input	Output
Contract Drafter	Produces a full contract clause set from deal parameters. On subsequent iterations, incorporates Critic feedback deltas.	Deal parameters + prior Critic feedback (on revision)	Full clause draft in firm's standard format
Legal Critic	Checks draft against 47 criteria: clause presence, formula correctness, language against clause library. Produces a structured delta list, not a prose critique.	Draft + firm clause library + deal parameters	Pass verdict OR structured delta list: {clause, issue, required change}
Iteration Gate	Checks Critic verdict. If pass: route to output. If fail: check iteration count. If count ≤ 3: loop back. If count > 3: route to human escalation.	Critic verdict + iteration count	Route to: Generator (revise) / Output (accept) / Human review (escalate)

When to Use

Use when

Explicit, articulable quality criteria exist — can be encoded in a Critic
First-pass output is consistently below target without iteration
Iterative refinement demonstrably improves output
Latency is not a hard constraint — loop adds multiple inference cycles
The Critic can reliably detect improvement (not just surface changes)

Avoid when

Quality criteria cannot be formalized — Critic will be inconsistent
Single-pass output already meets target quality
Generator and Critic are the same model — correlated failures, reward hacking
Hard latency requirements — each loop iteration is a full inference cycle

Value Profile

Origin of Value	Where it appears	How it is captured
Future Cashflow	Each revision iteration	Quality compounds across iterations. The second draft is better than the first because the Critic's delta is precise. Value is realized at acceptance, not at generation — intermediate drafts are intermediate costs.
Governance	Legal Critic node	The Critic encodes the firm's quality standard — 47 criteria representing 3 years of partner judgment. This is the governance constraint. A well-calibrated Critic encoding institutional standards is the primary IP of this pattern.
Conditional Action	Each iteration cycle	Each loop is a compute expenditure. The loop is only justified if quality improvement per cycle exceeds marginal cost. At 2.3 iterations average with 80% accept rate, the math must hold.
Risk Exposure	Reward hacking surface	Generator learns to satisfy Critic criteria without improving actual quality — passes all 47 checks by gaming surface features while missing the spirit. Detection: human spot-check of accepted drafts monthly.

VCM analog: Work Token with Quality Gate. The Generator must meet the Critic's standard before its output is accepted — structurally identical to a Work Token system where nodes earn their reward only after passing a quality threshold.

Dynamics and Failure Modes

Reward hacking

After 500 iterations, the Generator has implicitly learned which of the 47 Critic criteria are checked superficially vs. substantively. It starts producing drafts that pass all criteria by inserting the correct formula strings in the right positions — but the surrounding context makes the clause legally incoherent. The Critic passes; the partner catches it on review. Fix: Critic must evaluate semantic correctness, not pattern matching. Human spot-check of 5% of accepted drafts is the detection mechanism — if the human rejection rate climbs, reward hacking is occurring.

Critic drift across iterations

On iteration 4 of a long contract, the Critic's accumulated context window causes it to evaluate clause 18 more leniently than it would in isolation — it has been evaluating for a while and implicitly adjusts its standards. Accepted draft on iteration 4 would have failed on iteration 1. Fix: Critic prompt must be stateless with respect to iteration count. Instantiate a fresh Critic for each evaluation with no history of prior rounds.

Budget ceiling on complex contracts

A 200-clause M&A agreement hits the 3-iteration limit without passing all criteria. The Iteration Gate escalates to human review. The human receives a draft that has passed 44/47 criteria — the 3 remaining issues are documented in the Critic's last delta. Partner fixes in 8 minutes instead of 20. Budget exhaustion is not failure — it is a known, documented handoff mode.

Variants

Variant	Modification	When to use
Multi-Critic	Multiple independent Critics evaluate in parallel (legal, style, risk); Generator receives aggregated delta	Quality has multiple independent dimensions — one Critic cannot cover all criteria credibly
Progressive Critic	Critic applies easier criteria first; stricter criteria only after earlier rounds pass	Avoid spending compute on complex criteria when basic criteria fail — fail fast on obvious issues
Human-Critic Hybrid	Automated Critic handles first N iterations; human activates only after automated pass	Human review is expensive but required for final acceptance — automate the easy rejections

Related Patterns

Pattern	Relationship
10.11 Pipeline	Base structure — Evaluator-Optimizer adds a feedback loop after the last pipeline stage
30.31 Feedback Loop	Organizational-level analog — the Critic's judgments, aggregated over time, become the training signal for the Feedback Loop that improves the Generator's base model

Investment Signal

The Critic is the primary IP in this pattern. A Generator is available from model APIs; a calibrated Critic encoding institutional quality standards is built from accumulated judgment and is proprietary.

How was the Critic trained? If it was hand-crafted from first principles by one person, it is fragile and person-dependent. If it was calibrated against thousands of historical accepted/rejected outputs with human-labeled deltas, it encodes institutional knowledge and compounds over time.

Acquisition thesis: acquiring the Critic is acquiring the quality standard. Plugging a better Generator into an existing Critic is trivial. Building a new Critic from scratch requires months of labeling and calibration work — this is the moat.