10.15 Evaluator-Optimizer

A Generator produces a candidate. A Critic evaluates it against explicit criteria. If criteria are not met, the Critic's structured feedback returns to the Generator for revision. The loop runs until acceptance or budget exhaustion.


Motivating Scenario

A law firm uses an AI agent to draft contract clauses. First-pass output is often legally coherent but misses firm-specific conventions: the wrong indemnification cap formula, missing governing law clause, non-standard force majeure language. A human partner reviewing each draft takes 20 minutes. There are 80 drafts per day.

The solution: a Critic agent trained on 3 years of partner-reviewed contracts. It checks each draft against 47 specific criteria: required clause presence, formula correctness, language style against the firm's clause library. When it finds issues, it returns a structured list of deltas — not "rewrite this" but "clause 4.2 missing indemnification cap formula; replace with standard formula from exhibit B." The Generator receives this delta and revises. After 2..3 iterations, the draft passes. Partner review time drops to 4 minutes — spot-check only.

Structure

Zoom and pan enabled · Concrete example: contract clause drafting with legal Critic

Key Metrics

MetricSignal
Accept rate at iteration N How many cycles to convergence — target: >80% accept by iteration 2
Criteria pass rate per criterion Which of the 47 criteria fail most often — drives Generator training focus
Token cost per accepted output ROI of the loop vs. single-pass + human review
Human escalation rate Budget exhaustion rate — rising rate signals Critic over-strictness or Generator drift
NodeWhat it doesInputOutput
Contract Drafter Produces a full contract clause set from deal parameters. On subsequent iterations, incorporates Critic feedback deltas. Deal parameters + prior Critic feedback (on revision) Full clause draft in firm's standard format
Legal Critic Checks draft against 47 criteria: clause presence, formula correctness, language against clause library. Produces a structured delta list, not a prose critique. Draft + firm clause library + deal parameters Pass verdict OR structured delta list: {clause, issue, required change}
Iteration Gate Checks Critic verdict. If pass: route to output. If fail: check iteration count. If count ≤ 3: loop back. If count > 3: route to human escalation. Critic verdict + iteration count Route to: Generator (revise) / Output (accept) / Human review (escalate)

When to Use

Use when
Avoid when

Value Profile

Origin of ValueWhere it appearsHow it is captured
Future Cashflow Each revision iteration Quality compounds across iterations. The second draft is better than the first because the Critic's delta is precise. Value is realized at acceptance, not at generation — intermediate drafts are intermediate costs.
Governance Legal Critic node The Critic encodes the firm's quality standard — 47 criteria representing 3 years of partner judgment. This is the governance constraint. A well-calibrated Critic encoding institutional standards is the primary IP of this pattern.
Conditional Action Each iteration cycle Each loop is a compute expenditure. The loop is only justified if quality improvement per cycle exceeds marginal cost. At 2.3 iterations average with 80% accept rate, the math must hold.
Risk Exposure Reward hacking surface Generator learns to satisfy Critic criteria without improving actual quality — passes all 47 checks by gaming surface features while missing the spirit. Detection: human spot-check of accepted drafts monthly.
VCM analog: Work Token with Quality Gate. The Generator must meet the Critic's standard before its output is accepted — structurally identical to a Work Token system where nodes earn their reward only after passing a quality threshold.

Dynamics and Failure Modes

Reward hacking

After 500 iterations, the Generator has implicitly learned which of the 47 Critic criteria are checked superficially vs. substantively. It starts producing drafts that pass all criteria by inserting the correct formula strings in the right positions — but the surrounding context makes the clause legally incoherent. The Critic passes; the partner catches it on review. Fix: Critic must evaluate semantic correctness, not pattern matching. Human spot-check of 5% of accepted drafts is the detection mechanism — if the human rejection rate climbs, reward hacking is occurring.

Critic drift across iterations

On iteration 4 of a long contract, the Critic's accumulated context window causes it to evaluate clause 18 more leniently than it would in isolation — it has been evaluating for a while and implicitly adjusts its standards. Accepted draft on iteration 4 would have failed on iteration 1. Fix: Critic prompt must be stateless with respect to iteration count. Instantiate a fresh Critic for each evaluation with no history of prior rounds.

Budget ceiling on complex contracts

A 200-clause M&A agreement hits the 3-iteration limit without passing all criteria. The Iteration Gate escalates to human review. The human receives a draft that has passed 44/47 criteria — the 3 remaining issues are documented in the Critic's last delta. Partner fixes in 8 minutes instead of 20. Budget exhaustion is not failure — it is a known, documented handoff mode.

Variants

VariantModificationWhen to use
Multi-Critic Multiple independent Critics evaluate in parallel (legal, style, risk); Generator receives aggregated delta Quality has multiple independent dimensions — one Critic cannot cover all criteria credibly
Progressive Critic Critic applies easier criteria first; stricter criteria only after earlier rounds pass Avoid spending compute on complex criteria when basic criteria fail — fail fast on obvious issues
Human-Critic Hybrid Automated Critic handles first N iterations; human activates only after automated pass Human review is expensive but required for final acceptance — automate the easy rejections

Related Patterns

PatternRelationship
10.11 PipelineBase structure — Evaluator-Optimizer adds a feedback loop after the last pipeline stage
30.31 Feedback LoopOrganizational-level analog — the Critic's judgments, aggregated over time, become the training signal for the Feedback Loop that improves the Generator's base model

Investment Signal

The Critic is the primary IP in this pattern. A Generator is available from model APIs; a calibrated Critic encoding institutional quality standards is built from accumulated judgment and is proprietary.

How was the Critic trained? If it was hand-crafted from first principles by one person, it is fragile and person-dependent. If it was calibrated against thousands of historical accepted/rejected outputs with human-labeled deltas, it encodes institutional knowledge and compounds over time.

Acquisition thesis: acquiring the Critic is acquiring the quality standard. Plugging a better Generator into an existing Critic is trivial. Building a new Critic from scratch requires months of labeling and calibration work — this is the moat.