10.16 Staged Pipeline with Per-Stage Feedback

Sequential pipeline stages each have an embedded Critic agent. The Critic can accept the stage output (advance), request a retry within the stage, or route backward to a prior stage. The result is a set of multi-stage quality gates whose routing decisions can cross stage boundaries.

Motivating Scenario

A software development agent system builds features from specs. Without per-stage feedback, code generation proceeds with no inline review, and 40% of features require rework after integration testing. The rework is discovered too late - by the time the full pipeline has executed, the error is expensive to fix.

With staged feedback: the Planner output is reviewed by a Plan Critic, which catches infeasible specs before coding begins. The Coder output is reviewed by a Code Critic, which catches bugs before the test suite runs. The Test Runner output is reviewed by a Test Critic, which catches coverage gaps before deployment. Result: an 8% rework rate post-deployment - 2.3x more pipeline stage executions, but 3.1x less total rework compute. Each Critic is a quality gate that shifts defect detection left: the earlier in the pipeline a bug is caught, the cheaper it is to fix.

Structure

Concrete example: software development agent pipeline with per-stage critics

Key Metrics

| Metric | Signal |
| --- | --- |
| Mean iterations per stage | Baseline efficiency signal - a rising mean at any stage indicates Critic over-strictness or Generator quality degradation at that stage |
| Cross-stage backtrack rate | % of critic decisions that route to a prior stage. Target ≤15%; above 25% indicates systemic upstream quality failure, not localized errors |
| Stage completion rate within budget | % of features completing each stage within the retry budget. A low completion rate signals critic calibration problems or generator quality regression |
| Per-stage critic accuracy | Does critic rejection predict downstream failure? Measure by tracking whether critic-passed outputs fail at later stages - low predictive accuracy means the critic is not catching the right defect class |
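These metrics can be computed directly from a critic decision log. A minimal sketch in Python - the `Decision` record and verdict names are illustrative assumptions, not part of the pattern:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Decision:
    stage: str    # stage at which the critic ran, e.g. "plan", "code", "test"
    verdict: str  # "accept" (advance), "retry" (redo in-stage), "backtrack" (prior stage)

def backtrack_rate(decisions: List[Decision]) -> float:
    """Fraction of critic decisions that route to a prior stage (target <= 0.15)."""
    if not decisions:
        return 0.0
    back = sum(1 for d in decisions if d.verdict == "backtrack")
    return back / len(decisions)
```

The same log supports the other metrics: group by `stage` for mean iterations, and join critic accepts against later-stage failures for per-stage critic accuracy.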
| Node | What it does | What it receives | What it produces |
| --- | --- | --- | --- |
| Planner | Decomposes feature spec into implementation plan: tasks, interfaces, acceptance criteria. On retry, incorporates Plan Critic feedback. | Feature spec (+ Plan Critic feedback on retry) | Structured implementation plan: task list, interface contracts, test criteria |
| Plan Critic | Checks plan feasibility: missing interfaces, ambiguous acceptance criteria, underspecified edge cases. Routes forward on accept, back to Planner on redo, or flags spec-level issues for upstream escalation. | Implementation plan + spec + code standards | Accept verdict OR structured feedback: {issue, location, required change} |
| Coder | Implements the plan. On retry from Code Critic, incorporates specific issue list. On back-route from Code Critic, waits for revised plan from Planner. | Implementation plan (+ Code Critic feedback on retry) | Code artifact: implementation + inline documentation |
| Code Critic | Checks code for correctness, style compliance, test coverage hooks. Does not run code - static analysis only. Routes forward on accept, back to Coder on redo, or back to Planner if plan is root cause. | Code artifact + implementation plan + coding standards | Accept verdict OR structured issue list with routing decision: {redo \| back-to-plan} |
| Test Runner | Executes test suite against code. Generates coverage report and failure trace. On retry, re-runs with expanded test scope per Test Critic guidance. | Code artifact + test criteria from plan | Test report: pass/fail per criterion, coverage percentage, failure traces |
| Test Critic | Checks test coverage against acceptance criteria. Identifies untested paths and insufficient assertions. Routes forward on accept, back to Test Runner for expanded coverage, or back to Coder if tests reveal implementation bugs. | Test report + acceptance criteria + coverage targets | Accept verdict OR routing decision: {redo-test \| back-to-code} + specific gap list |
| Deploy Agent | Packages and deploys the feature. Emits deployment confirmation with artifact reference. | Approved code artifact + approved test report | Deployed feature: artifact ID, deployment timestamp, rollback reference |
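The routing semantics in the node descriptions above reduce to a driver loop over the stage sequence. A minimal sketch, assuming a shared state dict and a `(verdict, target)` tuple convention for critic output; retry and cross-stage budgets, which a production version needs, are deliberately omitted here:

```python
from typing import Callable, Dict, Optional, Tuple

# A critic returns ("accept", None), ("redo", None),
# or ("back", "<prior stage name>"). Convention is illustrative.
Verdict = Tuple[str, Optional[str]]

STAGES = ["plan", "code", "test", "deploy"]

def run_pipeline(workers: Dict[str, Callable[[dict], object]],
                 critics: Dict[str, Callable[[dict], Verdict]],
                 spec: dict) -> dict:
    """Run stages in order; critics gate each transition."""
    state: dict = {"spec": spec}
    i = 0
    while i < len(STAGES):
        stage = STAGES[i]
        state[stage] = workers[stage](state)   # worker sees all prior outputs
        critic = critics.get(stage)            # deploy has no critic
        if critic is None:
            i += 1
            continue
        verdict, target = critic(state)
        if verdict == "accept":
            i += 1                             # advance to next stage
        elif verdict == "redo":
            continue                           # retry within the same stage
        else:                                  # "back": route to a prior stage
            i = STAGES.index(target)
    return state
</ ```

Note that each worker and critic receives the whole `state` dict, which is the context-sharing discipline the stage-isolation failure mode below argues for.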

When to Use

Use when
Avoid when

Value Profile

| Origin of Value | Where it appears | How it is captured |
| --- | --- | --- |
| Future Cashflow | Output quality vs. pipeline cost | Value realized as rework reduction post-deployment. The economic case requires that critic compute cost at each stage is less than the expected rework cost it prevents. At 8% vs 40% post-deployment rework rate, the 3.1x compute saving is the primary value argument. |
| Governance | Each Critic node | Each Critic encodes stage-specific quality policy. The Plan Critic enforces planning standards; the Code Critic enforces coding standards; the Test Critic enforces coverage policy. The Critic ensemble is the organization's quality governance layer - changes to standards are changes to Critics. |
| Risk Exposure | Cross-stage backtrack rate | Too-high backtrack rate means the system is unreliable and expensive. Too-low backtrack rate means Critics are calibrated too loosely and defects pass through. Target: ≤15% cross-stage backtracks; ≤30% within-stage retries. Rates outside these bounds signal critic calibration problems. |
| Conditional Action | Critic compute at every stage | Critics are always-on cost. Unlike the pipeline workers (which are fixed per execution), every critic runs on every output - including good outputs. Critic cost is proportional to output volume, not defect rate. In high-throughput systems, critic compute can exceed worker compute. |
VCM analog: Work Token chain with quality gates. Each stage transition requires a Critic co-signature. The Critic is a staking mechanism - it puts its judgment on the line at each gate. A Critic that consistently passes defective outputs is a staker that consistently approves bad proposals - its authority should diminish over time.

Dynamics and Failure Modes

Cross-stage loop explosion

The Code Critic, finding a bug in the implementation, routes back to the Planner (judging the plan as root cause). The Planner produces a revised plan. The Coder implements it. The Code Critic finds a different bug and routes back to the Planner again. Without an iteration budget spanning the full cross-stage path, this creates an unbounded loop that never converges. Fix: implement a cross-stage iteration counter that persists across stage boundaries. A feature that has been routed backward across stages more than N times (e.g., 3) is escalated to human review - it is not a routing problem, it is a spec problem that no automated loop can resolve.
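The fix above hinges on the counter living on the feature record, so that it persists across stage boundaries rather than resetting when routing changes stages. A sketch - the field name, function name, and escalation sentinel are assumptions for illustration:

```python
MAX_CROSS_STAGE = 3  # the N from the text; beyond this, escalate to human review

def route_backward(feature: dict, target_stage: str) -> str:
    """Allow a backward route, or force human escalation once the
    cross-stage budget is exhausted.

    feature["cross_stage_hops"] is carried on the feature itself, so it
    survives plan -> code -> plan -> code cycles that per-stage retry
    counters would miss.
    """
    feature["cross_stage_hops"] = feature.get("cross_stage_hops", 0) + 1
    if feature["cross_stage_hops"] > MAX_CROSS_STAGE:
        return "escalate-to-human"  # a spec problem, not a routing problem
    return target_stage
```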

Critic calibration drift

After 200 features, the Code Critic has been shown many "good enough" outputs that passed human review. Its effective threshold has drifted - it now accepts code that it would have rejected in week 1. The drift is invisible in daily metrics because the system is producing outputs, but post-deployment defect rates start climbing. Fix: monthly recalibration cycle. Sample 50 critic decisions (25 accept, 25 reject) and have a human engineer score them independently. Critic accuracy vs. human judgment is the calibration metric - if accuracy drops below 85%, retraining is triggered.
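The monthly recalibration cycle reduces to a sampling-and-agreement computation. A sketch, assuming decisions are logged as `(output_id, verdict)` pairs and an independent human scoring function is available; all names here are illustrative:

```python
import random
from typing import Callable, List, Tuple

def calibration_accuracy(decisions: List[Tuple[str, str]],
                         human_score: Callable[[str], str],
                         sample_per_class: int = 25,
                         seed: int = 0) -> float:
    """Sample accepts and rejects, score agreement with human judgment.

    Sampling both classes separately (25 accept, 25 reject, per the text)
    avoids the drift going unnoticed because accepts dominate the log.
    """
    rng = random.Random(seed)
    accepts = [d for d in decisions if d[1] == "accept"]
    rejects = [d for d in decisions if d[1] == "reject"]
    sample = (rng.sample(accepts, min(sample_per_class, len(accepts)))
              + rng.sample(rejects, min(sample_per_class, len(rejects))))
    agree = sum(1 for output_id, verdict in sample
                if human_score(output_id) == verdict)
    return agree / len(sample)

# Policy from the text: accuracy < 0.85 triggers critic retraining.
```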

Stage isolation violation

The Test Critic identifies a coverage gap and wants to diagnose root cause: is it a test problem or a code problem? It cannot see the implementation plan (stage 1 output) - it only receives the code artifact and test report. It lacks the context to route correctly. It sends everything back to the Coder as a guess. The Coder rewrites working code when the real issue was an underspecified acceptance criterion in the plan. Fix: each Critic receives the original spec and all prior stage outputs as context, not only the immediately preceding output. Isolation is a performance optimization, not a correctness requirement - do not sacrifice diagnostic accuracy for context frugality.

Forward progress starvation

A strict Test Critic rejects 90% of test reports as "insufficient coverage." Each rejection routes back to the Test Runner, which re-runs with expanded scope. The feature never advances to deployment - every output triggers another retry. The system is correct in identifying coverage gaps but has no progress mechanism. Fix: within-stage retry budgets enforced at the Critic level. After N retries, the Critic must either accept with documented gaps or escalate to human review. "Never advance" is not a valid Critic decision - every gate must have a forced-advance path.
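A retry budget with a forced-advance path can be enforced as a wrapper around any critic, so no individual critic has to implement it. A sketch, under the assumptions that critics return `(verdict, detail)` tuples and features carry stable IDs:

```python
from typing import Callable, Dict, Tuple

def budgeted_critic(critic: Callable[[object], Tuple[str, object]],
                    max_retries: int = 3):
    """Wrap a critic so that 'never advance' is impossible.

    After max_retries consecutive redo verdicts for the same feature,
    the wrapper forces accept-with-documented-gaps. (An escalate-to-human
    branch could be substituted here; names are illustrative.)
    """
    counts: Dict[str, int] = {}

    def gate(feature_id: str, output: object) -> Tuple[str, object]:
        verdict, detail = critic(output)
        if verdict != "redo":
            counts.pop(feature_id, None)   # budget resets on accept/backtrack
            return verdict, detail
        counts[feature_id] = counts.get(feature_id, 0) + 1
        if counts[feature_id] >= max_retries:
            counts.pop(feature_id, None)
            return "accept-with-gaps", detail   # the forced-advance path
        return "redo", detail

    return gate
```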

Variants

| Variant | Modification | When to use |
| --- | --- | --- |
| Critic-as-Veto | Critics can only block (force retry within stage), not route backward - simpler routing graph with no cross-stage edges | Cross-stage routing adds complexity that exceeds its value - when defects are almost always localized to the current stage and backward routing rarely produces better results |
| Budget-Bounded Feedback | Each stage has a max retry count; on budget exhaustion the Critic must accept with documented gaps or escalate - forced-advance prevents starvation | Production systems with throughput SLAs - forward progress guarantees are non-negotiable and human escalation is a defined, acceptable outcome |
| Cascading Critics | Each Critic passes its judgment to the next Critic before a final gate decision - the Test Critic sees what the Plan Critic and Code Critic flagged before making its routing decision | Downstream critics need upstream quality context to diagnose root cause - prevents the stage isolation violation failure mode at the cost of tighter inter-critic coupling |

Related Patterns

| Pattern | Relationship |
| --- | --- |
| 10.11 Pipeline | Base structure without feedback - use when stage quality is sufficient without critics and rework cost is acceptable |
| 10.15 Evaluator-Optimizer | Single-stage feedback loop - use when only one stage requires iterative refinement rather than quality gates at every stage |
| 30.31 Feedback Loop | Feedback over time vs. feedback within a single execution - Critics in this pattern make per-execution routing decisions; Feedback Loop aggregates decisions across executions to improve models |

Investment Signal

The Critic ensemble is the firm's quality standard encoded as software. A pipeline that runs with no per-stage critics is a pipeline that cannot tell you where quality is lost - it produces outputs, and humans discover quality problems downstream. A pipeline with calibrated critics has an auditable quality profile: per-stage defect rates, critic accuracy, and backtrack distributions are all measurable.

Acquirers should ask: are the critics calibrated against ground truth, or are they hand-crafted heuristics? Hand-crafted critics drift and degrade as the underlying task distribution shifts. Critics trained against historical labeled outputs compound over time - they are organizational knowledge, not code.

Red flag: a system where the total cross-stage backtrack rate is below 1% has critics that are either too lenient (passing defects through) or irrelevant (the pipeline does not produce defects at those stages). Neither is evidence of a well-functioning quality layer. Target: critics that block 10-30% of outputs within-stage and route 5-15% backward. A critic that never blocks is not a critic - it is overhead.