Shadow-Mode Canary Score: The Pre-Launch Gate That Prevents Agent Rollout Regret

A weighted canary score with pass/fail thresholds, owner actions, and rollout stages so teams can graduate workflows from shadow mode safely.

Moving from shadow mode to autonomous writes without a hard gate is reckless.

Most rollout regret comes from one mistake: treating “it looks good” as evidence.

Operator Insight

The core argument: graduation from shadow mode requires a weighted score, minimum sample quality, and strict fail conditions.

Canary Score Formula

Canary Score = 0.30A + 0.20L + 0.20O + 0.15F + 0.15C

  • A: action accuracy versus accepted outcomes
  • L: latency stability at p95
  • O: override pressure (fewer operator overrides score higher)
  • F: failure containment performance
  • C: operator clarity/readiness
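
To make the arithmetic concrete, here is a minimal Python sketch of the formula. It assumes each dimension is already normalized to a 0-100 scale where higher is better, which the article implies through its 70 and 85 thresholds but does not state outright. The function name `canary_score` and the sample values are hypothetical.

```python
# Weights copied from the formula above; they sum to 1.0.
WEIGHTS = {"A": 0.30, "L": 0.20, "O": 0.20, "F": 0.15, "C": 0.15}


def canary_score(dims: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores.

    ASSUMPTION: every dimension is pre-normalized to 0-100, higher is better.
    """
    missing = WEIGHTS.keys() - dims.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= dims[name] <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {dims[name]}")
    return sum(weight * dims[name] for name, weight in WEIGHTS.items())


# Hypothetical sample values, for illustration only.
example = {"A": 90, "L": 80, "O": 85, "F": 75, "C": 95}
print(round(canary_score(example), 2))
```

With these sample values the weighted terms are 27 + 16 + 17 + 11.25 + 14.25, so the total lands at 85.5.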

Graduation Policy

Score band | Action                                          | Owner
>= 85      | Promote to limited autonomy                     | Dev lead + on-call operator
70-84      | Stay in shadow mode and patch weakest dimension | Workflow owner
< 70       | Block promotion                                 | Incident captain

Hard stop: promotion is blocked if any single dimension is below 70, even when the total score passes.
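
The bands and the hard stop can be sketched as a single decision function. This is one possible reading, not a definitive implementation. The article does not say who owns a hard-stop outcome, so the sketch assumes it is routed to the "Block promotion" row. It also assumes fractional totals between 84 and 85 stay in shadow mode. The name `graduation_decision` is hypothetical.

```python
PROMOTE_AT = 85       # lower edge of the ">= 85" band
SHADOW_AT = 70        # lower edge of the "70-84" band
DIMENSION_FLOOR = 70  # hard stop: no single dimension may fall below this


def graduation_decision(total: float, dims: dict[str, float]) -> tuple[str, str]:
    """Return (action, owner) for a total score and its per-dimension scores."""
    if total >= PROMOTE_AT:
        weakest = min(dims, key=dims.get)
        if dims[weakest] < DIMENSION_FLOOR:
            # ASSUMPTION: the source only says promotion is blocked here;
            # routing the hard stop to the incident captain is our guess.
            return (f"Block promotion (hard stop on {weakest})", "Incident captain")
        return ("Promote to limited autonomy", "Dev lead + on-call operator")
    if total >= SHADOW_AT:
        # ASSUMPTION: totals such as 84.5 fall into the 70-84 band.
        return ("Stay in shadow mode and patch weakest dimension", "Workflow owner")
    return ("Block promotion", "Incident captain")


# Hypothetical dimension values: a passing total with weak failure containment.
print(graduation_decision(86, {"A": 92, "L": 90, "O": 90, "F": 62, "C": 88}))
```

Checking the per-dimension floor only after the total passes keeps the 70-84 band mapped to the workflow owner, which is where the table sends patch work.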

Minimum Evidence Requirements

  • At least 50 representative cases
  • At least one peak-load window
  • At least one injected failure drill
  • Signed pass/fail decision log
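
The evidence checklist above can be enforced before the score is even consulted. A minimal sketch follows; the `EvidencePack` fields are hypothetical names for the four requirements, and how a team decides that a case is "representative" or a log is "signed" is left outside the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidencePack:
    # Hypothetical field names, one per requirement in the list above.
    representative_cases: int
    peak_load_windows: int
    injected_failure_drills: int
    decision_log_signed: bool


def evidence_gaps(pack: EvidencePack) -> list[str]:
    """Return the unmet requirements; an empty list means the evidence is sufficient."""
    gaps = []
    if pack.representative_cases < 50:
        gaps.append("need at least 50 representative cases")
    if pack.peak_load_windows < 1:
        gaps.append("need at least one peak-load window")
    if pack.injected_failure_drills < 1:
        gaps.append("need at least one injected failure drill")
    if not pack.decision_log_signed:
        gaps.append("need a signed pass/fail decision log")
    return gaps


pack = EvidencePack(representative_cases=42, peak_load_windows=1,
                    injected_failure_drills=0, decision_log_signed=True)
for gap in evidence_gaps(pack):
    print(gap)
```

Returning the full list of gaps, rather than failing on the first one, gives the workflow owner everything to fix in one pass.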

Concrete example: a total score of 86 with failure containment at 62 still fails the gate, because the hard stop blocks promotion when any dimension is below 70.

Rollout Playbook

Stage 1: Shadow Mode (0% User Impact)

  • Log decisions only.
  • Compare against accepted human outcomes.
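
One way to turn shadow-mode logs into the action accuracy dimension is sketched below. It assumes exact string equality between the agent's proposed action and the accepted human outcome, which is a simplification; real workflows may need fuzzier matching. `ShadowRecord` and `action_accuracy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowRecord:
    case_id: str
    agent_action: str     # what the agent would have written (logged only)
    accepted_action: str  # what the human did and the team accepted


def action_accuracy(records: list[ShadowRecord]) -> float:
    """Percent of cases where the agent's logged decision matches the accepted outcome.

    ASSUMPTION: exact equality is an adequate notion of "match".
    """
    if not records:
        raise ValueError("no shadow records to compare")
    matches = sum(r.agent_action == r.accepted_action for r in records)
    return 100.0 * matches / len(records)


# Hypothetical log entries, for illustration only.
log = [
    ShadowRecord("c1", "refund", "refund"),
    ShadowRecord("c2", "escalate", "escalate"),
    ShadowRecord("c3", "refund", "deny"),
    ShadowRecord("c4", "close", "close"),
]
print(action_accuracy(log))
```

Three of the four sample records agree, so this log scores 75.0 on the 0-100 scale.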

Stage 2: Limited Canary (5-10%)

  • Enable low-blast-radius autonomous writes.
  • Monitor accuracy drift, latency tail, and overrides.

Stage 3: Expansion (25% -> 50% -> 100%)

  • Expand only if score stays >= 85 across two windows.
  • Freeze if any P1/P2 incident appears.
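
The expansion rule can be sketched as a function that returns the next traffic percentage. Several details are assumptions: "two windows" is read as the two most recent scoring windows, "freeze" is read as holding the current percentage, and the same rule is applied to the step from the limited canary to 25%. The name `next_traffic_percent` is hypothetical.

```python
EXPANSION_STEPS = [25, 50, 100]  # Stage 3 ladder from the playbook
PROMOTE_AT = 85
FREEZE_SEVERITIES = {"P1", "P2"}


def next_traffic_percent(current: int,
                         window_scores: list[float],
                         open_incidents: list[str]) -> int:
    """Return the traffic percentage to use for the next window."""
    # Freeze: any P1/P2 incident holds the rollout where it is.
    # ASSUMPTION: "freeze" means hold, not roll back.
    if any(sev in FREEZE_SEVERITIES for sev in open_incidents):
        return current
    # ASSUMPTION: "two windows" means the two most recent scoring windows.
    recent = window_scores[-2:]
    if len(recent) < 2 or min(recent) < PROMOTE_AT:
        return current
    # Advance to the first step above the current percentage, if any.
    for step in EXPANSION_STEPS:
        if step > current:
            return step
    return current


print(next_traffic_percent(10, [86, 88], []))      # both windows pass
print(next_traffic_percent(50, [90, 91], ["P2"]))  # frozen by incident
```

The first call advances from 10% to 25%; the second stays at 50% because of the open P2.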

Tradeoffs and Limits

  • Strict sample requirements slow down launches.
  • Collecting representative samples can be costly for niche workflows.
  • A high score can still miss new failure modes that appear after product changes.
  • Overweighting latency can promote fast but low-quality behavior.
