Moving from shadow mode to autonomous writes without a hard gate is reckless.
Most rollout regret comes from one mistake: treating “it looks good” as evidence.
Operator Insight
The core argument: graduation from shadow mode requires a weighted score, minimum sample quality, and strict fail conditions.
Canary Score Formula
Canary Score = 0.30A + 0.20L + 0.20O + 0.15F + 0.15C
- A: action accuracy versus accepted outcomes
- L: latency stability at p95
- O: override pressure
- F: failure containment performance
- C: operator clarity/readiness
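As a rough illustration, the formula can be computed directly from the five dimension scores. This is a minimal sketch, not a reference implementation: it assumes each dimension is already normalized to a 0-100 scale where higher is better (the thresholds in the policy suggest this, but the scale is not stated explicitly), and the names `WEIGHTS` and `canary_score` are hypothetical.

```python
# Weights copied from the Canary Score formula; they sum to 1.0.
WEIGHTS = {"A": 0.30, "L": 0.20, "O": 0.20, "F": 0.15, "C": 0.15}


def canary_score(dims: dict[str, float]) -> float:
    """Return the weighted sum of the five dimension scores.

    Assumption: every dimension is scored 0-100, higher is better.
    """
    missing = WEIGHTS.keys() - dims.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= dims[name] <= 100:
            raise ValueError(f"dimension {name} is outside the assumed 0-100 range")
    return sum(weight * dims[name] for name, weight in WEIGHTS.items())


# Illustrative inputs only, not measured data.
example = {"A": 92, "L": 90, "O": 90, "F": 62, "C": 88}
print(round(canary_score(example), 1))  # 86.1
```

Note that a weak dimension can hide inside a passing total: here failure containment is 62, yet the weighted score still lands above 85.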
Graduation Policy
| Score band | Action | Owner |
|---|---|---|
| >= 85 | Promote to limited autonomy | Dev lead + on-call operator |
| 70-84 | Stay in shadow mode and patch weakest dimension | Workflow owner |
| < 70 | Block promotion | Incident captain |
Hard stop: promotion is blocked if any single dimension is below 70, even when total score passes.
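The bands and the hard stop can be expressed as one decision function. The sketch below is illustrative: `Decision` and `graduation_decision` are hypothetical names, fractional totals between 84 and 85 are assumed to fall in the shadow-mode band, and the owner for a hard-stop block is left empty because the policy does not name one.

```python
from dataclasses import dataclass
from typing import Optional

PROMOTE_AT = 85       # ">= 85" band
SHADOW_AT = 70        # lower edge of the "70-84" band
DIMENSION_FLOOR = 70  # hard stop: no single dimension may be below this


@dataclass
class Decision:
    action: str
    owner: Optional[str]  # None where the policy names no owner
    reason: str


def graduation_decision(total: float, dims: dict[str, float]) -> Decision:
    """Map a total score and its dimension scores to a policy action."""
    if total >= PROMOTE_AT:
        weak = sorted(name for name, value in dims.items() if value < DIMENSION_FLOOR)
        if weak:
            # Hard stop applies even though the total passes.
            reason = f"hard stop: {', '.join(weak)} below {DIMENSION_FLOOR}"
            return Decision("Block promotion", None, reason)
        return Decision("Promote to limited autonomy",
                        "Dev lead + on-call operator", "score >= 85")
    if total >= SHADOW_AT:
        weakest = min(dims, key=dims.get)  # assumes dims is non-empty
        return Decision("Stay in shadow mode and patch weakest dimension",
                        "Workflow owner", f"weakest dimension: {weakest}")
    return Decision("Block promotion", "Incident captain", "score < 70")


decision = graduation_decision(86, {"A": 92, "L": 90, "O": 90, "F": 62, "C": 88})
print(decision.action, "|", decision.reason)  # Block promotion | hard stop: F below 70
```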
Minimum Evidence Requirements
- At least 50 representative cases
- At least one peak-load window
- At least one injected failure drill
- Signed pass/fail decision log
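These requirements lend themselves to a simple pre-check that runs before any score is considered. The following is a minimal sketch under assumed field names (`Evidence`, `evidence_gaps`, and every attribute are hypothetical); how "representative" is judged is left to the team and is not modeled here.

```python
from dataclasses import dataclass

MIN_REPRESENTATIVE_CASES = 50


@dataclass
class Evidence:
    representative_cases: int
    peak_load_windows: int
    failure_drills: int        # injected failure drills completed
    decision_log_signed: bool  # signed pass/fail decision log exists


def evidence_gaps(evidence: Evidence) -> list[str]:
    """Return the unmet requirements; an empty list means the evidence bar is met."""
    gaps = []
    if evidence.representative_cases < MIN_REPRESENTATIVE_CASES:
        gaps.append(f"need at least {MIN_REPRESENTATIVE_CASES} representative cases")
    if evidence.peak_load_windows < 1:
        gaps.append("need at least one peak-load window")
    if evidence.failure_drills < 1:
        gaps.append("need at least one injected failure drill")
    if not evidence.decision_log_signed:
        gaps.append("need a signed pass/fail decision log")
    return gaps


# Illustrative values only.
print(evidence_gaps(Evidence(49, 1, 0, True)))
```

Returning the list of gaps rather than a bare boolean keeps the decision log specific about what is still missing.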
Concrete example: a total score of 86 with failure containment at 62 still fails the gate, because the hard stop blocks promotion when any single dimension is below 70.
Rollout Playbook
Stage 1: Shadow Mode (0% User Impact)
- Log decisions only.
- Compare against accepted human outcomes.
Stage 2: Limited Canary (5-10%)
- Enable low-blast-radius autonomous writes.
- Monitor accuracy drift, latency tail, and overrides.
Stage 3: Expansion (25% -> 50% -> 100%)
- Expand only if score stays >= 85 across two windows.
- Freeze if any P1/P2 incident appears.
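The Stage 3 rules can be sketched as a small step function. Several details here are assumptions rather than part of the playbook: "two windows" is read as the two most recent scoring windows, incidents are passed in as severity labels such as "P1", and the names `EXPANSION_STEPS` and `next_step` are hypothetical.

```python
EXPANSION_STEPS = [25, 50, 100]  # Stage 3 exposure levels, in percent
PROMOTE_AT = 85
FREEZE_SEVERITIES = {"P1", "P2"}


def next_step(current_pct: int,
              window_scores: list[float],
              incident_severities: list[str]) -> tuple[str, int]:
    """Return (action, exposure_pct) for the next rollout step."""
    # Any P1/P2 incident freezes the rollout regardless of score.
    if any(sev in FREEZE_SEVERITIES for sev in incident_severities):
        return ("freeze", current_pct)
    # Assumption: "two windows" means the two most recent windows.
    recent = window_scores[-2:]
    if len(recent) < 2 or min(recent) < PROMOTE_AT:
        return ("hold", current_pct)
    larger = [pct for pct in EXPANSION_STEPS if pct > current_pct]
    if not larger:
        return ("hold", current_pct)  # already at full exposure
    return ("expand", larger[0])


print(next_step(10, [88, 86], []))      # ('expand', 25)
print(next_step(50, [90, 90], ["P2"]))  # ('freeze', 50)
```

Checking incidents before scores is deliberate in this sketch: a strong score should never override a freeze.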
Tradeoffs and Limits
- Strict sample requirements delay launch.
- Representative sample collection can be costly for niche workflows.
- A high score can still miss new failure modes that appear after product changes.
- Overweighting latency can promote fast but low-quality behavior.