Shipping without rehearsal is not speed. It is hidden risk.
If your first real rollback happens during customer impact, your launch process is incomplete.
Operator Insight
The core argument: no autonomous launch should proceed without a timed drill that proves detection, containment, and recovery performance.
Rehearsal Readiness Score (RRS)
RRS = 0.35D + 0.35C + 0.20R + 0.10M
- D: detection speed
- C: containment speed
- R: recovery success and speed
- M: communication quality
Promotion gate: RRS >= 85 and zero critical checklist failures.
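The score and gate above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each component is scored on a 0-100 scale (the article does not state the scale, but the gate of 85 suggests it), and the names `DrillScores`, `rrs`, and `may_promote` are hypothetical.

```python
from dataclasses import dataclass

# Weights from the RRS formula, expressed as whole percentages so that
# integer component scores produce an exact result at the gate boundary.
WEIGHT_D, WEIGHT_C, WEIGHT_R, WEIGHT_M = 35, 35, 20, 10
PROMOTION_GATE = 85.0


@dataclass(frozen=True)
class DrillScores:
    """Component scores from one drill. Assumed scale: 0 to 100 each."""
    detection: float
    containment: float
    recovery: float
    communication: float


def rrs(scores: DrillScores) -> float:
    """Compute RRS = 0.35D + 0.35C + 0.20R + 0.10M."""
    parts = (scores.detection, scores.containment,
             scores.recovery, scores.communication)
    for value in parts:
        if not 0 <= value <= 100:
            raise ValueError(f"score out of assumed 0-100 range: {value}")
    weighted = (WEIGHT_D * scores.detection
                + WEIGHT_C * scores.containment
                + WEIGHT_R * scores.recovery
                + WEIGHT_M * scores.communication)
    return weighted / 100


def may_promote(scores: DrillScores, critical_failures: int) -> bool:
    """Apply the promotion gate: RRS >= 85 and zero critical checklist failures."""
    return rrs(scores) >= PROMOTION_GATE and critical_failures == 0


if __name__ == "__main__":
    drill = DrillScores(detection=90, containment=90, recovery=80, communication=70)
    print(rrs(drill), may_promote(drill, critical_failures=0))
```

Because the gate has two conditions, a high score does not compensate for a critical checklist failure: the same drill with one critical failure is blocked.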
20-Minute Drill Flow
| Minute | Stage | Pass condition | Owner |
|---|---|---|---|
| 00-03 | Inject failure | Alert or telemetry detects fault | On-call operator |
| 03-08 | Contain | Risky path paused or degraded safely | Workflow owner |
| 08-14 | Recover | Fallback or rollback succeeds | Dev lead |
| 14-20 | Communicate and log | First status update sent and timeline captured | Incident captain |
Use realistic scenarios: token expiry, dependency timeout, malformed tool output, or policy misroute.
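One way to keep a drill on schedule is to encode the flow table as data and look up which stage and owner should be active at a given elapsed minute. The sketch below is illustrative only: the `Stage` type and `stage_at` helper are hypothetical names, and treating each window as half-open (start included, end excluded) is an assumption, since the table does not say who owns the boundary minute.

```python
from typing import NamedTuple, Optional


class Stage(NamedTuple):
    start: int   # minute the stage opens (inclusive)
    end: int     # minute the stage closes (exclusive, by assumption)
    name: str
    owner: str


# Mirrors the 20-minute drill flow table.
DRILL_FLOW = [
    Stage(0, 3, "Inject failure", "On-call operator"),
    Stage(3, 8, "Contain", "Workflow owner"),
    Stage(8, 14, "Recover", "Dev lead"),
    Stage(14, 20, "Communicate and log", "Incident captain"),
]


def stage_at(minute: float) -> Optional[Stage]:
    """Return the stage active at the given elapsed minute, or None if outside the drill."""
    for stage in DRILL_FLOW:
        if stage.start <= minute < stage.end:
            return stage
    return None


if __name__ == "__main__":
    for minute in (1, 5, 10, 17, 21):
        current = stage_at(minute)
        label = f"{current.name} ({current.owner})" if current else "drill over"
        print(f"minute {minute}: {label}")
```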
Pass/Fail Thresholds
| Metric | Pass threshold | If failed |
|---|---|---|
| Detection time | <= 3 min | Patch alert mapping before launch |
| Containment time | <= 5 min | Add/repair kill-switch path |
| Recovery time | <= 10 min | Rebuild rollback flow and re-drill |
| First status update | <= 15 min | Pre-write comms templates |
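The threshold table translates directly into a small check that turns drill timings into a list of follow-up actions. This is a sketch under stated assumptions: the metric keys and the `evaluate` function are hypothetical, times are in minutes, and a missing measurement is treated as a failure (the table does not cover that case, so this is a conservative choice).

```python
from typing import Dict, List, Tuple

# (metric key, pass threshold in minutes, action if failed), taken from the table.
THRESHOLDS: List[Tuple[str, float, str]] = [
    ("detection", 3.0, "Patch alert mapping before launch"),
    ("containment", 5.0, "Add/repair kill-switch path"),
    ("recovery", 10.0, "Rebuild rollback flow and re-drill"),
    ("first_status_update", 15.0, "Pre-write comms templates"),
]


def evaluate(measured: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (metric, remediation) pairs for every failed threshold.

    An empty list means every metric passed. A metric with no recorded
    time counts as failed (assumption: unmeasured is not a pass).
    """
    failures = []
    for metric, limit, remedy in THRESHOLDS:
        value = measured.get(metric)
        if value is None or value > limit:
            failures.append((metric, remedy))
    return failures


if __name__ == "__main__":
    drill_times = {"detection": 4.0, "containment": 5.0, "recovery": 9.5,
                   "first_status_update": 12.0}
    for metric, remedy in evaluate(drill_times):
        print(f"FAIL {metric}: {remedy}")
```

Returning the remediation alongside the failed metric keeps the drill log actionable: each failure arrives with the fix the table prescribes.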
Operating Loop
Pre-Launch
- Run one drill on the exact target workflow.
- Log RRS and blockers.
- Patch runbook immediately.
- Re-run if any critical step failed.
Weekly
- Rehearse one existing production workflow.
- Rotate backup owners through the incident captain role.
- Validate rollback commands against current infra.
- Retire stale runbooks.
Tradeoffs and Limits
- Drills consume engineering time right before launch.
- Tabletop-only rehearsals can create false confidence.
- Tight pass thresholds can delay release windows.
- Rehearsal results degrade quickly after architecture changes.
Source Citations
- Google SRE Workbook: Emergency Response
- Google SRE Book: Managing Incidents
- AWS Well-Architected Reliability Pillar
- NIST AI Risk Management Framework 1.0
CTA
Gate launches with evidence: Get the Runbook Rehearsal Pack