Most release incidents get worse because teams wait too long to choose rollback scope.
In an active failure, slow certainty is worse than fast containment.
Operator Insight
The core argument: define a rollback envelope before incidents so commanders can choose patch, pause, or full-revert in minutes.
Rollback Pressure Score
Rollback Pressure = 0.40H + 0.30G + 0.30B
- H: observed customer harm (0-100)
- G: containment confidence gap (0-100)
- B: blast-radius growth rate (0-100)
Decision Bands
| Pressure score | Action | Decision owner |
|---|---|---|
| < 40 | Patch in place | Workflow owner |
| 40-69 | Pause rollout, hold traffic, patch with guardrails | Incident captain |
| >= 70 | Full rollback to last known-good version | Incident captain + release owner |
Timebox rule: if undecided after 10 minutes, choose the safer band.
Concrete example: moderate customer harm (55), high uncertainty (70), and rising blast radius (75) yields score 65.5; default action is pause-and-patch, not continued rollout.
Execution Checklist
- Record current version and rollout percentage.
- Score H, G, and B.
- Declare action band and owner.
- Publish status update with next checkpoint.
- Re-score every 10 minutes until stable.
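The scoring formula, decision bands, timebox rule and re-scoring step above, as an added example in Python (not code from the original page). It assumes "choose the safer band" means stepping up to the next more conservative band, and treats the situation as stable once two consecutive 10-minute scores land in the same band.

```python
# Weights from the Rollback Pressure formula: 0.40H + 0.30G + 0.30B.
W_HARM, W_GAP, W_BLAST = 0.40, 0.30, 0.30

# Decision bands from the table: (lower bound of score, action, decision owner).
BANDS = [
    (0, "Patch in place", "Workflow owner"),
    (40, "Pause rollout, hold traffic, patch with guardrails", "Incident captain"),
    (70, "Full rollback to last known-good version", "Incident captain + release owner"),
]
TIMEBOX_MINUTES = 10  # still undecided after this long -> take the safer band


def _check(value: int, name: str) -> int:
    """Each input is a 0-100 score; reject anything else rather than silently clamping."""
    if not 0 <= value <= 100:
        raise ValueError(f"{name} must be in 0-100, got {value}")
    return value


def pressure_score(h: int, g: int, b: int) -> float:
    """Rollback Pressure = 0.40H + 0.30G + 0.30B, rounded to one decimal."""
    score = W_HARM * _check(h, "H") + W_GAP * _check(g, "G") + W_BLAST * _check(b, "B")
    return round(score, 1)


def band_index(score: float) -> int:
    """Highest band whose lower bound the score reaches (< 40, 40-69, >= 70)."""
    return max(i for i, (lower, _, _) in enumerate(BANDS) if score >= lower)


def decide(h: int, g: int, b: int, minutes_undecided: int = 0) -> tuple:
    """Return (score, action, owner); past the timebox, step up one band (assumption: 'safer' = more conservative)."""
    score = pressure_score(h, g, b)
    idx = band_index(score)
    if minutes_undecided >= TIMEBOX_MINUTES:
        idx = min(idx + 1, len(BANDS) - 1)
    _, action, owner = BANDS[idx]
    return score, action, owner


def rescore(observations: list) -> tuple:
    """Apply decide() to successive (H, G, B) readings taken every 10 minutes.

    Returns (history, stable); stable is True once the last two readings fall in the same band.
    """
    history = [decide(*obs) for obs in observations]
    stable = len(history) >= 2 and history[-1][1] == history[-2][1]
    return history, stable


if __name__ == "__main__":
    # Constructed reading matching the page's worked example: H=55, G=70, B=75.
    print(decide(55, 70, 75))  # (65.5, 'Pause rollout, hold traffic, patch with guardrails', 'Incident captain')
```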
Tradeoffs and Limits
- Safer bands can reduce throughput during ambiguous incidents.
- Poor scoring discipline can bias toward unnecessary full rollbacks.
- Partial rollback plans often miss async workers or scheduled jobs.
- Rollback speed claims are meaningless without drill data.
Source Citations
- Google SRE Book: Reliable Product Launches
- Google SRE Book: Managing Incidents
- AWS Well-Architected Reliability Pillar
- Microsoft: Safe Deployment Practices