The first 30 minutes of an agent incident decide whether you lose an hour or lose a week.
Teams fail here for two reasons: no time budget and no clear transfer of authority.
Operator Insight
The core argument: you need a pre-committed escalation ladder with hard minute budgets, or containment drifts into debate.
Damage Exposure Model
Exposure = impacted workflows per minute * minutes uncontained * value-at-risk per workflow
Exposure grows linearly with every uncontained minute, which is why “just five more minutes” is expensive.
Concrete example: 8 workflows/min * 12 minutes * $40 value-at-risk per workflow = $3,840 of exposure before root-cause analysis even starts.
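To make the cost of delay concrete, here is a minimal sketch of the exposure model in Python. The function and variable names are illustrative assumptions, not part of any tooling; the inputs are the numbers from the example above.

```python
def exposure(workflows_per_min: float, minutes_uncontained: float,
             value_at_risk: float) -> float:
    """Exposure = impacted workflows per minute * minutes uncontained
    * value-at-risk per workflow (same units as value_at_risk)."""
    return workflows_per_min * minutes_uncontained * value_at_risk


# Worked example from the text: 8 workflows/min, 12 minutes, $40 per workflow.
baseline = exposure(8, 12, 40)

# The same incident after "just five more minutes" uncontained.
five_more = exposure(8, 17, 40)

# Marginal cost of the extra five minutes.
marginal_cost = five_more - baseline

print(baseline, five_more, marginal_cost)  # 3840 5440 1600
```

At these rates, each extra uncontained minute adds $320 of exposure, so a five-minute debate costs $1,600 before anyone has touched the root cause.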
30-Minute Escalation Ladder
| Time window | Objective | Mandatory action | Decision owner |
|---|---|---|---|
| 00:00-05:00 | Contain | Disable risky writes, cap retries, protect critical paths | On-call operator |
| 05:00-10:00 | Isolate | Quarantine failing workflows and capture representative traces | Workflow owner |
| 10:00-20:00 | Decide | Choose patch, degraded mode, or rollback | Incident captain |
| 20:00-30:00 | Stabilize + signal | Send status update and freeze net-new deploys | Incident captain + comms owner |
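One way to keep the ladder from drifting into debate is to encode it as data, so that "who owns this minute?" is a lookup and not a discussion. The sketch below is a hypothetical encoding: `Stage`, `LADDER`, and `current_stage` are assumed names, and treating each window as half-open (start inclusive, end exclusive) is an assumption the table does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    start_min: int
    end_min: int
    objective: str
    owner: str


# The four windows from the ladder table.
LADDER = [
    Stage(0, 5, "Contain", "On-call operator"),
    Stage(5, 10, "Isolate", "Workflow owner"),
    Stage(10, 20, "Decide", "Incident captain"),
    Stage(20, 30, "Stabilize + signal", "Incident captain + comms owner"),
]


def current_stage(elapsed_min: float) -> Optional[Stage]:
    """Return the stage whose window [start, end) contains elapsed_min.

    Returns None when elapsed_min falls outside the 30-minute ladder.
    """
    for stage in LADDER:
        if stage.start_min <= elapsed_min < stage.end_min:
            return stage
    return None


stage = current_stage(12.5)
print(stage.objective, "->", stage.owner)  # Decide -> Incident captain
```

Because the boundaries are explicit, authority transfers at exactly minute 5, 10, and 20 whether or not the previous stage feels finished.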
Severity Budget Matrix
| Severity | Max uncontained time | Auto-escalation rule |
|---|---|---|
| P1 | 5 min | Captain takes control immediately |
| P2 | 10 min | Captain notified at minute 5 |
| P3 | 30 min | Owner-led, captain informed async |
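The matrix can be checked mechanically in the same way. The following is a minimal sketch, assuming a hypothetical `escalation_actions` helper. The "budget breached" flag is an added assumption: the matrix defines the budgets but does not say what should happen once one is exceeded.

```python
from typing import List

# Max uncontained time in minutes, per the severity matrix.
BUDGET_MIN = {"P1": 5, "P2": 10, "P3": 30}


def escalation_actions(severity: str, minutes_uncontained: float) -> List[str]:
    """Return the actions due for an incident at this point in time."""
    actions = []
    if severity == "P1":
        # P1: captain takes control immediately.
        actions.append("captain takes control")
    elif severity == "P2" and minutes_uncontained >= 5:
        # P2: captain notified at minute 5.
        actions.append("notify captain")
    elif severity == "P3":
        # P3: owner-led, captain informed asynchronously.
        actions.append("inform captain async")
    # Assumption: flag any incident that has outlived its budget.
    if minutes_uncontained > BUDGET_MIN[severity]:
        actions.append("budget breached")
    return actions


print(escalation_actions("P2", 4))   # []
print(escalation_actions("P2", 11))  # ['notify captain', 'budget breached']
```

Running a check like this on a timer during an incident removes the judgment call about when to escalate; the only human decision left is the severity label itself.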
Operations Playbook
Daily (10 Minutes)
- Review yesterday’s incident timelines.
- Flag any minute-budget violation.
- Patch one ladder step that caused delay.
- Confirm next 24h captain coverage.
Weekly (30-45 Minutes)
- Run one timed escalation drill.
- Measure containment latency and communication latency.
- Tune severity mapping based on false escalations.
- Validate quarantine and rollback commands.
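The drill measurements above can be computed from three timestamps per drill. This sketch assumes containment latency means detection to containment and communication latency means detection to first status update; the text does not define either term, and the drill records below are made-up illustrations, not real data.

```python
from datetime import datetime
from statistics import median

FMT = "%Y-%m-%dT%H:%M:%S"


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


# Hypothetical drill records (illustrative values only).
drills = [
    {"detected": "2024-01-08T10:00:00",
     "contained": "2024-01-08T10:04:30",
     "first_update": "2024-01-08T10:22:00"},
    {"detected": "2024-01-15T14:00:00",
     "contained": "2024-01-15T14:07:00",
     "first_update": "2024-01-15T14:31:00"},
]

containment = [minutes_between(d["detected"], d["contained"]) for d in drills]
communication = [minutes_between(d["detected"], d["first_update"]) for d in drills]

# The ladder expects the status update inside the 20:00-30:00 window.
late_updates = [m for m in communication if m > 30]

print(median(containment), median(communication), late_updates)
# 5.75 26.5 [31.0]
```

Tracking the median across drills shows whether the ladder is getting faster, while the list of late updates points at the specific drill to review.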
Tradeoffs and Limits
- Fast containment can over-throttle healthy traffic during ambiguous incidents.
- Strict captain authority can feel heavy in small teams; still better than indecision.
- If value-at-risk is guessed poorly, severity mapping drifts.
- Ladder quality decays unless drills keep it tied to current architecture.