Most “AI incidents” are really handoff incidents: wrong owner, missing context, no next action.
If the first human responder sees a vague alert, recovery time can double before anyone touches the root cause.
Operator Insight
The core argument: handoff quality is a measurable reliability surface, and you should gate pages on packet completeness.
Handoff Completeness Score (HCS)
HCS = 0.30C + 0.25O + 0.25N + 0.20R
- C: context completeness
- O: ownership clarity
- N: next-action clarity
- R: reproducibility
Score each component 0-100.
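A minimal sketch of the HCS calculation, using the weights from the formula above. The function name, the dictionary layout, and the sample component scores are illustrative assumptions, not part of any existing tool.

```python
# Weights copied from the HCS formula; they sum to 1.0.
HCS_WEIGHTS = {"C": 0.30, "O": 0.25, "N": 0.25, "R": 0.20}


def handoff_completeness_score(components: dict) -> float:
    """Return the weighted HCS for component scores in the range 0-100."""
    for name in HCS_WEIGHTS:
        if name not in components:
            raise ValueError(f"missing component score: {name}")
        if not 0 <= components[name] <= 100:
            raise ValueError(f"component {name} must be between 0 and 100")
    return sum(weight * components[name] for name, weight in HCS_WEIGHTS.items())


# Hypothetical packet: clear owner, but weak next action and reproducibility.
example = {"C": 80, "O": 100, "N": 60, "R": 50}
print(round(handoff_completeness_score(example), 1))
```

With these sample scores the result is about 74: clear ownership alone does not lift a packet over any of the gates when the next action and reproduction steps are weak.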
HCS Gating Policy
| Severity | Minimum HCS | What happens if below threshold |
|---|---|---|
| P1 | 90 | Auto-enrich packet, then page |
| P2 | 85 | Route to owner queue until packet is complete |
| P3 | 70 | Async ticket with template completion required |
Concrete example: a P1 alert with a missing run_id and no rollback command should not page anyone until enrichment fills both fields.
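The gating table can be sketched as a lookup plus one comparison. The action strings, the `gate_decision` name, and the "proceed" result for packets at or above the threshold are assumptions made for illustration. The `manual_override` flag reflects the manual override for obvious P1 harm discussed under Tradeoffs and Limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateRule:
    min_hcs: int        # minimum HCS from the policy table
    below_action: str   # what happens when the packet scores below it


GATING_POLICY = {
    "P1": GateRule(90, "auto_enrich_then_page"),
    "P2": GateRule(85, "route_to_owner_queue"),
    "P3": GateRule(70, "async_ticket_with_template"),
}


def gate_decision(severity: str, hcs: float, manual_override: bool = False) -> str:
    """Decide what to do with a handoff packet of the given severity and HCS."""
    rule = GATING_POLICY[severity]
    if hcs >= rule.min_hcs:
        return "proceed"
    # Assumption: the override applies to P1 only, for obvious harm.
    if severity == "P1" and manual_override:
        return "page_with_override"
    return rule.below_action


print(gate_decision("P1", 72))
print(gate_decision("P1", 72, manual_override=True))
```

Keeping the thresholds in a single table means a policy change is a data edit rather than a code change.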
Required Handoff Packet (10 Fields)
- run_id and workflow_id
- Objective being executed
- Failure class
- Last successful step
- Failing step and error summary
- Blast-radius estimate (users/jobs)
- Immediate recommended action
- Rollback or fallback command
- Primary owner and backup owner
- Packet TTL (for urgency decay)
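One way to make the packet checkable is a typed record plus a function that lists empty fields. This is a sketch: the attribute names are hypothetical, and the two paired items (run_id with workflow_id, primary with backup owner) are split into separate attributes, so the ten list items become twelve attributes. The sample values are invented for the example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class HandoffPacket:
    run_id: Optional[str] = None
    workflow_id: Optional[str] = None
    objective: Optional[str] = None
    failure_class: Optional[str] = None
    last_successful_step: Optional[str] = None
    failing_step_and_error: Optional[str] = None
    blast_radius_estimate: Optional[str] = None
    recommended_action: Optional[str] = None
    rollback_command: Optional[str] = None
    primary_owner: Optional[str] = None
    backup_owner: Optional[str] = None
    ttl_seconds: Optional[int] = None


def missing_fields(packet: HandoffPacket) -> List[str]:
    """Return names of fields that are unset or blank, in declaration order."""
    missing = []
    for f in fields(packet):
        value = getattr(packet, f.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing


# Illustrative P1 packet with no run_id and no rollback command.
packet = HandoffPacket(
    workflow_id="wf-example",
    objective="example objective",
    failure_class="timeout",
    last_successful_step="step 3",
    failing_step_and_error="step 4: upstream timeout",
    blast_radius_estimate="about 200 jobs",
    recommended_action="pause the workflow",
    primary_owner="owner-a",
    backup_owner="owner-b",
    ttl_seconds=900,
)
print(missing_fields(packet))
```

The returned list tells an enrichment step exactly what to fill before the packet is scored again.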
Escalation SLA Bands
| Severity | First response target | Escalation route | Decision authority |
|---|---|---|---|
| P1 | <= 5 min | On-call + incident captain | Incident captain |
| P2 | <= 15 min | Workflow owner + platform support | Workflow owner |
| P3 | <= 4 h | Async owner queue | Team lead |
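The first response targets translate directly into a breach check. The sketch below assumes timezone-aware timestamps and a hypothetical `first_response_breached` helper. Where no response has arrived yet, pass the current time as `checked_at`.

```python
from datetime import datetime, timedelta, timezone

# First response targets from the SLA table.
FIRST_RESPONSE_TARGET = {
    "P1": timedelta(minutes=5),
    "P2": timedelta(minutes=15),
    "P3": timedelta(hours=4),
}


def first_response_breached(severity: str, opened_at: datetime, checked_at: datetime) -> bool:
    """True if the time from opened_at to checked_at exceeds the target.

    checked_at is the first response time, or the current time if nobody
    has responded yet. Targets are "<=", so exactly on target is not a breach.
    """
    return checked_at - opened_at > FIRST_RESPONSE_TARGET[severity]


opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(first_response_breached("P1", opened, opened + timedelta(minutes=7)))
print(first_response_breached("P2", opened, opened + timedelta(minutes=7)))
```

The same seven-minute delay breaches the P1 target but not the P2 target, which is why severity has to be set before the clock is evaluated.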
Handoff Operations Playbook
Daily (15 Minutes)
- Review top 10 escalations from prior 24h.
- Score packet quality using HCS.
- Fix one recurring missing field in templates.
- Track MTTA by severity.
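Tracking MTTA (mean time to acknowledge) by severity needs only a grouped average over the prior day's escalations. This sketch assumes each escalation is available as a `(severity, seconds_to_ack)` pair. That record shape and the sample numbers are illustrative.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mtta_by_severity(escalations: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group acknowledgement times by severity and return the mean of each group."""
    buckets = defaultdict(list)
    for severity, seconds_to_ack in escalations:
        buckets[severity].append(seconds_to_ack)
    return {severity: mean(values) for severity, values in sorted(buckets.items())}


# Made-up acknowledgement times, in seconds.
sample = [("P1", 120), ("P1", 240), ("P2", 600), ("P3", 3600)]
print(mtta_by_severity(sample))
```

Comparing each mean against the first response target for that severity shows which band is drifting.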
Weekly (30 Minutes)
- Run one synthetic P1 handoff drill.
- Measure HCS, first response time, and containment time.
- Rotate backup owners through primary role.
- Remove stale routing rules.
Tradeoffs and Limits
- Strict HCS gates can delay urgent pages if enrichment is slow. Keep a manual override for obvious P1 harm.
- Teams often overfill packets with logs. More text is not better context.
- Ownership maps decay quickly with org changes. Audit coverage weekly.
- If rollback commands are untested, “complete” packets still fail in practice.
Source Citations
- Google SRE Book: Managing Incidents
- Google SRE Workbook: Emergency Response
- OpenTelemetry Context Propagation
- NIST AI Risk Management Framework 1.0