Agent Handoff Playbook: Prevent Silent Failures at 2 AM

A production handoff protocol with SLA bands, packet schema, and escalation ownership so agent incidents reach the right human before damage compounds.

Most “AI incidents” are really handoff incidents: wrong owner, missing context, no next action.

If the first human on the incident sees a vague alert, recovery time can double before anyone touches the root cause.

Operator Insight

The core argument: handoff quality is a measurable reliability surface, and you should gate pages on packet completeness.

Handoff Completeness Score (HCS)

HCS = 0.30C + 0.25O + 0.25N + 0.20R

  • C: context completeness
  • O: ownership clarity
  • N: next-action clarity
  • R: reproducibility

Score each component from 0 to 100. Because the weights sum to 1.00, HCS is also on a 0-100 scale.
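
As a minimal sketch, the formula above can be computed like this. The function and key names are illustrative assumptions; only the weights and the 0-100 component range come from the playbook.

```python
# Weights taken from the HCS formula; the key names are hypothetical.
HCS_WEIGHTS = {
    "context": 0.30,          # C: context completeness
    "ownership": 0.25,        # O: ownership clarity
    "next_action": 0.25,      # N: next-action clarity
    "reproducibility": 0.20,  # R: reproducibility
}


def handoff_completeness_score(context, ownership, next_action, reproducibility):
    """Return the weighted HCS for four component scores, each in 0-100."""
    scores = {
        "context": context,
        "ownership": ownership,
        "next_action": next_action,
        "reproducibility": reproducibility,
    }
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return sum(HCS_WEIGHTS[name] * scores[name] for name in HCS_WEIGHTS)


# Example: strong context, weak reproducibility.
# 0.30*100 + 0.25*80 + 0.25*60 + 0.20*40 = 73
example_score = round(handoff_completeness_score(100, 80, 60, 40), 1)
```

With these inputs the packet scores 73, which clears the P3 minimum but not the P1 or P2 minimums in the gating policy.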

HCS Gating Policy

Severity | Minimum HCS | What happens if below threshold
P1       | 90          | Auto-enrich packet, then page
P2       | 85          | Route to owner queue until packet is complete
P3       | 70          | Async ticket with template completion required

Concrete example: a P1 alert whose packet is missing run_id and has no rollback command should not page until enrichment fills both fields.
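
One possible way to encode the gating policy is sketched below. The thresholds come from the table above. The action labels, the "deliver" result for packets that meet the threshold, and the manual_override flag (reflecting the manual override for obvious P1 harm) are assumptions made for illustration.

```python
# Minimum HCS per severity, from the gating policy table.
MIN_HCS = {"P1": 90, "P2": 85, "P3": 70}

# Hypothetical labels for what happens when a packet is below threshold.
BELOW_THRESHOLD_ACTION = {
    "P1": "auto_enrich_then_page",
    "P2": "hold_in_owner_queue",
    "P3": "async_ticket_with_template",
}


def gate_handoff(severity, hcs, manual_override=False):
    """Decide what to do with a handoff packet given its severity and HCS."""
    if severity not in MIN_HCS:
        raise ValueError(f"unknown severity: {severity}")
    if hcs >= MIN_HCS[severity]:
        return "deliver"  # assumed label: packet is complete enough to route
    if severity == "P1" and manual_override:
        return "deliver"  # a human has judged the harm obvious; skip the gate
    return BELOW_THRESHOLD_ACTION[severity]
```

Under these assumptions, gate_handoff("P1", 73) returns "auto_enrich_then_page", and the same packet with manual_override=True is delivered immediately.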

Required Handoff Packet (10 Fields)

  1. run_id and workflow_id
  2. Objective being executed
  3. Failure class
  4. Last successful step
  5. Failing step and error summary
  6. Blast-radius estimate (users/jobs)
  7. Immediate recommended action
  8. Rollback or fallback command
  9. Primary owner and backup owner
  10. Packet TTL (for urgency decay)
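
The packet could be represented as a simple record with a check for unfilled fields, as in the sketch below. All attribute names are hypothetical; the playbook lists the ten fields but does not prescribe a schema, and items that name two values (such as run_id and workflow_id) are split into separate attributes here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HandoffPacket:
    """Illustrative schema for the ten required handoff fields."""
    run_id: Optional[str] = None
    workflow_id: Optional[str] = None
    objective: Optional[str] = None
    failure_class: Optional[str] = None
    last_successful_step: Optional[str] = None
    failing_step: Optional[str] = None
    error_summary: Optional[str] = None
    blast_radius: Optional[str] = None        # e.g. "about 1,200 users"
    recommended_action: Optional[str] = None
    rollback_command: Optional[str] = None
    primary_owner: Optional[str] = None
    backup_owner: Optional[str] = None
    ttl_seconds: Optional[int] = None         # packet TTL, assumed in seconds


def missing_fields(packet):
    """Return the names of fields that are unset or blank, in schema order."""
    missing = []
    for field in fields(packet):
        value = getattr(packet, field.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field.name)
    return missing
```

A packet built without run_id and rollback_command would report exactly those two names, which is the signal an enrichment step needs before a P1 page goes out.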

Escalation SLA Bands

Severity | First response target | Escalation route                  | Decision authority
P1       | <= 5 min              | On-call + incident captain        | Incident captain
P2       | <= 15 min             | Workflow owner + platform support | Workflow owner
P3       | <= 4 h                | Async owner queue                 | Team lead
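
The SLA bands lend themselves to a small lookup table plus a breach check. The sketch below assumes the clock starts when the handoff is opened; the dictionary layout and function names are illustrative, not part of the playbook.

```python
from datetime import datetime, timedelta

# Values copied from the SLA table; the structure itself is an assumption.
SLA_BANDS = {
    "P1": {
        "first_response": timedelta(minutes=5),
        "route": ["on-call", "incident captain"],
        "authority": "incident captain",
    },
    "P2": {
        "first_response": timedelta(minutes=15),
        "route": ["workflow owner", "platform support"],
        "authority": "workflow owner",
    },
    "P3": {
        "first_response": timedelta(hours=4),
        "route": ["async owner queue"],
        "authority": "team lead",
    },
}


def response_deadline(severity, opened_at):
    """Latest time a first response still meets the SLA target."""
    return opened_at + SLA_BANDS[severity]["first_response"]


def is_breached(severity, opened_at, now):
    """True once the first-response target has been exceeded."""
    return now > response_deadline(severity, opened_at)


# Example: a P1 opened at 02:00 must get a first response by 02:05.
opened = datetime(2024, 1, 1, 2, 0)
deadline = response_deadline("P1", opened)
```

Because the targets are written as "<=", a response that lands exactly on the deadline is treated as on time.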

Handoff Operations Playbook

Daily (15 Minutes)

  1. Review top 10 escalations from prior 24h.
  2. Score packet quality using HCS.
  3. Fix one recurring missing field in templates.
  4. Track MTTA (mean time to acknowledge) by severity.
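
For the daily review, MTTA per severity can be computed from escalation records along these lines. The record shape (a severity plus opened and acknowledged timestamps) is an assumption; adapt it to whatever your paging system exports.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mtta_by_severity(escalations):
    """Return mean time to acknowledge, in seconds, keyed by severity.

    Each escalation is a (severity, opened_at, acknowledged_at) tuple.
    Escalations that were never acknowledged (None) are skipped.
    """
    durations = defaultdict(list)
    for severity, opened_at, acknowledged_at in escalations:
        if acknowledged_at is None:
            continue
        durations[severity].append((acknowledged_at - opened_at).total_seconds())
    return {severity: mean(values) for severity, values in durations.items()}


# Example: two P1s acknowledged after 2 and 4 minutes, one P2 after 10 minutes.
sample = [
    ("P1", datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 2)),
    ("P1", datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 4)),
    ("P2", datetime(2024, 1, 1, 4, 0), datetime(2024, 1, 1, 4, 10)),
]
sample_mtta = mtta_by_severity(sample)
```

Here the P1 MTTA works out to 180 seconds, inside the 5-minute first-response target, and the P2 MTTA to 600 seconds.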

Weekly (30 Minutes)

  1. Run one synthetic P1 handoff drill.
  2. Measure HCS, first response time, and containment time.
  3. Rotate backup owners through primary role.
  4. Remove stale routing rules.

Tradeoffs and Limits

  • Strict HCS gates can delay urgent pages if enrichment is slow. Keep a manual override for obvious P1 harm.
  • Teams often overfill packets with logs. More text is not better context.
  • Ownership maps decay quickly with org changes. Audit coverage weekly.
  • If rollback commands are untested, “complete” packets still fail in practice.
