Alert Fatigue Thresholds for Agent Ops: A 3-Strike Rule That Actually Works

Noisy alerting is not a minor annoyance. It is how teams train themselves to ignore real failures.

The fix is simple and hard: page humans only when error-budget risk is real, and automate everything else.

Operator Insight

The core argument: a 3-strike policy tied to SLO burn-rate thresholds cuts alert noise without increasing missed incidents.

AAR (actionable alert rate) = pages that required human intervention / total pages

Strike	Trigger	System action	Human action
1	First qualifying failure	Retry once with jitter and attach trace context	None
2	Second failure in same class	Pause failing segment and route to fallback	Notify owner, no page
3	Third failure in same class, or burn-rate breach	Hard-stop risky path and open incident	Page on-call

A failure class is the root category (auth, timeout, policy_block, dependency), not the raw error string.
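The strike table and the failure-class rule can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the keyword map, `classify`, and `StrikeTracker` are hypothetical names, and the sketch leaves out the time window after which strikes reset, since that is not specified here.

```python
from collections import Counter

# Hypothetical keyword map; a real system would likely classify on
# structured error codes rather than substring matches.
CLASS_KEYWORDS = {
    "auth": ("401", "403", "token expired"),
    "timeout": ("timed out", "deadline exceeded"),
    "policy_block": ("policy", "blocked"),
    "dependency": ("502", "503", "connection refused"),
}

# Strike number -> (system action, human action), copied from the table above.
ACTIONS = {
    1: ("retry once with jitter and attach trace context", "none"),
    2: ("pause failing segment and route to fallback", "notify owner, no page"),
    3: ("hard-stop risky path and open incident", "page on-call"),
}


def classify(raw_error: str) -> str:
    """Collapse a raw error string into a root failure class."""
    text = raw_error.lower()
    for failure_class, keywords in CLASS_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return failure_class
    return "unknown"


class StrikeTracker:
    """Counts failures per class and returns the action for the current strike."""

    def __init__(self) -> None:
        self.strikes = Counter()

    def record(self, raw_error: str, burn_rate_breach: bool = False) -> dict:
        failure_class = classify(raw_error)
        self.strikes[failure_class] += 1
        # A burn-rate breach escalates straight to strike 3; counts cap at 3.
        strike = 3 if burn_rate_breach else min(self.strikes[failure_class], 3)
        system_action, human_action = ACTIONS[strike]
        return {
            "class": failure_class,
            "strike": strike,
            "system": system_action,
            "human": human_action,
        }


tracker = StrikeTracker()
for error in (
    "upstream timed out after 30s",
    "Deadline exceeded calling tool",
    "request timed out",
):
    print(tracker.record(error)["human"])
```

The three errors have different raw strings but share the timeout class, so they escalate together instead of each counting as a first strike.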

Metric	Threshold	Page level	Owner
Short-window burn rate	`> 14x`	P1	On-call operator
Long-window burn rate	`> 6x`	P2	On-call operator
Retry storm ratio	`> 15%` of run cost	P2	Platform owner
Unexpected policy blocks	`> 10%` of requests	P2	Security owner
Single transient error	One event	No page	Workflow owner
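One way to encode the threshold table is as data plus a small evaluator. The sketch below assumes the common definition of burn rate as the observed error rate divided by the error budget (1 minus the SLO target). The `Signals` fields, `RULES`, and `evaluate` are illustrative names, and the short and long window lengths are left to the caller because they are not given here.

```python
from typing import List, NamedTuple, Tuple


class Signals(NamedTuple):
    short_burn: float          # burn rate over the short window
    long_burn: float           # burn rate over the long window
    retry_cost_ratio: float    # retry cost as a fraction of run cost
    policy_block_ratio: float  # unexpected policy blocks as a fraction of requests


# (signal name, threshold, page level, owner), mirroring the table above.
RULES = [
    ("short_burn", 14.0, "P1", "On-call operator"),
    ("long_burn", 6.0, "P2", "On-call operator"),
    ("retry_cost_ratio", 0.15, "P2", "Platform owner"),
    ("policy_block_ratio", 0.10, "P2", "Security owner"),
]


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def evaluate(signals: Signals) -> List[Tuple[str, str, str]]:
    """Return (page level, owner, signal) for every breached threshold, P1 first."""
    pages = []
    for name, threshold, level, owner in RULES:
        # Thresholds are strict: a value exactly at the limit does not page.
        if getattr(signals, name) > threshold:
            pages.append((level, owner, name))
    return sorted(pages)


# Example with made-up numbers: 2% errors against a 99.9% SLO in the short window.
signals = Signals(
    short_burn=burn_rate(0.02, 0.999),
    long_burn=2.0,
    retry_cost_ratio=0.05,
    policy_block_ratio=0.02,
)
print(evaluate(signals))
```

A single transient error never appears in `RULES`, which matches the last table row: it produces no page regardless of the other signals.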

Concrete example: 120 pages/week with 22 actionable pages gives AAR = 22 / 120 ≈ 0.18, so roughly four out of five pages needed no human. That team should remove or demote noisy rules before adding new alerts.
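The arithmetic is simple enough to script. The sketch below computes AAR for the example and then ranks alert rules by their own AAR, which is one possible way to find candidates to remove or demote. The rule names and the page log are invented for illustration.

```python
from collections import defaultdict


def actionable_alert_rate(actionable_pages: int, total_pages: int) -> float:
    """AAR = pages that required human intervention / total pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    if not 0 <= actionable_pages <= total_pages:
        raise ValueError("actionable_pages must be between 0 and total_pages")
    return actionable_pages / total_pages


def aar_by_rule(pages):
    """Rank rules from noisiest to least noisy.

    pages: iterable of (rule_name, was_actionable) tuples.
    Returns a list of (aar, rule_name) sorted by ascending AAR.
    """
    totals = defaultdict(lambda: [0, 0])  # rule -> [actionable, total]
    for rule, actionable in pages:
        totals[rule][0] += int(actionable)
        totals[rule][1] += 1
    return sorted((a / t, rule) for rule, (a, t) in totals.items())


print(f"AAR = {actionable_alert_rate(22, 120):.2f}")  # prints "AAR = 0.18"

# Hypothetical page log: (rule that fired, whether a human had to act).
page_log = [
    ("tool_timeout", False),
    ("tool_timeout", False),
    ("auth_failure", True),
    ("auth_failure", False),
]
print(aar_by_rule(page_log))
```

Rules at the top of the ranking page often and rarely need a human, so they are the first to review.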

Aggressive de-paging can hide slow-burn failures. Keep burn-rate alerts intact.
Teams often define failure classes too narrowly; this creates duplicate pages.
Strike systems fail when fallback paths are broken. Test fallback weekly.
If on-call load is already high, policy changes without owner coverage can backfire.