Noisy alerting is not a minor annoyance. It is how teams train themselves to ignore real failures.
The fix is simple and hard: page humans only when error-budget risk is real, and automate everything else.
Operator Insight
The core argument: a 3-strike policy tied to SLO burn-rate thresholds cuts alert noise without increasing missed incidents.
Actionable Alert Ratio (AAR)
AAR = pages that required human intervention / total pages
- Target AAR >= 0.30.
- If AAR sits below 0.30 for a full week, your paging policy is too noisy.
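As a minimal sketch, the ratio and the 0.30 check can be expressed in a few lines of Python. The function names and the input validation are assumptions made for illustration; only the formula and the target come from the text above.

```python
AAR_TARGET = 0.30  # weekly target stated above


def actionable_alert_ratio(actionable_pages: int, total_pages: int) -> float:
    """AAR = pages that required human intervention / total pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    if not 0 <= actionable_pages <= total_pages:
        raise ValueError("actionable_pages must be between 0 and total_pages")
    return actionable_pages / total_pages


def is_too_noisy(actionable_pages: int, total_pages: int,
                 target: float = AAR_TARGET) -> bool:
    """True when a week's AAR falls below the target (hypothetical helper)."""
    return actionable_alert_ratio(actionable_pages, total_pages) < target


if __name__ == "__main__":
    # Weekly counts: 22 actionable pages out of 120 total.
    print(round(actionable_alert_ratio(22, 120), 2), is_too_noisy(22, 120))
```

Feeding it one week of counts at a time matches the "full week" rule; a ratio exactly at 0.30 is treated as meeting the target.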
3-Strike Matrix (Per Failure Class, 60-Minute Window)
| Strike | Trigger | System action | Human action |
|---|---|---|---|
| 1 | First qualifying failure | Retry once with jitter and attach trace context | None |
| 2 | Second failure in same class | Pause failing segment and route to fallback | Notify owner, no page |
| 3 | Third failure in same class, or burn-rate breach | Hard-stop risky path and open incident | Page on-call |
A failure class is the root category (auth, timeout, policy_block, dependency), not the raw error string.
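One way to read the matrix is as a per-class sliding-window counter. The sketch below is illustrative only: the `StrikeTracker` class, the action labels, and the choice to keep returning strike 3 for further failures inside the window are assumptions, not part of the policy as written. Timestamps are plain seconds so the example stays self-contained.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60 * 60  # 60-minute window from the matrix

# (system action, human action) per strike; labels are hypothetical
# shorthand for the matrix rows.
ACTIONS = {
    1: ("retry_once_with_jitter", "none"),
    2: ("pause_segment_and_fallback", "notify_owner"),
    3: ("hard_stop_and_open_incident", "page_on_call"),
}


class StrikeTracker:
    """Counts failures per failure class inside a sliding window."""

    def __init__(self, window_seconds: float = WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # failure class -> timestamps

    def record(self, failure_class: str, now: float,
               burn_rate_breach: bool = False):
        """Record one failure and return (strike, system_action, human_action)."""
        times = self._failures[failure_class]
        times.append(now)
        # Drop failures that have aged out of the window.
        while now - times[0] >= self.window_seconds:
            times.popleft()
        # A burn-rate breach escalates straight to strike 3.
        strike = 3 if burn_rate_breach else min(len(times), 3)
        system_action, human_action = ACTIONS[strike]
        return strike, system_action, human_action


if __name__ == "__main__":
    tracker = StrikeTracker()
    for t in (0, 600, 1200):
        print(tracker.record("timeout", now=t))
```

The tracker is keyed on the failure class rather than the raw error string, so two differently worded timeout errors count toward the same strike total.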
Severity Threshold Defaults
| Metric | Threshold | Page level | Owner |
|---|---|---|---|
| Short-window burn rate | > 14x | P1 | On-call operator |
| Long-window burn rate | > 6x | P2 | On-call operator |
| Retry storm ratio | > 15% of run cost | P2 | Platform owner |
| Unexpected policy blocks | > 10% of requests | P2 | Security owner |
| Single transient error | One event | No page | Workflow owner |
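The defaults in the table can be kept as data and checked mechanically. This is a sketch under stated assumptions: the metric keys, the `Rule` type, and the `evaluate` function are hypothetical names, and ratios are expressed as fractions (0.15 for 15%). A single transient error has no rule here, so it never pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str       # hypothetical metric key
    threshold: float  # page when the observed value is strictly greater
    page_level: str
    owner: str


RULES = [
    Rule("short_window_burn_rate", 14.0, "P1", "on-call operator"),
    Rule("long_window_burn_rate", 6.0, "P2", "on-call operator"),
    Rule("retry_storm_ratio", 0.15, "P2", "platform owner"),
    Rule("unexpected_policy_block_ratio", 0.10, "P2", "security owner"),
]


def evaluate(metrics: dict) -> list:
    """Return (page_level, owner, metric) for each breached rule, P1 first."""
    pages = [
        (rule.page_level, rule.owner, rule.metric)
        for rule in RULES
        if metrics.get(rule.metric, 0.0) > rule.threshold
    ]
    return sorted(pages)  # "P1" sorts before "P2"


if __name__ == "__main__":
    print(evaluate({"short_window_burn_rate": 15.2, "retry_storm_ratio": 0.2}))
```

Keeping the thresholds in one list also makes the weekly policy diff easy to produce, since any change shows up as a one-line edit.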
Concrete example: 120 pages/week with 22 actionable pages gives AAR = 22 / 120 ≈ 0.18, well below the 0.30 target. That team should remove or demote noisy rules before adding new alerts.
Alert Tuning Playbook
Daily (10 Minutes)
- Classify yesterday’s pages as actionable or noise.
- Remove or demote one noisy rule.
- Verify every active page rule has a named owner.
- Check that strike-2 fallback worked for top failure class.
Weekly (30 Minutes)
- Compute AAR, false-positive rate, and missed-incident count.
- Reclassify top two noisy failure classes.
- Run one no-page simulation to confirm automated containment.
- Publish policy diff so operators know what changed.
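The weekly numbers can be computed from the daily classifications. The sketch below assumes a hypothetical `Page` record and defines the false-positive rate as noise pages divided by total pages and a missed incident as an incident that no page is linked to. Neither definition is given in this playbook, so adjust them to your own.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    failure_class: str                 # root category, e.g. "timeout"
    actionable: bool                   # result of the daily classification
    incident_id: Optional[str] = None  # incident this page led to, if any


def weekly_review(pages, incident_ids, top_n=2):
    """Summarize one week of pages (assumed definitions, see lead-in)."""
    total = len(pages)
    actionable = sum(p.actionable for p in pages)
    paged_incidents = {p.incident_id for p in pages if p.incident_id}
    noise_by_class = Counter(p.failure_class for p in pages if not p.actionable)
    return {
        "aar": actionable / total if total else None,
        "false_positive_rate": (total - actionable) / total if total else None,
        "missed_incidents": len(set(incident_ids) - paged_incidents),
        "noisiest_classes": [c for c, _ in noise_by_class.most_common(top_n)],
    }


if __name__ == "__main__":
    week = [Page("timeout", False), Page("timeout", False),
            Page("auth", False), Page("dependency", True, "INC-1")]
    print(weekly_review(week, incident_ids=["INC-1", "INC-2"]))
```

The `noisiest_classes` entry gives the candidates for the "reclassify top two" step.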
Tradeoffs and Limits
- Aggressive de-paging can hide slow-burn failures. Keep burn-rate alerts intact.
- Teams often define failure classes too narrowly; this creates duplicate pages.
- Strike systems fail when fallback paths are broken. Test fallback weekly.
- If on-call load is already high, policy changes without owner coverage can backfire.