Most “slow model” complaints are really slow tool orchestration.
Without a hard latency budget, multi-tool agents quietly drift from useful to unusable.
Operator Insight
The core argument: latency must be budgeted per workflow step with explicit degrade rules, not optimized ad hoc.
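One way to make "budget plus degrade rule" concrete is to wrap each step in a timeout that falls back to a cheaper result. The sketch below is illustrative only: `run_with_budget`, `slow_retrieval`, and the 0.05s budget are hypothetical names and values, not part of any specific agent framework.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def run_with_budget(
    step: Callable[[], Awaitable[T]],
    budget_s: float,
    degrade: Callable[[], T],
) -> T:
    """Run one workflow step; if it exceeds its budget, apply the degrade rule."""
    try:
        return await asyncio.wait_for(step(), timeout=budget_s)
    except asyncio.TimeoutError:
        # Budget exceeded: the step is cancelled and the degraded result is used.
        return degrade()

async def slow_retrieval() -> str:
    await asyncio.sleep(0.2)  # stand-in for a slow tool call (hypothetical)
    return "full context"

result = asyncio.run(
    run_with_budget(slow_retrieval, budget_s=0.05, degrade=lambda: "partial context")
)
print(result)  # partial context
```

The point of the wrapper is that the degrade path is decided up front, per step, rather than improvised when a call hangs.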
End-to-End Latency Equation
L_e2e = L_plan + sum(L_tool_i) + L_post
- L_plan: planner/model decision latency
- L_tool_i: each retrieval/API/write call
- L_post: validation, formatting, and response delivery
If you run five sequential tool hops at 1.6s p95 each, you already spend about 8s on tools alone, before planning or post-processing is counted.
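The arithmetic above can be checked with a few lines. This is a rough planning sketch: the 1.5s planner and 1.0s post-processing figures are assumptions borrowed from the default budget for illustration, and per-step p95 values do not add up exactly to the end-to-end p95, so treat the sum as an estimate.

```python
def e2e_latency(plan_s: float, tool_s: list[float], post_s: float) -> float:
    """L_e2e = L_plan + sum(L_tool_i) + L_post, all in seconds."""
    return plan_s + sum(tool_s) + post_s

tool_hops = [1.6] * 5            # five sequential tool hops at 1.6s each
tool_total = sum(tool_hops)      # time spent on tools alone
total = e2e_latency(plan_s=1.5, tool_s=tool_hops, post_s=1.0)  # assumed plan/post values

print(f"tools: {tool_total:.1f}s, end to end: {total:.1f}s")
```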
Default p95 Budget (Interactive Flows)
| Step | p95 target | Degrade action | Owner |
|---|---|---|---|
| Planner/model | 1.5s | Smaller reasoning profile for low-risk intents | Model owner |
| Retrieval/read tools | 2.0s | Return partial context and continue async enrichment | Data/tool owner |
| External write tools | 2.5s | Queue write with confirmation step | Workflow owner |
| Post-processing | 1.0s | Trim non-critical formatting | App owner |
| Transport overhead | 1.0s | Send immediate progress state | Channel owner |
Total: 8.0s p95 budget.
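Keeping the budget as data rather than prose makes it checkable. A minimal sketch, assuming the table above is the source of truth: the `StepBudget` type, the short step keys, and the `over_budget` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepBudget:
    step: str
    p95_target_s: float
    degrade_action: str
    owner: str

# Rows copied from the default budget table; step keys are illustrative.
BUDGET = [
    StepBudget("planner", 1.5, "Smaller reasoning profile for low-risk intents", "Model owner"),
    StepBudget("retrieval", 2.0, "Return partial context and continue async enrichment", "Data/tool owner"),
    StepBudget("external_write", 2.5, "Queue write with confirmation step", "Workflow owner"),
    StepBudget("post_processing", 1.0, "Trim non-critical formatting", "App owner"),
    StepBudget("transport", 1.0, "Send immediate progress state", "Channel owner"),
]

def over_budget(observed_p95_s: dict[str, float]) -> list[StepBudget]:
    """Return the steps whose observed p95 exceeds the target."""
    return [b for b in BUDGET if observed_p95_s.get(b.step, 0.0) > b.p95_target_s]

total_budget_s = sum(b.p95_target_s for b in BUDGET)
print(f"total p95 budget: {total_budget_s:.1f}s")

for b in over_budget({"planner": 1.2, "retrieval": 2.4}):
    print(f"{b.step}: {b.degrade_action} ({b.owner})")
```

Each breach then maps directly to a degrade action and an owner, which is what the table is meant to enforce.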
Guardrail Policy
| Metric | Trigger | Immediate action |
|---|---|---|
| Workflow p95 | > 8s for 30 min | Reduce concurrency and queue non-urgent jobs |
| Timeout rate | > 2% over 1h | Switch to fallback dependency path |
| Tool-level p95 | Above budget for 3 windows | Bypass or replace slow tool |
| Queue median wait | > 2s | Re-tier queue priorities or add workers |
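The guardrail table translates naturally into a rule check. The sketch below assumes the metrics are already aggregated over the windows named in the table, and it reads "3 windows" as three consecutive windows, which is an interpretation rather than something the policy states. `Metrics` and `guardrail_actions` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    workflow_p95_s: float          # workflow p95 over the trailing 30 minutes
    timeout_rate: float            # fraction of requests timing out over the trailing hour
    tool_windows_over_budget: int  # consecutive windows a tool's p95 exceeded its budget (assumed)
    queue_median_wait_s: float     # median queue wait

def guardrail_actions(m: Metrics) -> list[str]:
    """Return the immediate actions triggered by the current metrics."""
    actions = []
    if m.workflow_p95_s > 8.0:
        actions.append("Reduce concurrency and queue non-urgent jobs")
    if m.timeout_rate > 0.02:
        actions.append("Switch to fallback dependency path")
    if m.tool_windows_over_budget >= 3:
        actions.append("Bypass or replace slow tool")
    if m.queue_median_wait_s > 2.0:
        actions.append("Re-tier queue priorities or add workers")
    return actions

current = Metrics(workflow_p95_s=9.2, timeout_rate=0.01,
                  tool_windows_over_budget=3, queue_median_wait_s=0.5)
for action in guardrail_actions(current):
    print(action)
```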
Practical Loop
Daily (15 Minutes)
- Rank workflows by p95 drift.
- Isolate bottleneck segment (plan, tool, or post).
- Ship one fix only.
- Verify p95 and timeout delta next day.
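The first two daily steps can be sketched as two small functions. Two assumptions are made here: drift is defined as today's p95 minus a baseline p95, and the bottleneck is the segment whose p95 grew the most. The workflow names and numbers are made up for illustration.

```python
def rank_by_drift(baseline: dict[str, float], current: dict[str, float]) -> list[tuple[str, float]]:
    """Rank workflows by p95 drift (current minus baseline), largest first."""
    drifts = {wf: current[wf] - baseline[wf] for wf in current if wf in baseline}
    return sorted(drifts.items(), key=lambda kv: kv[1], reverse=True)

def bottleneck_segment(baseline_seg: dict[str, float], current_seg: dict[str, float]) -> str:
    """Return the segment (plan, tool, or post) whose p95 grew the most."""
    return max(current_seg, key=lambda s: current_seg[s] - baseline_seg.get(s, 0.0))

# Hypothetical workflow-level p95 values in seconds.
baseline = {"refund": 6.0, "lookup": 3.0, "booking": 7.0}
today = {"refund": 9.5, "lookup": 3.25, "booking": 7.5}
ranking = rank_by_drift(baseline, today)

# Hypothetical per-segment p95 values for the worst workflow.
worst_segment = bottleneck_segment(
    {"plan": 1.0, "tool": 4.0, "post": 1.0},
    {"plan": 1.25, "tool": 7.0, "post": 1.25},
)
print(ranking[0][0], worst_segment)
```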
Weekly (30 Minutes)
- Re-allocate step budgets from observed data.
- Remove one non-essential tool hop from top offender workflows.
- Rehearse degraded-mode user messaging on one critical flow.
- Publish budget ownership updates.
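For the weekly re-allocation, one simple option is to rescale observed p95 values so they still sum to the overall budget. This proportional rule is an assumption made for illustration, not a method the loop above prescribes; `reallocate` and the observed values are hypothetical.

```python
def reallocate(observed_p95_s: dict[str, float], total_budget_s: float = 8.0) -> dict[str, float]:
    """Scale observed per-step p95 values so they sum to the total budget."""
    observed_total = sum(observed_p95_s.values())
    if observed_total <= 0:
        raise ValueError("observed p95 values must sum to a positive number")
    scale = total_budget_s / observed_total
    return {step: round(p95 * scale, 2) for step, p95 in observed_p95_s.items()}

# Hypothetical observed p95 values (seconds) that add up to 10s against an 8s budget.
observed = {"planner": 2.0, "retrieval": 3.0, "write": 3.0, "post": 1.0, "transport": 1.0}
new_budget = reallocate(observed)
print(new_budget)
```

A proportional split keeps the total fixed while shifting budget toward the steps that actually consume it; owners may still want to cap individual steps by hand.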
Tradeoffs and Limits
- Tight budgets can reduce response depth on complex tasks.
- Degradation paths can preserve speed while silently reducing quality.
- Tail-latency tuning may increase infra cost.
- If ownership is unclear per step, budgets become decorative.