Normal path: plan → execute tool → observe.
Quick take
- Take a tool health snapshot at run start (breaker state + recent errors).
- If a critical dependency is degraded, disable it for the run and switch to degrade mode.
- Return partial results + explicit stop reason (don’t spin until timeout).
- Budgets still apply (time/tool calls/spend) — outages amplify loops.
Problem-first intro
It’s not a full outage. It’s worse.
One tool is flaky:
- sometimes 200
- sometimes timeout
- sometimes 502
Your agent keeps trying to “finish the task”. Users keep retrying because they get timeouts. Budgets keep burning because every retry is a new run.
Partial outages are where you learn whether your agent is an engineer or a gambler.
Why this fails in production
Partial outages are hard because success is intermittent. That tempts loops.
1) The agent treats intermittent success as “keep trying”
LLMs are optimistic. If they get one partial result, they’ll often keep going to “complete it”.
That’s fine in a notebook. In prod it’s runaway spend.
2) No concept of tool health
If the agent doesn’t know “tool X is degraded”, it will:
- keep calling it
- retry it
- replan around it and call it again
You need a shared health signal:
- circuit breaker state
- recent error rate
- latency spikes
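One cheap way to build that signal is a rolling window of recent call results per tool. A minimal sketch, assuming illustrative names and thresholds (`ToolHealth`, 30% error rate, 2s p95 — none of these come from this page):

```python
from collections import deque

class ToolHealth:
    """Rolling-window health signal per tool (thresholds are illustrative)."""

    def __init__(self, window: int = 20, max_error_rate: float = 0.3,
                 max_p95_ms: float = 2000.0):
        self.window = window
        self.max_error_rate = max_error_rate
        self.max_p95_ms = max_p95_ms
        self.samples: dict[str, deque] = {}  # tool -> (ok, latency_ms) pairs

    def record(self, tool: str, ok: bool, latency_ms: float) -> None:
        buf = self.samples.setdefault(tool, deque(maxlen=self.window))
        buf.append((ok, latency_ms))

    def is_degraded(self, tool: str) -> bool:
        buf = self.samples.get(tool)
        if not buf:
            return False  # no data yet: assume healthy
        errors = sum(1 for ok, _ in buf if not ok)
        latencies = sorted(ms for _, ms in buf)
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        return errors / len(buf) > self.max_error_rate or p95 > self.max_p95_ms
```

In real systems you would feed this from the tool gateway and share it across runs; the point is that "degraded" is a computed property of recent calls, not a guess the model makes mid-run.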
3) No safe-mode behavior
When a tool is degraded, you need a plan that doesn’t depend on it:
- use cached data
- return partial results
- stop with a reason and let the user decide
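That fallback ladder can be expressed as one small decision function. A sketch, assuming a hypothetical in-process cache and an injected `read_kb` callable (both illustrative, not from the code below):

```python
from typing import Any, Callable, Optional

# Hypothetical in-process cache; in real code this might be Redis or similar.
CACHE: dict[str, str] = {"refund policy": "Refunds within 30 days (cached copy)."}

def answer(question: str, kb_degraded: bool,
           read_kb: Optional[Callable[[str], str]] = None) -> dict[str, Any]:
    """Safe-mode ladder: live KB -> cached data -> stop with a reason."""
    if not kb_degraded and read_kb is not None:
        return {"status": "ok", "answer": read_kb(question)}
    cached = CACHE.get(question)
    if cached is not None:
        return {"status": "degraded", "answer": cached,
                "note": "Served from cache; KB is degraded."}
    return {"status": "degraded", "answer": None,
            "stop_reason": "kb.read degraded and no cached answer; try again later."}
```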
4) “All or nothing” outputs force bad behavior
If your API contract is “always return the full answer”, your agent will thrash during partial outages.
Better contract:
- return partial results + confidence + stop reason
- optionally: an async continuation
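The better contract can be as small as one dataclass. These field names are a suggestion, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentResult:
    """Response contract that admits partial answers (field names illustrative)."""
    status: str                            # "ok" | "partial" | "degraded"
    answer: Optional[str] = None           # may be incomplete when status != "ok"
    confidence: float = 1.0                # lower when built from cache/partial data
    stop_reason: Optional[str] = None      # why we stopped early, user-visible
    continuation_id: Optional[str] = None  # optional async follow-up handle

res = AgentResult(status="partial", answer="Top 2 of 5 sources checked…",
                  confidence=0.4, stop_reason="kb.read degraded",
                  continuation_id="run-123")
```

Once callers can consume `status="partial"`, the agent no longer has to choose between thrashing and timing out.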
Implementation example (real code)
This pattern uses a “health snapshot” taken at the start of a run. If a critical tool is degraded, we:
- disable it for the run
- switch to safe-mode behavior
- return partial results with an explicit stop reason
Python:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Health:
    degraded_tools: set[str]

def snapshot_health() -> Health:
    # In real code: breaker states + recent error rates.
    return Health(degraded_tools=set(get_degraded_tools()))  # (pseudo)

def safe_tools_for_run(health: Health) -> set[str]:
    allow = {"search.read", "kb.read", "http.get"}
    # During outages: be conservative.
    for t in health.degraded_tools:
        allow.discard(t)
    return allow

def run(task: str) -> dict[str, Any]:
    health = snapshot_health()
    allow = safe_tools_for_run(health)
    if "kb.read" not in allow:
        return {
            "status": "degraded",
            "reason": "kb.read degraded",
            "partial": "I can’t reliably read the KB right now. Here’s what I can do without it…",
        }
    # Normal loop would run here with a tool gateway allowlist = allow.
    return agent_loop(task, allow=allow)  # (pseudo)
```

The same pattern in TypeScript:

```ts
export function snapshotHealth() {
  // Real code: breaker states + recent error rates.
  return { degradedTools: new Set(getDegradedTools()) }; // (pseudo)
}

export function safeToolsForRun(health) {
  const allow = new Set(["search.read", "kb.read", "http.get"]);
  for (const t of health.degradedTools) allow.delete(t);
  return allow;
}

export function run(task) {
  const health = snapshotHealth();
  const allow = safeToolsForRun(health);
  if (!allow.has("kb.read")) {
    return {
      status: "degraded",
      reason: "kb.read degraded",
      partial: "I can’t reliably read the KB right now. Here’s what I can do without it…",
    };
  }
  return agentLoop(task, { allow }); // (pseudo)
}
```

This is intentionally conservative. During partial outages, your goal is not “succeed at all costs”. Your goal is “don’t turn a partial outage into a full outage”.
Example incident (numbers are illustrative)
Example: an agent that answered support questions using kb.read.
The KB service degraded (p95 latency from ~300ms → 9s, intermittent timeouts). Our agent kept trying because sometimes it worked.
Impact:
- average run time: 8s → 52s
- client retries doubled traffic
- on-call got paged for “agent timeouts”, not for the real KB issue
- spend increased ~$180/day just on retries + longer prompts
Fix:
- health snapshot + degrade mode
- fail fast after breaker opens
- return partial results + clear stop reason
- a “retry later” hint instead of silent timeouts
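"Fail fast after breaker opens" and the "retry later" hint can be sketched with a minimal consecutive-failure breaker; the class name, threshold, and cooldown here are illustrative assumptions, not the incident's actual values:

```python
import time

class BreakerOpen(Exception):
    """Raised instead of waiting on a tool that is known to be degraded."""

class Breaker:
    """Minimal circuit breaker: open after N consecutive failures, then fail fast."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            remaining = self.cooldown_s - (time.monotonic() - self.opened_at)
            if remaining > 0:
                # Fail fast with a user-visible retry hint instead of a silent timeout.
                raise BreakerOpen(f"tool degraded; retry in ~{int(remaining)}s")
            self.opened_at = None  # half-open: allow one probe call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The `BreakerOpen` message is exactly the kind of explicit stop reason that turns "agent timeouts" pages into "KB is degraded" pages.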
Partial outages are where we learned: user-visible stop reasons are a feature.
Trade-offs
- Degrade mode answers are less complete.
- Failing fast reduces success rate in the moment.
- Health signals can be wrong (false positives). That’s better than thrashing.
When NOT to use
- If you need strict completeness, run async and report progress instead of looping synchronously.
- If you can’t define partial output semantics, you’ll be forced into timeouts (bad).
- If you don’t have tool health signals, start with budgets and breaker defaults.
Copy-paste checklist
- [ ] Tool health snapshot at run start
- [ ] Degrade mode policy (tools disabled, read-only, cached)
- [ ] Fail fast when breaker is open
- [ ] Return partial results + explicit stop reason
- [ ] Budget caps (time/tool calls/spend) still apply
- [ ] Alerting on degraded runs vs normal runs
Safe default config snippet (YAML)

```yaml
degrade_mode:
  enabled: true
  disable_tools_when_degraded: true
  allow_partial_results: true
health:
  breaker_open_means_degraded: true
budgets:
  max_seconds: 60
  max_tool_calls: 12
```
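Wiring the budget caps into the loop might look like the sketch below; `run_with_budgets` and the `step` callable are illustrative names, and `CONFIG` just mirrors the YAML above:

```python
import time
from typing import Any, Callable

CONFIG: dict[str, Any] = {"budgets": {"max_seconds": 60, "max_tool_calls": 12}}

def run_with_budgets(step: Callable[[], tuple[bool, Any]],
                     config: dict[str, Any] = CONFIG) -> dict[str, Any]:
    """Stop the loop with an explicit reason when any budget is exhausted."""
    budgets = config["budgets"]
    start = time.monotonic()
    calls = 0
    while True:
        if time.monotonic() - start > budgets["max_seconds"]:
            return {"status": "partial", "stop_reason": "time budget exhausted"}
        if calls >= budgets["max_tool_calls"]:
            return {"status": "partial", "stop_reason": "tool-call budget exhausted"}
        done, _ = step()  # one plan/tool/observe iteration (stub)
        calls += 1
        if done:
            return {"status": "ok", "tool_calls": calls}
```

Note the budgets apply whether or not degrade mode is active; they are the backstop when the health signal is wrong.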
Related pages
- Foundations: What makes an agent production-ready · Why agents fail in production
- Failure: Cascading tool failures · Tool spam loops
- Governance: Tool permissions (allowlists)
- Production stack: Production agent stack