Normal path: execute → tool → observe.
One dependency goes flaky.
Your agent reacts by calling it more.
Now your dependency is more flaky.
Now your agent calls it even more.
That’s the whole story of cascading failures in agent systems: they amplify.
In production, the damage isn’t just “the agent failed”. It’s:
- rate limits hit for unrelated services
- queues back up
- on-call loses the ability to distinguish “real incidents” from “agent noise”
- and your agent becomes a load test nobody asked for
Quick take
- Agents are loops; retries without brakes turn partial tool failures into system-wide incidents.
- Put retries, breakers, and concurrency limits at the tool boundary (one choke point).
- Add safe-mode (partial results) so the agent stops thrashing when a dependency is degraded.
Why this fails in production
Agents are loops. Loops amplify feedback. That’s not AI. That’s control systems.
1) Naive retries
Retries are necessary. Retries without backoff/jitter are a thundering herd.
If 1,000 runs all retry a tool at the same time, you just created a second outage.
2) The agent retries and the tool retries
It’s common to have:
- HTTP client retry logic
- tool wrapper retry logic
- agent loop “try again” behavior
Multiply those and you get storms.
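The multiplication is easy to underestimate, because each outer "retry" replays the entire inner stack. A back-of-the-envelope sketch (the retry counts are illustrative):

```python
# Worst-case attempts per layer multiply: every outer retry replays
# the whole inner retry stack beneath it.
http_attempts = 1 + 2     # HTTP client: 1 try + 2 retries
wrapper_attempts = 1 + 2  # tool wrapper: 1 try + 2 retries
agent_attempts = 1 + 4    # agent loop "try again", capped here for the example

worst_case_calls = http_attempts * wrapper_attempts * agent_attempts
print(worst_case_calls)  # 45 upstream calls for ONE logical tool call
```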
3) No circuit breaker
When a tool is clearly degraded (timeouts, 5xx), you need to stop calling it for a cooling period.
Without a circuit breaker, you keep hitting a failing dependency and making it worse.
4) No bulkheads (concurrency limits)
If one tool is slow, you don’t want it to starve everything else. Per-tool concurrency limits prevent one dependency from consuming all workers.
5) No safe-mode / fallback
Sometimes the correct behavior is:
- return partial results
- stop early with a clear reason
- switch to cached / last-known-good data
Agents that “must succeed” tend to thrash.
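The fallback ladder above can be made concrete with a small wrapper: try live, then last-known-good, then an explicitly partial result. A sketch under assumed interfaces (`enrich_fn` and the dict-like cache are hypothetical stand-ins):

```python
def enrich_with_safe_mode(record, enrich_fn, cache):
    """Try live enrichment; degrade to cached data, then to a partial result.

    Sketch of a safe-mode ladder: the key property is that every branch
    returns something usable and labels how degraded it is.
    """
    try:
        return {**record, "enrichment": enrich_fn(record), "partial": False}
    except Exception:
        pass  # degraded dependency: fall through instead of thrashing
    cached = cache.get(record.get("id"))
    if cached is not None:
        # last-known-good data, clearly labeled as stale/partial
        return {**record, "enrichment": cached, "partial": True}
    # stop early with a clear reason and a usable partial result
    return {**record, "enrichment": None, "partial": True,
            "reason": "enrichment skipped: dependency degraded"}
```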
Implementation example (real code)
Below is a small circuit breaker + bulkhead pattern you can drop in front of a tool, first in Python, then in JavaScript.

```python
from dataclasses import dataclass
import time
from typing import Any, Callable


@dataclass
class Breaker:
    fail_threshold: int = 5
    open_for_s: int = 30
    failures: int = 0
    opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at > self.open_for_s:
            # half-open: reset and try again
            self.failures = 0
            self.opened_at = None
            return True
        return False

    def on_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def on_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.fail_threshold:
            self.opened_at = time.time()


class Bulkhead:
    def __init__(self, *, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def enter(self) -> None:
        if self.in_flight >= self.max_in_flight:
            raise RuntimeError("bulkhead full")
        self.in_flight += 1

    def exit(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)


def guarded_tool_call(
    fn: Callable[[dict[str, Any]], Any],
    *,
    breaker: Breaker,
    bulkhead: Bulkhead,
    args: dict[str, Any],
) -> Any:
    if not breaker.allow():
        raise RuntimeError("circuit open (fail fast)")
    bulkhead.enter()
    try:
        out = fn(args)
        breaker.on_success()
        return out
    except Exception:
        breaker.on_failure()
        raise
    finally:
        bulkhead.exit()
```

The same pattern in JavaScript:

```javascript
export class Breaker {
  constructor({ failThreshold = 5, openForS = 30 } = {}) {
    this.failThreshold = failThreshold;
    this.openForS = openForS;
    this.failures = 0;
    this.openedAt = null;
  }

  allow() {
    if (!this.openedAt) return true;
    const elapsedS = (Date.now() - this.openedAt) / 1000;
    if (elapsedS > this.openForS) {
      // half-open: reset and try again
      this.failures = 0;
      this.openedAt = null;
      return true;
    }
    return false;
  }

  onSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  onFailure() {
    this.failures += 1;
    if (this.failures >= this.failThreshold) this.openedAt = Date.now();
  }
}

export class Bulkhead {
  constructor({ maxInFlight = 10 } = {}) {
    this.maxInFlight = maxInFlight;
    this.inFlight = 0;
  }

  enter() {
    if (this.inFlight >= this.maxInFlight) throw new Error("bulkhead full");
    this.inFlight += 1;
  }

  exit() {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
}

export async function guardedToolCall(fn, { breaker, bulkhead, args }) {
  if (!breaker.allow()) throw new Error("circuit open (fail fast)");
  bulkhead.enter();
  try {
    const out = await fn(args);
    breaker.onSuccess();
    return out;
  } catch (e) {
    breaker.onFailure();
    throw e;
  } finally {
    bulkhead.exit();
  }
}
```

This is not “enterprise resilience”. It’s a seatbelt. Without it, agents turn flaky dependencies into system-wide incidents.
Example failure case (incident-style, numbers are illustrative)
We had an agent that called a vendor API for enrichment. The vendor started timing out intermittently.
Our system had:
- client retries (2)
- tool wrapper retries (2)
- agent loop “try again” behavior (effectively unlimited)
Impact:
- vendor API went from “flaky” to “down”
- our worker pool saturated
- p95 latency across unrelated endpoints increased by ~3x (example)
- on-call spent ~2 hours isolating the blast radius (example)
Fix:
- circuit breaker (fail fast for 30s after threshold)
- per-tool bulkhead concurrency limit
- retries only in one place, with backoff + jitter
- safe-mode: skip enrichment and return partial results
The agent didn’t cause the initial failure. It scaled it.
Trade-offs
- Failing fast lowers the measured “success rate” during partial outages, but it prevents full outages.
- Bulkheads can reject some requests under load. That’s preferable to global saturation.
- Safe-mode outputs are less complete. They keep the system alive.
When NOT to use
- If the tool is fully internal and already has robust SLOs, you may not need per-tool breakers (still keep budgets).
- If you can’t define safe-mode behavior, don’t run autonomous loops during outages.
- If you need strict completeness, use async workflows rather than synchronous agents.
Copy-paste checklist
- [ ] Timeouts on every tool call
- [ ] Retries in one place only (gateway), with backoff + jitter
- [ ] Circuit breaker per tool (fail fast)
- [ ] Bulkhead concurrency limits per tool
- [ ] Budgets per run (time/tool calls/spend)
- [ ] Safe-mode fallback (partial results)
- [ ] Alerting: breaker open rate, tool error rates, tool latency
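The "budgets per run" item can be as small as a deadline plus a counter charged on every tool call. A minimal sketch (class name and limits are illustrative):

```python
import time


class RunBudget:
    """Per-run budget: wall-clock deadline and tool-call cap.

    Sketch of the checklist item; a real version would also track spend.
    """

    def __init__(self, *, max_seconds=60.0, max_tool_calls=25):
        self.deadline = time.monotonic() + max_seconds
        self.tool_calls_left = max_tool_calls

    def charge_tool_call(self) -> None:
        # Called once per tool call; raising here ends the run early
        # with a clear reason instead of letting the loop thrash.
        if time.monotonic() > self.deadline:
            raise RuntimeError("budget exceeded: run deadline")
        if self.tool_calls_left <= 0:
            raise RuntimeError("budget exceeded: tool-call cap")
        self.tool_calls_left -= 1
```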
Safe default config snippet (YAML)

```yaml
tools:
  timeouts_s: { default: 10 }
  retries: { max_attempts: 2, backoff_ms: [250, 750], jitter: true }
  circuit_breaker:
    fail_threshold: 5
    open_for_s: 30
  bulkhead:
    max_in_flight: 10
safe_mode:
  enabled: true
  allow_partial: true
```
Related pages
- Foundations: How agents use tools · What makes an agent production-ready
- Failure: Partial outage handling · Tool spam loops
- Governance: Tool permissions (allowlists)
- Production stack: Production agent stack