You build a multi-agent setup:
- “research agent”
- “planner agent”
- “executor agent”
- “reviewer agent”
It looks great in diagrams.
Then in production, one request gets stuck forever because:
- Agent A is waiting on Agent B’s output
- Agent B is waiting on Agent C’s approval
- Agent C is waiting on Agent A’s context
Nobody is “wrong”. They’re just waiting.
That’s a deadlock.
Multi-agent deadlocks are painful because they don’t crash. They hang. And hangs burn budgets quietly.
Quick take
- Multi-agent deadlocks are distributed-systems deadlocks with nicer logs: cycles + waiting + missing timeouts.
- Fix with orchestration: one owner of state transitions, leases/TTLs for shared resources, and explicit stop reasons.
- Add wait timeouts everywhere; “waiting without a timeout” is just a paid sleep.
Why this fails in production
Multi-agent systems inherit every failure mode of distributed systems, plus LLM ambiguity.
1) Circular dependencies are easy to create
It’s tempting to define responsibilities like:
- “planner” asks “researcher”
- “researcher” asks “reviewer”
- “reviewer” asks “planner”
Congratulations, you built a cycle.
2) No timeouts on “waiting”
Many systems add timeouts to HTTP calls, but not to “agent messages”. So the agent waits forever while your worker stays busy.
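A minimal sketch of a bounded inter-agent wait, assuming asyncio; `send_and_await` is a hypothetical transport call, simulated here with a slow sleep:

```python
import asyncio


async def send_and_await(agent: str, message: str) -> str:
    # Stand-in for a real message-bus call; simulates an agent that never answers.
    await asyncio.sleep(60)
    return f"{agent}: done"


async def ask_agent(agent: str, message: str, timeout_s: float = 30.0) -> str:
    """Every inter-agent wait gets a deadline and an explicit stop reason."""
    try:
        return await asyncio.wait_for(send_and_await(agent, message), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Surface a stop reason instead of hanging the worker.
        return f"stop: timed out waiting for {agent} after {timeout_s}s"


# Short timeout so the demo returns quickly.
print(asyncio.run(ask_agent("reviewer", "approve?", timeout_s=0.1)))
```

The point is the shape, not the numbers: the wait returns a stop reason the orchestrator can act on, instead of parking the worker forever.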
3) Shared resources without leases
If agents share:
- a ticket
- a document
- a lock
…and you don’t use leases/TTLs, a crash can leave the system permanently blocked.
4) “Ask another agent” becomes a retry loop
When an agent is uncertain, a common behavior is:
- ask another agent
- ask again if no response
- ask a third agent
That turns deadlock into tool spam.
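One way to cap this is a per-run escalation budget: after N asks, the agent must stop and report instead of asking again. A hypothetical sketch:

```python
class EscalationBudget:
    """Cap how many times a run may re-ask other agents before it must stop."""

    def __init__(self, max_asks: int = 3) -> None:
        self.max_asks = max_asks
        self.asks = 0

    def may_ask(self) -> bool:
        # Each permitted ask consumes budget; once exhausted, always refuse.
        if self.asks >= self.max_asks:
            return False
        self.asks += 1
        return True


budget = EscalationBudget(max_asks=2)
print([budget.may_ask() for _ in range(4)])  # [True, True, False, False]
```

When `may_ask()` returns False, the run should transition to a blocked state with a stop reason, not retry.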
5) The fix is orchestration, not more prompting
You won’t “prompt” your way out of deadlocks. You need:
- a single orchestrator (or at least a leader)
- explicit state machine transitions
- timeouts and leases
- and a stop reason when the system can’t progress
Implementation example (real code)
This is a minimal “lease lock” pattern for shared work. The idea:
- an agent acquires a lease for resource_id
- if it crashes, the lease expires
- the orchestrator can recover and reassign
```python
from dataclasses import dataclass
import time


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseLock:
    def __init__(self) -> None:
        self._leases: dict[str, Lease] = {}

    def try_acquire(self, *, resource_id: str, owner: str, ttl_s: int) -> bool:
        now = time.time()
        lease = self._leases.get(resource_id)
        # Block only if someone *else* holds an unexpired lease.
        if lease and lease.expires_at > now and lease.owner != owner:
            return False
        self._leases[resource_id] = Lease(owner=owner, expires_at=now + ttl_s)
        return True

    def release(self, *, resource_id: str, owner: str) -> None:
        lease = self._leases.get(resource_id)
        if lease and lease.owner == owner:
            del self._leases[resource_id]


def run_work(orchestrator_id: str, resource_id: str, lock: LeaseLock) -> str:
    if not lock.try_acquire(resource_id=resource_id, owner=orchestrator_id, ttl_s=30):
        return "blocked: lease held"
    try:
        # orchestrate agents here (pseudo)
        return orchestrate(resource_id)  # (pseudo)
    finally:
        lock.release(resource_id=resource_id, owner=orchestrator_id)
```

The same lease lock in JavaScript:

```javascript
export class LeaseLock {
  constructor() {
    this.leases = new Map(); // resourceId -> { owner, expiresAtMs }
  }

  tryAcquire({ resourceId, owner, ttlS }) {
    const now = Date.now();
    const lease = this.leases.get(resourceId);
    if (lease && lease.expiresAtMs > now && lease.owner !== owner) return false;
    this.leases.set(resourceId, { owner, expiresAtMs: now + ttlS * 1000 });
    return true;
  }

  release({ resourceId, owner }) {
    const lease = this.leases.get(resourceId);
    if (lease && lease.owner === owner) this.leases.delete(resourceId);
  }
}
```

This doesn’t solve every deadlock (cycles are still cycles), but it prevents the worst kind: “the system is stuck because an agent died holding the lock”.
Also: put timeouts on “agent waits”. A wait without a timeout is a sleep you pay for.
Example failure case (incident-style, numbers are illustrative)
We had a multi-agent “incident triage” flow:
- Agent A collected signals
- Agent B wrote a hypothesis
- Agent C validated against a runbook
When the runbook tool degraded, Agent C waited for a response. Agent B waited on Agent C. Agent A waited on Agent B.
Impact:
- 43 runs stuck in “waiting” state
- workers saturated and new requests queued
- on-call burned ~2 hours manually canceling runs and clearing state (example)
Fix:
- timeouts on inter-agent waits
- orchestrator-owned leases per incident id
- stop reasons: “blocked waiting for tool” vs “blocked waiting for approval”
- fallback: single-agent mode when dependencies are degraded
Multi-agent makes coordination your problem. You don’t get to outsource it to the LLM.
Trade-offs
- Orchestration code is work. It’s cheaper than deadlocks.
- Leases can expire mid-work; you need idempotency and replay.
- Single-agent fallback reduces quality but improves liveness.
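Because a lease can expire mid-work and the work may replay, every write needs an idempotency key. A minimal in-memory sketch (a real system would persist these):

```python
class IdempotentWriter:
    """Replay-safe writes: the same idempotency key applies at most once."""

    def __init__(self) -> None:
        self._applied: dict[str, str] = {}

    def write(self, key: str, payload: str) -> bool:
        """Return True if applied now, False if this key was already applied."""
        if key in self._applied:
            return False
        self._applied[key] = payload
        return True


w = IdempotentWriter()
print(w.write("incident-123:step-2", "update ticket"))  # True
print(w.write("incident-123:step-2", "update ticket"))  # False (replay after lease expiry)
```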
When NOT to use
- If the task is small, multi-agent is unnecessary overhead.
- If you can’t build orchestration and observability, don’t ship multi-agent in prod.
- If you need strict ordering and consistency, use workflows with explicit state machines.
Copy-paste checklist
- [ ] Avoid circular dependencies between agents (write it down as a graph)
- [ ] Add timeouts to “waiting” states
- [ ] Use leases/TTLs for shared resources
- [ ] One orchestrator owns state transitions
- [ ] Idempotency keys for any writes
- [ ] Stop reasons for blocked states + alerting
- [ ] Fallback mode when dependencies degrade
Safe default config snippet (YAML)
```yaml
multi_agent:
  orchestrator: "single_owner"
  wait_timeouts_s: { default: 30 }
  leases:
    ttl_s: 30
    renew: true
  fallback:
    enabled: true
    mode: "single_agent"
```
Related pages
- Foundations: Planning vs reactive agents · Why agents fail in production
- Failure: Partial outage handling · Tool spam loops
- Governance: Tool permissions (allowlists)
- Production stack: Production agent stack