Deadlocks in Multi-Agent Systems (Failure Mode + Fixes + Code)

  • Spot the failure early before the bill climbs.
  • Learn what breaks in production and why.
  • Copy guardrails: budgets, stop reasons, validation.
  • Know when this isn’t the real root cause.
Detection signals
  • Tool calls per run spike (or repeat with the same args hash).
  • Spend or tokens per request climb without better outputs.
  • Retries shift from rare to constant (429/5xx).
Agents waiting on agents is distributed deadlock with nicer logs. Here’s how deadlocks happen in production and how leases, timeouts, and orchestration prevent them.
On this page
  1. Problem-first intro
  2. Quick take
  3. Why this fails in production
  4. 1) Circular dependencies are easy to create
  5. 2) No timeouts on “waiting”
  6. 3) Shared resources without leases
  7. 4) “Ask another agent” becomes a retry loop
  8. 5) The fix is orchestration, not more prompting
  9. Implementation example (real code)
  10. Example failure case (incident-style, numbers are illustrative)
  11. Trade-offs
  12. When NOT to use
  13. Copy-paste checklist
  14. Safe default config snippet (YAML)
  15. FAQ

Problem-first intro

You build a multi-agent setup:

  • “research agent”
  • “planner agent”
  • “executor agent”
  • “reviewer agent”

It looks great in diagrams.

Then in production, one request gets stuck forever because:

  • Agent A is waiting on Agent B’s output
  • Agent B is waiting on Agent C’s approval
  • Agent C is waiting on Agent A’s context

Nobody is “wrong”. They’re just waiting.

That’s a deadlock.

Multi-agent deadlocks are painful because they don’t crash. They hang. And hangs burn budgets quietly.

Quick take

  • Multi-agent deadlocks are distributed-systems deadlocks with nicer logs: cycles + waiting + missing timeouts.
  • Fix with orchestration: one owner of state transitions, leases/TTLs for shared resources, and explicit stop reasons.
  • Add wait timeouts everywhere; “waiting without a timeout” is just a paid sleep.

Why this fails in production

Multi-agent systems inherit every failure mode of distributed systems, plus LLM ambiguity.

1) Circular dependencies are easy to create

It’s tempting to define responsibilities like:

  • “planner” asks “researcher”
  • “researcher” asks “reviewer”
  • “reviewer” asks “planner”

Congratulations, you built a cycle.

2) No timeouts on “waiting”

Many systems add timeouts to HTTP calls, but not to “agent messages”. So the agent waits forever while your worker stays busy.
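A timed wait is one line in most runtimes. One concrete shape, if agents exchange messages over asyncio queues (the queue wiring, timeout value, and stop-reason dict are illustrative):

```python
import asyncio


async def ask_with_timeout(inbox: asyncio.Queue, timeout_s: float = 30.0):
    """Wait for another agent's reply, but never forever."""
    try:
        return await asyncio.wait_for(inbox.get(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Surface an explicit stop reason instead of hanging the worker.
        return {"status": "blocked", "stop_reason": "wait_timeout"}


async def main() -> dict:
    inbox: asyncio.Queue = asyncio.Queue()  # nobody ever replies
    return await ask_with_timeout(inbox, timeout_s=0.05)


result = asyncio.run(main())
```

The key property: the worker thread is released and the run ends in a labeled state, not a silent hang.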

3) Shared resources without leases

If agents share:

  • a ticket
  • a document
  • a lock

…and you don’t use leases/TTLs, a crash can leave the system permanently blocked.

4) “Ask another agent” becomes a retry loop

When an agent is uncertain, a common behavior is:

  • ask another agent
  • ask again if no response
  • ask a third agent

That turns deadlock into tool spam.
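A cheap guard is to dedupe asks by (target agent, arguments hash) and cap retries. Sketch (class name and limits are illustrative):

```python
import hashlib
import json


class AskGuard:
    """Refuse repeated identical asks once a retry budget is spent."""

    def __init__(self, max_retries: int = 2) -> None:
        self.max_retries = max_retries
        self.counts: dict[str, int] = {}

    def allow(self, target: str, args: dict) -> bool:
        # Hash canonicalized args so "same question, same agent" is detectable.
        args_hash = hashlib.sha256(
            json.dumps(args, sort_keys=True).encode()
        ).hexdigest()
        key = f"{target}:{args_hash}"
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= 1 + self.max_retries


guard = AskGuard(max_retries=2)
```

When `allow` returns False, emit a stop reason instead of asking a fourth time.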

5) The fix is orchestration, not more prompting

You won’t “prompt” your way out of deadlocks. You need:

  • a single orchestrator (or at least a leader)
  • explicit state machine transitions
  • timeouts and leases
  • and a stop reason when the system can’t progress
Diagram: orchestrator + leases (avoid unowned waiting)

Implementation example (real code)

This is a minimal “lease lock” pattern for shared work. The idea:

  • an agent acquires a lease for resource_id
  • if it crashes, the lease expires
  • the orchestrator can recover and reassign
PYTHON
from dataclasses import dataclass
import time


@dataclass
class Lease:
  owner: str
  expires_at: float


class LeaseLock:
  def __init__(self) -> None:
      self._leases: dict[str, Lease] = {}

  def try_acquire(self, *, resource_id: str, owner: str, ttl_s: int) -> bool:
      now = time.time()
      lease = self._leases.get(resource_id)
      if lease and lease.expires_at > now and lease.owner != owner:
          return False
      self._leases[resource_id] = Lease(owner=owner, expires_at=now + ttl_s)
      return True

  def release(self, *, resource_id: str, owner: str) -> None:
      lease = self._leases.get(resource_id)
      if lease and lease.owner == owner:
          del self._leases[resource_id]


def run_work(orchestrator_id: str, resource_id: str, lock: LeaseLock) -> str:
  if not lock.try_acquire(resource_id=resource_id, owner=orchestrator_id, ttl_s=30):
      return "blocked: lease held"

  try:
      # orchestrate agents here (pseudo)
      return orchestrate(resource_id)  # (pseudo)
  finally:
      lock.release(resource_id=resource_id, owner=orchestrator_id)
JAVASCRIPT
export class LeaseLock {
  constructor() {
    this.leases = new Map(); // resourceId -> { owner, expiresAtMs }
  }

  tryAcquire({ resourceId, owner, ttlS }) {
    const now = Date.now();
    const lease = this.leases.get(resourceId);
    if (lease && lease.expiresAtMs > now && lease.owner !== owner) return false;
    this.leases.set(resourceId, { owner, expiresAtMs: now + ttlS * 1000 });
    return true;
  }

  release({ resourceId, owner }) {
    const lease = this.leases.get(resourceId);
    if (lease && lease.owner === owner) this.leases.delete(resourceId);
  }
}

This doesn’t solve every deadlock (cycles are still cycles), but it prevents the worst kind: “the system is stuck because an agent died holding the lock”.
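A usage sketch of the recovery property, with the Python class repeated so the snippet runs standalone (TTLs are shortened to fractions of a second purely for the example):

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseLock:  # same structure as above, repeated to be self-contained
    def __init__(self) -> None:
        self._leases: dict[str, Lease] = {}

    def try_acquire(self, *, resource_id: str, owner: str, ttl_s: float) -> bool:
        now = time.time()
        lease = self._leases.get(resource_id)
        if lease and lease.expires_at > now and lease.owner != owner:
            return False
        self._leases[resource_id] = Lease(owner=owner, expires_at=now + ttl_s)
        return True


lock = LeaseLock()

# "a" takes the lease, then crashes without releasing it.
got_a = lock.try_acquire(resource_id="ticket-1", owner="a", ttl_s=0.2)
got_b = lock.try_acquire(resource_id="ticket-1", owner="b", ttl_s=0.2)  # blocked

time.sleep(0.3)  # the dead owner's lease expires on its own

got_b_retry = lock.try_acquire(resource_id="ticket-1", owner="b", ttl_s=0.2)
```

No human has to clear state: expiry alone makes the resource reassignable.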

Also: put timeouts on “agent waits”. A wait without a timeout is a sleep you pay for.

Example failure case (incident-style, numbers are illustrative)

We had a multi-agent “incident triage” flow:

  • Agent A collected signals
  • Agent B wrote a hypothesis
  • Agent C validated against a runbook

When the runbook tool degraded, Agent C waited for a response. Agent B waited on Agent C. Agent A waited on Agent B.

Impact:

  • 43 runs stuck in “waiting” state
  • workers saturated and new requests queued
  • on-call burned ~2 hours manually canceling runs and clearing state (example)

Fix:

  1. timeouts on inter-agent waits
  2. orchestrator-owned leases per incident id
  3. stop reasons: “blocked waiting for tool” vs “blocked waiting for approval”
  4. fallback: single-agent mode when dependencies are degraded
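The fallback in step 4 can be a simple gate in front of the run: if a hard dependency is degraded, route to single-agent mode instead of spawning the full graph. Sketch (the health-check names are illustrative):

```python
def choose_mode(dependency_health: dict[str, bool]) -> str:
    """Pick single-agent mode when any hard dependency is degraded."""
    degraded = [name for name, ok in dependency_health.items() if not ok]
    if degraded:
        # Lower quality, but the run terminates instead of deadlocking.
        return "single_agent"
    return "multi_agent"


mode = choose_mode({"runbook_tool": False, "search": True})
```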

Multi-agent makes coordination your problem. You don’t get to outsource it to the LLM.

Trade-offs

  • Orchestration code is work. It’s cheaper than deadlocks.
  • Leases can expire mid-work; you need idempotency and replay.
  • Single-agent fallback reduces quality but improves liveness.
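On the second trade-off: when a lease expires mid-work and the task is replayed, writes must be idempotent. A minimal idempotency-key sketch (in-memory store and key format are illustrative; production would use a durable store):

```python
class IdempotentWriter:
    """Apply each (run_id, step) write at most once, even across replays."""

    def __init__(self) -> None:
        self._applied: dict[str, str] = {}

    def write(self, *, idempotency_key: str, payload: str) -> bool:
        if idempotency_key in self._applied:
            return False  # replay: already applied, skip silently
        self._applied[idempotency_key] = payload
        return True


writer = IdempotentWriter()
first = writer.write(idempotency_key="run-42:step-3", payload="create ticket")
replay = writer.write(idempotency_key="run-42:step-3", payload="create ticket")
```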

When NOT to use

  • If the task is small, multi-agent is unnecessary overhead.
  • If you can’t build orchestration and observability, don’t ship multi-agent in prod.
  • If you need strict ordering and consistency, use workflows with explicit state machines.

Copy-paste checklist

  • [ ] Avoid circular dependencies between agents (write it down as a graph)
  • [ ] Add timeouts to “waiting” states
  • [ ] Use leases/TTLs for shared resources
  • [ ] One orchestrator owns state transitions
  • [ ] Idempotency keys for any writes
  • [ ] Stop reasons for blocked states + alerting
  • [ ] Fallback mode when dependencies degrade

Safe default config snippet (YAML)

YAML
multi_agent:
  orchestrator: "single_owner"
  wait_timeouts_s: { default: 30 }
  leases:
    ttl_s: 30
    renew: true
fallback:
  enabled: true
  mode: "single_agent"

FAQ

Are multi-agent systems always a bad idea?
No. They can help for complex tasks, but they add coordination and failure modes. Plan for orchestration.
Do leases fix deadlocks?
They fix lock-based deadlocks caused by crashes. They don’t fix logical cycles — avoid cycles with explicit design.
What’s the simplest prevention?
A single orchestrator + timeouts on waits. Without timeouts, ‘waiting’ becomes ‘stuck’.
How do I debug deadlocks?
Log state transitions with run_id and a dependency graph. If you can’t draw the wait chain, you’re guessing.

⏱️ 6 min read · Updated Mar 2026 · Difficulty: ★★☆
Implement in OnceOnly
Guardrails for loops, retries, and spend escalation.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Kill switch & incident stop
  • Audit logs & traceability
  • Idempotency & dedupe
  • Tool permissions (allowlist / blocklist)
OnceOnly is a control layer for production agent systems.
Example policy (concept)
# Example (Python — conceptual)
policy = {
  "budgets": {"steps": 20, "seconds": 60, "usd": 1.0},
  "controls": {"kill_switch": True, "audit": True},
}
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.