AI Agent Infinite Loop (How to Detect + Fix, With Code)

  • Spot the failure early before the bill climbs.
  • Learn what breaks in production and why.
  • Copy guardrails: budgets, stop reasons, validation.
  • Know when this isn’t the real root cause.
Detection signals
  • Tool calls per run spike (or repeat with the same args hash).
  • Spend or tokens per request climbs without better outputs.
  • Retries shift from rare to constant (429/5xx).
Your agent is looping. It’s 03:00. The bill is climbing. Here’s what causes loops, what breaks, and the kill-switches we actually use.
On this page
  1. Quick take
  2. The problem
  3. Why this happens in real systems
  4. What breaks if you ignore it
  5. Code: loop detection + budgets (the parts you’ll be glad you wrote)
  6. Loops aren’t one thing (a quick taxonomy)
      • 1) Hard loop (same action, same args)
      • 2) Soft loop (same intent, tiny arg changes)
      • 3) Semantic loop (progress illusion)
  7. Detection signals we actually use
  8. Progress budgets (the guard most teams forget)
  9. The kill switch (don’t make it a code deploy)
  10. Cancellation (because users leave, but your agent keeps spending)
  11. What to show the user when you stop
  12. A production-shaped loop guard (TypeScript sketch)
  13. Example incident (numbers are illustrative)
  14. Fix checklist (the boring moves that stop the bleeding)
      • 1) Add a hard global budget
      • 2) Add per-tool budgets
      • 3) Add repeat detection (args hash)
      • 4) Classify errors (retryable vs fatal)
      • 5) Add a kill switch you can use *while sleepy*
      • 6) Return partial results instead of “stopped”
      • 7) Measure “loop rate”
  15. The expensive lesson: loops are an ops problem
  16. Picking budget numbers (so you don’t guess forever)
  17. Canonicalize args (or your loop guard won’t catch anything)
  18. Why people do this wrong
  19. Trade-offs
  20. When NOT to use “just add more retries”
  21. Link it up
Interactive flow (step 1/2: Execution). Normal path: execute → tool → observe.

Quick take

  • Make loops cheap to stop: max steps, max seconds, max tool calls.
  • Detect repeats (tool + canonical args) and “no progress” streaks.
  • Classify errors (retryable vs fatal) and don’t double-retry (agent + tool).
  • Add an operator kill switch and propagate cancellation (client disconnect).

The problem

Your agent is looping forever.

Maybe it already cost you $200. Maybe it’s also hammering a third-party API and getting you rate-limited.

Either way, it’s not “thinking”. It’s stuck.

Diagram: loop guard (what stops infinite runs).

Why this happens in real systems

Loops are usually one of these:

  • the tool is flaky (timeouts/429s) and the model keeps retrying
  • the prompt rewards “try again” instead of “stop”
  • the agent has no explicit stop condition (“keep going until solved”)
  • state isn’t persisted, so it repeats the same step after every error
  • your “planner” emits the same action because the observation is useless

What breaks if you ignore it

  • Cost and latency explode.
  • External systems throttle you.
  • You lose customer trust because the UI spins forever.
  • Your incident response gets harder because logs are unstructured.

Code: loop detection + budgets (the parts you’ll be glad you wrote)

PYTHON
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Budget:
  max_steps: int = 25
  max_seconds: int = 60


class KillSwitch(RuntimeError):
  pass


class LoopDetected(RuntimeError):
  pass


class Monitor:
  def __init__(self, *, max_repeat: int = 3):
      self.max_repeat = max_repeat
      self.counts: dict[str, int] = {}

  def mark(self, key: str) -> None:
      self.counts[key] = self.counts.get(key, 0) + 1
      if self.counts[key] >= self.max_repeat:
          raise LoopDetected(f"loop: repeated {self.max_repeat}x: {key}")


def run_with_limits(task: str, *, budget: Budget, kill_switch, tools) -> str:
  started = time.time()
  monitor = Monitor(max_repeat=3)

  for step in range(budget.max_steps):
      if kill_switch.is_on():  # (pseudo)
          raise KillSwitch("killed by operator")
      if time.time() - started > budget.max_seconds:
          return "stopped: time budget exceeded"

      action = decide_next_action(task)  # (pseudo)
      monitor.mark(f"{action.name}:{action.args}")

      obs = tools.call(action.name, args=action.args)  # must be safe
      task = update_state(task, action, obs)  # (pseudo)

  return "stopped: step budget exceeded"
JAVASCRIPT
export class KillSwitch extends Error {}
export class LoopDetected extends Error {}

export class Monitor {
  constructor({ maxRepeat = 3 } = {}) {
    this.maxRepeat = maxRepeat;
    this.counts = new Map();
  }

  mark(key) {
    const next = (this.counts.get(key) || 0) + 1;
    this.counts.set(key, next);
    if (next >= this.maxRepeat) {
      throw new LoopDetected("loop: repeated " + this.maxRepeat + "x: " + key);
    }
  }
}

export async function runWithLimits(task, { budget, killSwitch, tools }) {
  const started = Date.now();
  const monitor = new Monitor({ maxRepeat: 3 });

  for (let step = 0; step < budget.max_steps; step++) {
    if (killSwitch && killSwitch.isOn && killSwitch.isOn()) {
      throw new KillSwitch("killed by operator");
    }
    if ((Date.now() - started) / 1000 > budget.max_seconds) {
      return "stopped: time budget exceeded";
    }

    const action = await decideNextAction(task); // (pseudo)
    monitor.mark(String(action.name) + ":" + JSON.stringify(action.args));

    const obs = await tools.call(action.name, { args: action.args }); // must be safe
    task = updateState(task, action, obs); // (pseudo)
  }

  return "stopped: step budget exceeded";
}

Loops aren’t one thing (a quick taxonomy)

When someone says “the agent is looping”, they usually mean one of these:

1) Hard loop (same action, same args)

Example:

  • web.search(q="X") → timeout
  • web.search(q="X") → timeout
  • repeat until you go broke

This is the easiest to detect and the easiest to fix: hash the (tool, args) and stop after N repeats.

2) Soft loop (same intent, tiny arg changes)

Example:

  • web.search(q="X")
  • web.search(q="X site:foo.com")
  • web.search(q="X foo")
  • web.search(q="X latest")

The model is “trying different things”, but it’s not making progress.

Fix:

  • budget unique searches
  • detect “no new observations” for N steps
  • force it to write a partial answer or escalate

3) Semantic loop (progress illusion)

Example:

  • it keeps “summarizing” the same content in different words
  • it keeps asking the same question to the user
  • it keeps “planning” without acting

This one is insidious because logs can look busy while nothing changes.

Fix:

  • measure progress explicitly (new facts extracted, new URLs, new artifacts)
  • stop when the progress metric stays flat

Detection signals we actually use

There’s no single magic detector. You want a handful of cheap signals:

  1. Repeat signature: same tool+args hash N times
  2. No new artifacts: no new notes/URLs/tickets in N steps
  3. Cost slope: cost per step climbing with no gain
  4. Time slope: wall time climbing with no gain
  5. External pressure: 429s / throttling from dependencies

You can implement 80% of this with counters and hashes.

Progress budgets (the guard most teams forget)

Step budgets are blunt. They stop the bleeding, but they don’t tell you why the agent is stuck.

Progress budgets are what we use to keep “busy loops” from burning money:

  • if we’re not learning anything new, we stop
  • if we’re not producing new artifacts, we stop
  • if the tool layer keeps returning the same shaped error, we stop

Progress depends on the section:

  • Research agent: new unique URLs / citations, not “more summaries”
  • Support agent: state transitions (triage → reproduce → propose fix), not “ask the user again”
  • Browser tool: new DOM targets / extracted fields, not “scroll more”

Here’s a simple version:

PYTHON
from typing import Any


class LoopDetected(RuntimeError):
  pass


class Progress:
  def __init__(self, *, max_flat_steps: int = 5):
      self.max_flat_steps = max_flat_steps
      self.seen_urls: set[str] = set()
      self.flat_steps = 0

  def update(self, observation: dict[str, Any]) -> None:
      urls = set(observation.get("urls", []))
      new = urls - self.seen_urls
      if new:
          self.seen_urls |= new
          self.flat_steps = 0
          return

      self.flat_steps += 1
      if self.flat_steps >= self.max_flat_steps:
          raise LoopDetected("no progress: no new urls/artifacts")
JAVASCRIPT
export class LoopDetected extends Error {}

export class Progress {
  constructor({ maxFlatSteps = 5 } = {}) {
    this.maxFlatSteps = maxFlatSteps;
    this.seenUrls = new Set();
    this.flatSteps = 0;
  }

  update(observation) {
    const urls = new Set((observation && observation.urls) || []);
    let hasNew = false;
    for (const u of urls) {
      if (!this.seenUrls.has(u)) {
        this.seenUrls.add(u);
        hasNew = true;
      }
    }

    if (hasNew) {
      this.flatSteps = 0;
      return;
    }

    this.flatSteps += 1;
    if (this.flatSteps >= this.maxFlatSteps) {
      throw new LoopDetected("no progress: no new urls/artifacts");
    }
  }
}

Is it perfect? No. Does it catch “search the same thing forever” loops? Absolutely.

The biggest mindset shift: treat “no progress” as a valid stop condition.

The kill switch (don’t make it a code deploy)

Every production agent needs an operator stop:

  • one click in a dashboard
  • stops current runs
  • optionally blocks new runs for a route/tenant/tool

If your kill switch requires a deploy, it’s not a kill switch. It’s a wish.

Practical add-ons:

  • route-level circuit breaker (“disable browser tool for tenant X”)
  • tool-level circuit breaker (“disable browser.get globally”)
  • spend breaker (“stop runs if $/min crosses threshold”)
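
A spend breaker is mostly bookkeeping: track recent spend in a sliding window and trip when the rate crosses a threshold. A minimal sketch; the class name, threshold, and window here are illustrative, not from any framework:

```python
import time


class SpendBreaker:
    """Trips when spend inside a sliding window crosses a dollar threshold."""

    def __init__(self, max_usd_per_min: float = 1.0, window_s: int = 60):
        self.max_usd_per_min = max_usd_per_min
        self.window_s = window_s
        self.events: list[tuple[float, float]] = []  # (timestamp, usd)

    def record(self, usd: float) -> None:
        now = time.time()
        self.events.append((now, usd))
        # drop events that have aged out of the window
        self.events = [(t, c) for t, c in self.events if now - t <= self.window_s]

    def tripped(self) -> bool:
        return sum(c for _, c in self.events) > self.max_usd_per_min
```

In production you would feed `record()` from billing or token-usage events and check `tripped()` in the same place you check the kill switch.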

Cancellation (because users leave, but your agent keeps spending)

One of the dumbest ways to lose money: the user closes the tab, but the agent keeps running.

If you don’t wire cancellation through your stack, your system will happily:

  • keep calling tools
  • keep paying token costs
  • and then throw the result away because nobody’s listening

At minimum:

  • cancel on client disconnect
  • propagate the signal into model calls and tool calls
  • log stop_reason = client_cancel (so you can see it)
TS
export async function handler(req: Request) {
  const controller = new AbortController();
  // pseudo: when the client disconnects, abort
  onClientDisconnect(req, () => controller.abort());

  return runAgent("...", { signal: controller.signal });
}

This isn’t “nice to have”. It’s cost control.

What to show the user when you stop

If you just return “stopped”, users will hit refresh and you’ll loop again.

Return something actionable:

  • stop reason (time/steps/loop detected/policy denied)
  • what was tried (top 5 tool calls)
  • partial output (notes, URLs, drafts)
  • next action (“try again later”, “need human approval”, “tool down”)

This reduces repeated runs more than any prompt tweak.
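
As a sketch, the stop payload can be a small structured result that the UI renders; the field names here are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class StopResult:
    stop_reason: str   # e.g. "time_budget" | "step_budget" | "loop_detected" | "policy_denied"
    tried: list[str]   # tool-call signatures, most recent first
    partial: dict      # notes, URLs, drafts collected so far
    next_action: str   # e.g. "try again later", "needs human approval"


def render_stop(result: StopResult) -> str:
    # Human-readable summary: reason, top 5 attempts, suggested next step.
    lines = [f"Stopped: {result.stop_reason}"]
    lines += [f"- tried: {sig}" for sig in result.tried[:5]]
    lines.append(f"Next: {result.next_action}")
    return "\n".join(lines)
```

The same structure logs cleanly, so incident review and the user-facing message come from one source of truth.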

A production-shaped loop guard (TypeScript sketch)

TS
type ToolCall = { tool: string; argsHash: string; ms: number; ok: boolean };

export class LoopGuard {
  private counts = new Map<string, number>();
  private recent: ToolCall[] = [];

  constructor(private maxRepeat: number) {}

  record(call: ToolCall) {
    const key = `${call.tool}:${call.argsHash}`;
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    this.recent.push(call);
    if (this.recent.length > 50) this.recent.shift();

    if (next >= this.maxRepeat) {
      throw new Error(`loop detected: ${key} repeated ${next}x`);
    }
  }
}

This is not “AI magic”. It’s just the minimum maturity to run a loop with tools.

Example incident (numbers are illustrative)

One team shipped an agent that tried to “log in and retry”. Login started returning 401s because a cookie format changed.

The agent:

  • kept retrying login every ~2 seconds
  • ran for ~30 minutes (no budget)
  • triggered account lockouts
  • burned ~$70 in browser/tool credits across a few dozen runs

Fix:

  • step/time budgets
  • loop detection on repeated tool args
  • treat auth errors as fatal (don’t retry forever)

Fix checklist (the boring moves that stop the bleeding)

When you’re in an incident, you don’t want theory. You want knobs.

Here’s the checklist we use to stop loops quickly:

1) Add a hard global budget

  • max steps (e.g., 25)
  • max seconds (e.g., 60)

If you don’t know what numbers to pick:

  • start conservative (lower)
  • measure completion rate vs cost
  • raise carefully

2) Add per-tool budgets

Global budgets stop the run. Per-tool budgets stop the runaway dependency.

Examples:

  • browser.get: max 6 calls
  • web.search: max 3 calls
  • db.read: max 10 calls

Why? Because the most common loop is “one flaky tool ruins the run”.
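
A per-tool cap is just a counter per tool name, checked before each call. A minimal sketch; the limits are illustrative:

```python
class ToolBudgetExceeded(RuntimeError):
    pass


class ToolBudget:
    def __init__(self, limits: dict[str, int]):
        self.limits = limits              # e.g. {"browser.get": 6, "web.search": 3}
        self.used: dict[str, int] = {}

    def spend(self, tool: str) -> None:
        # Count the call, then fail if this tool went over its cap.
        self.used[tool] = self.used.get(tool, 0) + 1
        limit = self.limits.get(tool)
        if limit is not None and self.used[tool] > limit:
            raise ToolBudgetExceeded(f"{tool}: {self.used[tool]} calls > max {limit}")
```

Call `spend(action.name)` right before each tool call, next to the global step and time checks.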

3) Add repeat detection (args hash)

Stop after N repeats of the same signature:

  • tool name + canonical args

This catches the obvious loops fast.

4) Classify errors (retryable vs fatal)

Most teams treat “error” as “try again”.

In production:

  • 429 is not a suggestion. It’s a stop signal.
  • auth errors (401/403) should usually be fatal
  • validation errors should be fatal (fix inputs, don’t brute force)
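
A minimal classifier over HTTP status codes might look like this; the exact mapping is a judgment call you should tune for your own dependencies:

```python
def classify(status: int) -> str:
    """Map an HTTP status to a retry decision."""
    if status in (401, 403, 422):
        return "fatal"       # auth/validation: fix inputs, don't brute-force
    if status == 429:
        return "throttled"   # back off hard or stop; don't hammer the API
    if status in (500, 502, 503, 504):
        return "retryable"   # transient server errors: a few retries, with backoff
    return "fatal"           # default to stopping on anything unknown
```

The important design choice is the last line: unknown errors default to *stop*, not *retry*. That is the opposite of what most retry wrappers do, and it is what keeps a new failure mode from becoming a loop.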

5) Add a kill switch you can use while sleepy

This is not a “feature”. This is the thing that prevents “we had to deploy at 03:00 to stop it”.

Add:

  • a global kill switch
  • a tool-level kill switch (disable browser)
  • a tenant-level kill switch (stop one noisy customer)
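
All three levels can share one flag store. A sketch with hypothetical scope names; in production the flags would live somewhere shared (Redis, a config service) so flipping one never requires a deploy:

```python
class KillSwitches:
    """In-memory stand-in for a shared flag store."""

    def __init__(self) -> None:
        self.flags: set[str] = set()

    def trip(self, scope: str) -> None:
        # scope examples: "global", "tool:browser", "tenant:acme"
        self.flags.add(scope)

    def is_on(self, *, tool: str = "", tenant: str = "") -> bool:
        if "global" in self.flags:
            return True
        if tool and f"tool:{tool}" in self.flags:
            return True
        return bool(tenant) and f"tenant:{tenant}" in self.flags
```

The agent loop then checks `is_on(tool=action.name, tenant=run.tenant)` before every step, so a tripped switch takes effect mid-run.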

6) Return partial results instead of “stopped”

If the user sees “stopped”, they refresh. If they refresh, you loop again.

Return:

  • why you stopped
  • what you tried
  • the partial artifacts (notes, URLs, drafts)

7) Measure “loop rate”

We track:

  • percentage of runs that hit the budget
  • percentage of runs that trigger repeat detection
  • top tool signatures that repeat

This tells you which tool is flaky and which prompt is pushing “try again”.
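
These rates fall out of the stop reasons you already log. A sketch, assuming each run record carries a `stop_reason` string and the tool signatures it repeated (both field names are hypothetical):

```python
from collections import Counter


def loop_metrics(runs: list[dict]) -> dict:
    """Aggregate budget hits, loop detections, and top repeat signatures."""
    n = max(len(runs), 1)  # avoid dividing by zero on an empty window
    budget_hits = sum(r["stop_reason"].endswith("budget") for r in runs)
    loops = sum(r["stop_reason"] == "loop_detected" for r in runs)
    top = Counter(sig for r in runs for sig in r.get("repeated", []))
    return {
        "budget_rate": budget_hits / n,
        "loop_rate": loops / n,
        "top_repeats": top.most_common(3),
    }
```

Run this over a rolling window (say, the last hour) and alert when `loop_rate` jumps: that spike usually points at one flaky tool or one bad prompt change.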

The expensive lesson: loops are an ops problem

A lot of teams try to solve loops by:

  • rewriting prompts
  • switching models
  • adding more reasoning steps

Sometimes that helps. But most loops are caused by:

  • flaky dependencies
  • missing budgets
  • missing stop reasons
  • missing policy

That’s ops. Build the stack.

Picking budget numbers (so you don’t guess forever)

People always ask: “What should max_steps be?”

Annoying answer: it depends. Useful answer: start with a default and tune with metrics.

Here’s a reasonable starting point for many agents:

  • max steps: 20–30
  • max seconds: 30–90
  • max tool calls: ~10–25 total, with per-tool caps

Then tune based on what you see:

  • if completion rate is low and runs end by time budget → increase seconds a bit or reduce tool latency
  • if completion rate is high but costs are scary → tighten budgets and improve caching/dedupe
  • if loops are common → improve repeat detection and error classification

The trick: tighten budgets until it breaks, then fix the agent. Don’t loosen budgets until it “works”. That’s how you pay for bugs.

Canonicalize args (or your loop guard won’t catch anything)

Repeat detection sounds easy until you realize your args are never identical.

Common ways teams accidentally defeat their own loop guards:

  • they include timestamp or request_id inside tool args
  • they pass unordered JSON and hash raw strings (key order changes)
  • they include “debug” fields that change every run

We had a real incident where repeat detection was “enabled”… but every tool call included a random nonce. So the signature was always unique. The agent looped, our detector stayed quiet, and we learned a fun lesson about false confidence.

Fix:

  • canonicalize args before hashing (sort keys, drop volatile fields)
  • hash the parts that matter (query, url, ids) not the whole blob
  • treat “same intent” repeats as a loop too (soft loop)

If your agent is calling the same endpoint with different whitespace, you still want the guard to trip. Also: log the canonical signature you hashed. Otherwise you’ll stare at a dashboard and still not know what repeated.
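
A minimal canonical signature: drop volatile fields, serialize with sorted keys, hash what's left. The volatile-field list here is illustrative; yours will depend on your tool schemas:

```python
import hashlib
import json

# Fields that change every call and would make every signature unique.
VOLATILE = {"timestamp", "request_id", "nonce", "trace_id", "debug"}


def signature(tool: str, args: dict) -> str:
    stable = {k: v for k, v in args.items() if k not in VOLATILE}
    # sort_keys + fixed separators => the same args always serialize identically
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return f"{tool}:{hashlib.sha256(blob.encode()).hexdigest()[:16]}"
```

Normalizing string values too (lowercasing, collapsing whitespace) catches even more near-duplicates, and logging the canonical form alongside the hash tells you *what* repeated, not just that something did.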

Why people do this wrong

  • They add retries everywhere “for reliability”.
  • They don’t distinguish retryable vs fatal errors.
  • They only log the final output, not the action trace.

Trade-offs

  • Aggressive loop detection can stop legitimate repeated actions.
  • Tight budgets can reduce completion rate on hard tasks.
  • The alternative is worse: unbounded spend and infinite latency.

When NOT to use “just add more retries”

If a tool is failing consistently, retries are not reliability. They’re denial.

Not sure this is your use case?

Design your agent ->
⏱️ 12 min read · Updated Mar 2026 · Difficulty: ★★☆
Implement in OnceOnly: guardrails for loops, retries, and spend escalation.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Kill switch & incident stop
  • Audit logs & traceability
  • Idempotency & dedupe
  • Tool permissions (allowlist / blocklist)
Integrated mention: OnceOnly is a control layer for production agent systems.
Example policy (concept)
# Example (Python — conceptual)
policy = {
  "budgets": {"steps": 20, "seconds": 60, "usd": 1.0},
  "controls": {"kill_switch": True, "audit": True},
}
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.