AI Agent Infinite Loop (How to Detect + Fix, With Code)

  • Spot the failure early before the bill climbs.
  • Learn what breaks in production and why.
  • Copy guardrails: budgets, stop reasons, validation.
  • Know when this isn’t the real root cause.
Detection signals
  • Tool calls per run spike (or repeat with the same args hash).
  • Spend or tokens per request climbs without better outputs.
  • Retries shift from rare to constant (429/5xx).
Your agent is looping. It’s 03:00. The bill is climbing. Here’s what causes loops, what breaks, and the kill-switches we actually use.
On this page
  1. Quick take
  2. The problem
  3. Why this happens in real systems
  4. What breaks if you ignore it
  5. Code: loop detection + budgets (the parts you’ll be glad you wrote)
  6. Loops aren’t one thing (a quick taxonomy)
      • 1) Hard loop (same action, same args)
      • 2) Soft loop (same intent, tiny arg changes)
      • 3) Semantic loop (progress illusion)
  7. Detection signals we actually use
  8. Progress budgets (the guard most teams forget)
  9. The kill switch (don’t make it a code deploy)
  10. Cancellation (because users leave, but your agent keeps spending)
  11. What to show the user when you stop
  12. A production-shaped loop guard (TypeScript sketch)
  13. Example incident (numbers are illustrative)
  14. Fix checklist (the boring moves that stop the bleeding)
      • 1) Add a hard global budget
      • 2) Add per-tool budgets
      • 3) Add repeat detection (args hash)
      • 4) Classify errors (retryable vs fatal)
      • 5) Add a kill switch you can use *while sleepy*
      • 6) Return partial results instead of “stopped”
      • 7) Measure “loop rate”
  15. The expensive lesson: loops are an ops problem
  16. Picking budget numbers (so you don’t guess forever)
  17. Canonicalize args (or your loop guard won’t catch anything)
  18. Why people do this wrong
  19. Trade-offs
  20. When NOT to use “just add more retries”
  21. Link it up
Interactive flow (step 1/2: Execution). Normal path: execute → tool → observe.

Quick take

  • Make loops cheap to stop: max steps, max seconds, max tool calls.
  • Detect repeats (tool + canonical args) and “no progress” streaks.
  • Classify errors (retryable vs fatal) and don’t double-retry (agent + tool).
  • Add an operator kill switch and propagate cancellation (client disconnect).

The problem

Your agent is looping forever.

Maybe it already cost you $200. Maybe it’s also hammering a third-party API and getting you rate-limited.

Either way, it’s not “thinking”. It’s stuck.

Diagram: loop guard (what stops infinite runs).

Why this happens in real systems

Loops are usually one of these:

  • the tool is flaky (timeouts/429s) and the model keeps retrying
  • the prompt rewards “try again” instead of “stop”
  • the agent has no explicit stop condition (“keep going until solved”)
  • state isn’t persisted, so it repeats the same step after every error
  • your “planner” emits the same action because the observation is useless

What breaks if you ignore it

  • Cost and latency explode.
  • External systems throttle you.
  • You lose customer trust because the UI spins forever.
  • Your incident response gets harder because logs are unstructured.

Code: loop detection + budgets (the parts you’ll be glad you wrote)

PYTHON
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Budget:
  max_steps: int = 25
  max_seconds: int = 60


class KillSwitch(RuntimeError):
  pass


class LoopDetected(RuntimeError):
  pass


class Monitor:
  def __init__(self, *, max_repeat: int = 3):
      self.max_repeat = max_repeat
      self.counts: dict[str, int] = {}

  def mark(self, key: str) -> None:
      self.counts[key] = self.counts.get(key, 0) + 1
      if self.counts[key] >= self.max_repeat:
          raise LoopDetected(f"loop: repeated {self.max_repeat}x: {key}")


def run_with_limits(task: str, *, budget: Budget, kill_switch, tools) -> str:
  started = time.time()
  monitor = Monitor(max_repeat=3)

  for step in range(budget.max_steps):
      if kill_switch.is_on():  # (pseudo)
          raise KillSwitch("killed by operator")
      if time.time() - started > budget.max_seconds:
          return "stopped: time budget exceeded"

      action = decide_next_action(task)  # (pseudo)
      monitor.mark(f"{action.name}:{action.args}")

      obs = tools.call(action.name, args=action.args)  # must be safe
      task = update_state(task, action, obs)  # (pseudo)

  return "stopped: step budget exceeded"
JAVASCRIPT
export class KillSwitch extends Error {}
export class LoopDetected extends Error {}

export class Monitor {
  constructor({ maxRepeat = 3 } = {}) {
    this.maxRepeat = maxRepeat;
    this.counts = new Map();
  }

  mark(key) {
    const next = (this.counts.get(key) || 0) + 1;
    this.counts.set(key, next);
    if (next >= this.maxRepeat) {
      throw new LoopDetected("loop: repeated " + this.maxRepeat + "x: " + key);
    }
  }
}

export async function runWithLimits(task, { budget, killSwitch, tools }) {
  const started = Date.now();
  const monitor = new Monitor({ maxRepeat: 3 });

  for (let step = 0; step < budget.max_steps; step++) {
    if (killSwitch && killSwitch.isOn && killSwitch.isOn()) {
      throw new KillSwitch("killed by operator");
    }
    if ((Date.now() - started) / 1000 > budget.max_seconds) {
      return "stopped: time budget exceeded";
    }

    const action = await decideNextAction(task); // (pseudo)
    monitor.mark(String(action.name) + ":" + JSON.stringify(action.args));

    const obs = await tools.call(action.name, { args: action.args }); // must be safe
    task = updateState(task, action, obs); // (pseudo)
  }

  return "stopped: step budget exceeded";
}

Loops aren’t one thing (a quick taxonomy)

When someone says “the agent is looping”, they usually mean one of these:

1) Hard loop (same action, same args)

Example:

  • web.search(q="X") → timeout
  • web.search(q="X") → timeout
  • repeat until you go broke

This is the easiest to detect and the easiest to fix: hash the (tool, args) and stop after N repeats.

2) Soft loop (same intent, tiny arg changes)

Example:

  • web.search(q="X")
  • web.search(q="X site:foo.com")
  • web.search(q="X foo")
  • web.search(q="X latest")

The model is “trying different things”, but it’s not making progress.

Fix:

  • budget unique searches
  • detect “no new observations” for N steps
  • force it to write a partial answer or escalate

3) Semantic loop (progress illusion)

Example:

  • it keeps “summarizing” the same content in different words
  • it keeps asking the same question to the user
  • it keeps “planning” without acting

This one is insidious because logs can look busy while nothing changes.

Fix:

  • measure progress explicitly (new facts extracted, new URLs, new artifacts)
  • stop when the progress metric stays flat

Detection signals we actually use

There’s no single magic detector. You want a handful of cheap signals:

  1. Repeat signature: same tool+args hash N times
  2. No new artifacts: no new notes/URLs/tickets in N steps
  3. Cost slope: cost per step climbing with no gain
  4. Time slope: wall time climbing with no gain
  5. External pressure: 429s / throttling from dependencies

You can implement 80% of this with counters and hashes.

Progress budgets (the guard most teams forget)

Step budgets are blunt. They stop the bleeding, but they don’t tell you why the agent is stuck.

Progress budgets are what we use to keep “busy loops” from burning money:

  • if we’re not learning anything new, we stop
  • if we’re not producing new artifacts, we stop
  • if the tool layer keeps returning the same shaped error, we stop

Progress depends on the section:

  • Research agent: new unique URLs / citations, not “more summaries”
  • Support agent: state transitions (triage → reproduce → propose fix), not “ask the user again”
  • Browser tool: new DOM targets / extracted fields, not “scroll more”

Here’s a simple version:

PYTHON
from typing import Any


class LoopDetected(RuntimeError):
  pass


class Progress:
  def __init__(self, *, max_flat_steps: int = 5):
      self.max_flat_steps = max_flat_steps
      self.seen_urls: set[str] = set()
      self.flat_steps = 0

  def update(self, observation: dict[str, Any]) -> None:
      urls = set(observation.get("urls", []))
      new = urls - self.seen_urls
      if new:
          self.seen_urls |= new
          self.flat_steps = 0
          return

      self.flat_steps += 1
      if self.flat_steps >= self.max_flat_steps:
          raise LoopDetected("no progress: no new urls/artifacts")
JAVASCRIPT
export class LoopDetected extends Error {}

export class Progress {
  constructor({ maxFlatSteps = 5 } = {}) {
    this.maxFlatSteps = maxFlatSteps;
    this.seenUrls = new Set();
    this.flatSteps = 0;
  }

  update(observation) {
    const urls = new Set((observation && observation.urls) || []);
    let hasNew = false;
    for (const u of urls) {
      if (!this.seenUrls.has(u)) {
        this.seenUrls.add(u);
        hasNew = true;
      }
    }

    if (hasNew) {
      this.flatSteps = 0;
      return;
    }

    this.flatSteps += 1;
    if (this.flatSteps >= this.maxFlatSteps) {
      throw new LoopDetected("no progress: no new urls/artifacts");
    }
  }
}

Is it perfect? No. Does it catch “search the same thing forever” loops? Absolutely.

The biggest mindset shift: treat “no progress” as a valid stop condition.

The kill switch (don’t make it a code deploy)

Every production agent needs an operator stop:

  • one click in a dashboard
  • stops current runs
  • optionally blocks new runs for a route/tenant/tool

If your kill switch requires a deploy, it’s not a kill switch. It’s a wish.

Practical add-ons:

  • route-level circuit breaker (“disable browser tool for tenant X”)
  • tool-level circuit breaker (“disable browser.get globally”)
  • spend breaker (“stop runs if $/min crosses threshold”)
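
A spend breaker is mostly bookkeeping: track recent spend in a sliding window and trip when the rate crosses a threshold. A minimal sketch; the class name, threshold, and window here are illustrative, not from any framework:

```python
import time


class SpendBreaker:
    """Trips when spend inside a sliding window crosses a dollar threshold."""

    def __init__(self, max_usd_per_min: float = 1.0, window_s: int = 60):
        self.max_usd_per_min = max_usd_per_min
        self.window_s = window_s
        self.events: list[tuple[float, float]] = []  # (timestamp, usd)

    def record(self, usd: float) -> None:
        now = time.time()
        self.events.append((now, usd))
        # drop events that have aged out of the window
        self.events = [(t, c) for t, c in self.events if now - t <= self.window_s]

    def tripped(self) -> bool:
        return sum(c for _, c in self.events) > self.max_usd_per_min
```

In production you would feed `record()` from billing or token-usage events and check `tripped()` in the same place you check the kill switch.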

Cancellation (because users leave, but your agent keeps spending)

One of the dumbest ways to lose money: the user closes the tab, but the agent keeps running.

If you don’t wire cancellation through your stack, your system will happily:

  • keep calling tools
  • keep paying token costs
  • and then throw the result away because nobody’s listening

At minimum:

  • cancel on client disconnect
  • propagate the signal into model calls and tool calls
  • log stop_reason = client_cancel (so you can see it)
TS
export async function handler(req: Request) {
  const controller = new AbortController();
  // pseudo: when the client disconnects, abort
  onClientDisconnect(req, () => controller.abort());

  return runAgent("...", { signal: controller.signal });
}

This isn’t “nice to have”. It’s cost control.

What to show the user when you stop

If you just return “stopped”, users will hit refresh and you’ll loop again.

Return something actionable:

  • stop reason (time/steps/loop detected/policy denied)
  • what was tried (top 5 tool calls)
  • partial output (notes, URLs, drafts)
  • next action (“try again later”, “need human approval”, “tool down”)

This reduces repeated runs more than any prompt tweak.
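
As a sketch, the stop payload can be a small structured result that the UI renders; the field names here are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class StopResult:
    stop_reason: str   # e.g. "time_budget" | "step_budget" | "loop_detected" | "policy_denied"
    tried: list[str]   # tool-call signatures, most recent first
    partial: dict      # notes, URLs, drafts collected so far
    next_action: str   # e.g. "try again later", "needs human approval"


def render_stop(result: StopResult) -> str:
    # Human-readable summary: reason, top 5 attempts, suggested next step.
    lines = [f"Stopped: {result.stop_reason}"]
    lines += [f"- tried: {sig}" for sig in result.tried[:5]]
    lines.append(f"Next: {result.next_action}")
    return "\n".join(lines)
```

The same structure logs cleanly, so incident review and the user-facing message come from one source of truth.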

A production-shaped loop guard (TypeScript sketch)

TS
type ToolCall = { tool: string; argsHash: string; ms: number; ok: boolean };

export class LoopGuard {
  private counts = new Map<string, number>();
  private recent: ToolCall[] = [];

  constructor(private maxRepeat: number) {}

  record(call: ToolCall) {
    const key = `${call.tool}:${call.argsHash}`;
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    this.recent.push(call);
    if (this.recent.length > 50) this.recent.shift();

    if (next >= this.maxRepeat) {
      throw new Error(`loop detected: ${key} repeated ${next}x`);
    }
  }
}

This is not “AI magic”. It’s just the minimum maturity to run a loop with tools.

Example incident (numbers are illustrative)

One team shipped an agent that tried to “log in and retry”. Login started returning 401s because a cookie format changed.

The agent:

  • kept retrying login every ~2 seconds
  • ran for ~30 minutes (no budget)
  • triggered account lockouts
  • burned ~$70 in browser/tool credits across a few dozen runs

Fix:

  • step/time budgets
  • loop detection on repeated tool args
  • treat auth errors as fatal (don’t retry forever)

Fix checklist (the boring moves that stop the bleeding)

When you’re in an incident, you don’t want theory. You want knobs.

Here’s the checklist we use to stop loops quickly:

1) Add a hard global budget

  • max steps (e.g., 25)
  • max seconds (e.g., 60)

If you don’t know what numbers to pick:

  • start conservative (lower)
  • measure completion rate vs cost
  • raise carefully

2) Add per-tool budgets

Global budgets stop the run. Per-tool budgets stop the runaway dependency.

Examples:

  • browser.get: max 6 calls
  • web.search: max 3 calls
  • db.read: max 10 calls

Why? Because the most common loop is “one flaky tool ruins the run”.
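
A per-tool cap is just a counter per tool name, checked before each call. A minimal sketch; the limits are illustrative:

```python
class ToolBudgetExceeded(RuntimeError):
    pass


class ToolBudget:
    def __init__(self, limits: dict[str, int]):
        self.limits = limits              # e.g. {"browser.get": 6, "web.search": 3}
        self.used: dict[str, int] = {}

    def spend(self, tool: str) -> None:
        # Count the call, then fail if this tool went over its cap.
        self.used[tool] = self.used.get(tool, 0) + 1
        limit = self.limits.get(tool)
        if limit is not None and self.used[tool] > limit:
            raise ToolBudgetExceeded(f"{tool}: {self.used[tool]} calls > max {limit}")
```

Call `spend(action.name)` right before each tool call, next to the global step and time checks.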

3) Add repeat detection (args hash)

Stop after N repeats of the same signature:

  • tool name + canonical args

This catches the obvious loops fast.

4) Classify errors (retryable vs fatal)

Most teams treat “error” as “try again”.

In production:

  • 429 is not a suggestion. It’s a stop signal.
  • auth errors (401/403) should usually be fatal
  • validation errors should be fatal (fix inputs, don’t brute force)
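
A minimal classifier over HTTP status codes might look like this; the exact mapping is a judgment call you should tune for your own dependencies:

```python
def classify(status: int) -> str:
    """Map an HTTP status to a retry decision."""
    if status in (401, 403, 422):
        return "fatal"       # auth/validation: fix inputs, don't brute-force
    if status == 429:
        return "throttled"   # back off hard or stop; don't hammer the API
    if status in (500, 502, 503, 504):
        return "retryable"   # transient server errors: a few retries, with backoff
    return "fatal"           # default to stopping on anything unknown
```

The important design choice is the last line: unknown errors default to *stop*, not *retry*. That is the opposite of what most retry wrappers do, and it is what keeps a new failure mode from becoming a loop.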

5) Add a kill switch you can use while sleepy

This is not a “feature”. This is the thing that prevents “we had to deploy at 03:00 to stop it”.

Add:

  • a global kill switch
  • a tool-level kill switch (disable browser)
  • a tenant-level kill switch (stop one noisy customer)
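
All three levels can share one flag store. A sketch with hypothetical scope names; in production the flags would live somewhere shared (Redis, a config service) so flipping one never requires a deploy:

```python
class KillSwitches:
    """In-memory stand-in for a shared flag store."""

    def __init__(self) -> None:
        self.flags: set[str] = set()

    def trip(self, scope: str) -> None:
        # scope examples: "global", "tool:browser", "tenant:acme"
        self.flags.add(scope)

    def is_on(self, *, tool: str = "", tenant: str = "") -> bool:
        if "global" in self.flags:
            return True
        if tool and f"tool:{tool}" in self.flags:
            return True
        return bool(tenant) and f"tenant:{tenant}" in self.flags
```

The agent loop then checks `is_on(tool=action.name, tenant=run.tenant)` before every step, so a tripped switch takes effect mid-run.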

6) Return partial results instead of “stopped”

If the user sees “stopped”, they refresh. If they refresh, you loop again.

Return:

  • why you stopped
  • what you tried
  • the partial artifacts (notes, URLs, drafts)

7) Measure “loop rate”

We track:

  • percentage of runs that hit the budget
  • percentage of runs that trigger repeat detection
  • top tool signatures that repeat

This tells you which tool is flaky and which prompt is pushing “try again”.
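
These rates fall out of the stop reasons you already log. A sketch, assuming each run record carries a `stop_reason` string and the tool signatures it repeated (both field names are hypothetical):

```python
from collections import Counter


def loop_metrics(runs: list[dict]) -> dict:
    """Aggregate budget hits, loop detections, and top repeat signatures."""
    n = max(len(runs), 1)  # avoid dividing by zero on an empty window
    budget_hits = sum(r["stop_reason"].endswith("budget") for r in runs)
    loops = sum(r["stop_reason"] == "loop_detected" for r in runs)
    top = Counter(sig for r in runs for sig in r.get("repeated", []))
    return {
        "budget_rate": budget_hits / n,
        "loop_rate": loops / n,
        "top_repeats": top.most_common(3),
    }
```

Run this over a rolling window (say, the last hour) and alert when `loop_rate` jumps: that spike usually points at one flaky tool or one bad prompt change.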

The expensive lesson: loops are an ops problem

A lot of teams try to solve loops by:

  • rewriting prompts
  • switching models
  • adding more reasoning steps

Sometimes that helps. But most loops are caused by:

  • flaky dependencies
  • missing budgets
  • missing stop reasons
  • missing policy

That’s ops. Build the stack.

Picking budget numbers (so you don’t guess forever)

People always ask: “What should max_steps be?”

Annoying answer: it depends. Useful answer: start with a default and tune with metrics.

Here’s a reasonable starting point for many agents:

  • max steps: 20–30
  • max seconds: 30–90
  • max tool calls: ~10–25 total, with per-tool caps

Then tune based on what you see:

  • if completion rate is low and runs end by time budget → increase seconds a bit or reduce tool latency
  • if completion rate is high but costs are scary → tighten budgets and improve caching/dedupe
  • if loops are common → improve repeat detection and error classification

The trick: tighten budgets until it breaks, then fix the agent. Don’t loosen budgets until it “works”. That’s how you pay for bugs.

Canonicalize args (or your loop guard won’t catch anything)

Repeat detection sounds easy until you realize your args are never identical.

Common ways teams accidentally defeat their own loop guards:

  • they include timestamp or request_id inside tool args
  • they pass unordered JSON and hash raw strings (key order changes)
  • they include “debug” fields that change every run

We had a real incident where repeat detection was “enabled”… but every tool call included a random nonce. So the signature was always unique. The agent looped, our detector stayed quiet, and we learned a fun lesson about false confidence.

Fix:

  • canonicalize args before hashing (sort keys, drop volatile fields)
  • hash the parts that matter (query, url, ids) not the whole blob
  • treat “same intent” repeats as a loop too (soft loop)

If your agent is calling the same endpoint with different whitespace, you still want the guard to trip. Also: log the canonical signature you hashed. Otherwise you’ll stare at a dashboard and still not know what repeated.
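
A minimal canonical signature: drop volatile fields, serialize with sorted keys, hash what's left. The volatile-field list here is illustrative; yours will depend on your tool schemas:

```python
import hashlib
import json

# Fields that change every call and would make every signature unique.
VOLATILE = {"timestamp", "request_id", "nonce", "trace_id", "debug"}


def signature(tool: str, args: dict) -> str:
    stable = {k: v for k, v in args.items() if k not in VOLATILE}
    # sort_keys + fixed separators => the same args always serialize identically
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return f"{tool}:{hashlib.sha256(blob.encode()).hexdigest()[:16]}"
```

Normalizing string values too (lowercasing, collapsing whitespace) catches even more near-duplicates, and logging the canonical form alongside the hash tells you *what* repeated, not just that something did.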

Why people do this wrong

  • They add retries everywhere “for reliability”.
  • They don’t distinguish retryable vs fatal errors.
  • They only log the final output, not the action trace.

Trade-offs

  • Aggressive loop detection can stop legitimate repeated actions.
  • Tight budgets can reduce completion rate on hard tasks.
  • The alternative is worse: unbounded spend and infinite latency.

When NOT to use “just add more retries”

If a tool is failing consistently, retries are not reliability. They’re denial.

Not sure this is your use case?

Design your agent ->
⏱️ 12 min read · Updated Mar 2026 · Difficulty: ★★☆
Implement in OnceOnly: guardrails for loops, retries, and spend escalation.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Kill switch & incident stop
  • Audit logs & traceability
  • Idempotency & dedupe
  • Tool permissions (allowlist / blocklist)
Integrated mention: OnceOnly is a control layer for production agent systems.
Example policy (concept)
# Example (Python — conceptual)
policy = {
  "budgets": {"steps": 20, "seconds": 60, "usd": 1.0},
  "controls": {"kill_switch": True, "audit": True},
}
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.