You ship an agent.
It costs “a few cents” in testing.
Then it hits production traffic and someone posts in Slack:
“Why did we spend $900 on the agent yesterday?”
Budget explosions are rarely one big bug. They’re death by a thousand cuts:
- token usage drifts up
- retries multiply
- tool calls become loops
- prompts get bigger “just this once”
If you don’t measure and cap budgets, you’ll learn about spend from finance. Finance is not a monitoring system.
Quick take
- Budgets leak via prompt bloat + retries + tool spam, not one big “bug”.
- Cap time, steps, tool calls, and spend per run, and always return a stop reason.
- Track tokens + tool calls + estimated cost per run so you can alert before finance does.
Why this fails in production
Costs compound in agent systems.
1) Tokens scale with context, not with intent
Intent: “summarize this”. Implementation: “paste the last 40 messages + 6 tool outputs + 2 runbooks”.
Token costs scale with what you feed the model, not what the user asked.
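One guardrail here is to cap the prompt before the model ever sees it. A minimal sketch in Python; the 4-characters-per-token heuristic and the 2,500-token cap are illustrative, so swap in your tokenizer's real counts and your own limit:

def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); swap in your tokenizer's real count.
    return max(1, len(text) // 4)


def build_context(messages: list[str], max_prompt_tokens: int = 2500) -> list[str]:
    # Keep the most recent messages that fit under the cap; drop older ones.
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):
        cost = approx_tokens(msg)
        if used + cost > max_prompt_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))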
2) Retries multiply cost
If a model call fails and you retry:
- you pay twice
- you add latency
If a tool call fails and you retry:
- you pay in tool costs
- and you often pay in more model tokens because you explain the failure
Retries are not free. In agent loops they’re multiplicative.
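A concrete guardrail: cap attempts explicitly, because every attempt is paid for. This is a sketch, not tied to any particular SDK; the helper name and the backoff policy are made up:

import time


def call_with_retries(call, *, max_attempts: int = 2, backoff_s: float = 1.0):
    # Every attempt costs tokens and latency, so the attempt cap is the real budget control.
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:  # narrow to your client's transient errors in real code
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # simple linear backoff
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error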
3) “Planning” is pure overhead
Planning-heavy agents burn tokens before doing anything useful. That’s fine when it prevents tool spam. It’s not fine when it’s just “more thinking”.
4) Tool spam makes budgets meaningless
If you don’t cap tool calls, the agent can spend $0.01 on model tokens and $5 on tools. Your “token budget” didn’t protect you. Because it wasn’t the budget you needed.
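If your tools have real per-call prices, budget them in dollars as well as in call counts. A sketch with hypothetical tool names and prices:

# Hypothetical per-call prices, for illustration only.
TOOL_USD = {"web_search": 0.01, "enrich_record": 0.25}


def charge_tool(spent_usd: float, tool_name: str, max_tool_usd: float = 5.0) -> float:
    # Add the tool's price to the running total and stop the run if it crosses the cap.
    spent_usd += TOOL_USD.get(tool_name, 0.0)
    if spent_usd > max_tool_usd:
        raise RuntimeError("tool spend budget exceeded")
    return spent_usd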
5) You don’t know spend unless you log it
If your logs don’t include:
- model tokens in/out
- tool calls count
- per-run cost estimate
- stop reason
…you can’t alert on spend drift.
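A minimal shape for that per-run record, assuming the Usage object from the implementation example below; the field names and the print-based sink are placeholders for your own logging pipeline:

import json


def log_run_usage(run_id: str, usage, stop_reason: str) -> None:
    # One structured record per run; this is what spend alerts key off.
    record = {
        "run_id": run_id,
        "model_tokens_in": usage.model_tokens_in,
        "model_tokens_out": usage.model_tokens_out,
        "tool_calls": usage.tool_calls,
        "estimated_usd": round(usage.estimated_usd, 4),
        "stop_reason": stop_reason,
    }
    print(json.dumps(record))  # swap for your logging/metrics pipeline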
Implementation example (real code)
Below is a minimal per-run budget tracker, shown first in Python and then in JavaScript:
- stops on time, steps, tool calls
- estimates cost (roughly) and stops on spend
- returns a stop reason you can alert on
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Budget:
    max_steps: int = 25
    max_seconds: int = 60
    max_tool_calls: int = 12
    max_usd: float = 1.00


@dataclass
class Usage:
    tool_calls: int = 0
    model_tokens_in: int = 0
    model_tokens_out: int = 0
    estimated_usd: float = 0.0


class BudgetExceeded(RuntimeError):
    pass


def estimate_usd(tokens_in: int, tokens_out: int) -> float:
    # Replace with real pricing for your model(s).
    # This is a placeholder to show the pattern; pricing varies by provider and model.
    return (tokens_in + tokens_out) * 0.000002  # $/token (placeholder)


class BudgetGuard:
    def __init__(self, budget: Budget) -> None:
        self.budget = budget
        self.usage = Usage()
        self.started = time.time()
        self.steps = 0

    def check_step(self) -> None:
        self.steps += 1
        if self.steps > self.budget.max_steps:
            raise BudgetExceeded("step budget exceeded")
        if time.time() - self.started > self.budget.max_seconds:
            raise BudgetExceeded("time budget exceeded")

    def on_tool_call(self) -> None:
        self.usage.tool_calls += 1
        if self.usage.tool_calls > self.budget.max_tool_calls:
            raise BudgetExceeded("tool budget exceeded")

    def on_model_call(self, *, tokens_in: int, tokens_out: int) -> None:
        self.usage.model_tokens_in += tokens_in
        self.usage.model_tokens_out += tokens_out
        self.usage.estimated_usd = estimate_usd(
            self.usage.model_tokens_in, self.usage.model_tokens_out
        )
        if self.usage.estimated_usd > self.budget.max_usd:
            raise BudgetExceeded("cost budget exceeded")


def run(task: str, *, budget: Budget) -> str:
    guard = BudgetGuard(budget)
    while True:
        guard.check_step()
        # model call (pseudo)
        action, tokens_in, tokens_out = llm_decide(task)  # (pseudo)
        guard.on_model_call(tokens_in=tokens_in, tokens_out=tokens_out)
        if action.kind == "tool":
            guard.on_tool_call()
            result = call_tool(action.name, action.args)  # (pseudo)
            task = update_state(task, action, result)  # (pseudo)
        else:
            return action.final_answer

The same guard in JavaScript:

export class BudgetExceeded extends Error {}
export class BudgetGuard {
  constructor(budget) {
    this.budget = budget;
    this.started = Date.now();
    this.steps = 0;
    this.usage = { toolCalls: 0, tokensIn: 0, tokensOut: 0, estimatedUsd: 0 };
  }

  estimateUsd(tokensIn, tokensOut) {
    // Replace with real pricing for your provider/model.
    // This is a placeholder to show the pattern.
    return (tokensIn + tokensOut) * 0.000002;
  }

  checkStep() {
    this.steps += 1;
    const elapsedS = (Date.now() - this.started) / 1000;
    if (this.steps > this.budget.maxSteps) throw new BudgetExceeded("step budget exceeded");
    if (elapsedS > this.budget.maxSeconds) throw new BudgetExceeded("time budget exceeded");
  }

  onToolCall() {
    this.usage.toolCalls += 1;
    if (this.usage.toolCalls > this.budget.maxToolCalls) throw new BudgetExceeded("tool budget exceeded");
  }

  onModelCall({ tokensIn, tokensOut }) {
    this.usage.tokensIn += tokensIn;
    this.usage.tokensOut += tokensOut;
    this.usage.estimatedUsd = this.estimateUsd(this.usage.tokensIn, this.usage.tokensOut);
    if (this.usage.estimatedUsd > this.budget.maxUsd) throw new BudgetExceeded("cost budget exceeded");
  }
}

export function run(task, { budget }) {
  const guard = new BudgetGuard(budget);
  while (true) {
    guard.checkStep();
    // model call (pseudo)
    const { action, tokensIn, tokensOut } = llmDecide(task); // (pseudo)
    guard.onModelCall({ tokensIn, tokensOut });
    if (action.kind === "tool") {
      guard.onToolCall();
      const result = callTool(action.name, action.args); // (pseudo)
      task = updateState(task, action, result); // (pseudo)
    } else {
      return action.final_answer;
    }
  }
}

The key detail: budgets are checked continuously, not just at the end. You want to stop before you hit the cliff.
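One way to wire this up so callers see a stop reason instead of an unhandled exception, assuming the Python run, Budget, and BudgetExceeded above (the wrapper name and return shape are just a suggestion):

def run_with_stop_reason(task: str) -> dict:
    # Wrap run() so callers always get an answer (or None) plus a stop reason to alert on.
    budget = Budget(max_steps=25, max_seconds=60, max_tool_calls=12, max_usd=1.00)
    try:
        answer = run(task, budget=budget)
        return {"answer": answer, "stop_reason": "completed"}
    except BudgetExceeded as exc:
        return {"answer": None, "stop_reason": str(exc)}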
Example failure case (incident-style, numbers are illustrative)
We had an agent that ran fine in dev at ~3k tokens/request.
Then we added “helpful context”:
- last 20 user messages
- full tool outputs (including HTML)
- a runbook snippet
Prompt size drifted. Nobody noticed.
Impact over 48 hours (example numbers):
- median tokens/request: 3k → 16k
- p95 latency: 2.4s → 8.9s
- spend: +$740 vs baseline
Fix:
- hard budgets (tokens, tool calls, time, spend)
- prompt builder with caps + summarization
- alerting on tokens/request and spend/run
- safe-mode fallback when budgets hit
This wasn’t “the model got worse”. We fed it more and hoped the bill wouldn’t notice.
Trade-offs
- Tight budgets increase “stopped early” responses. That’s fine — better than runaway spend.
- Spend estimation is approximate. It doesn’t need to be perfect to be useful.
- Summaries save tokens but can lose nuance. Use them where it’s safe.
When NOT to use
- If you can’t estimate cost at all (multiple models/tools), start with time/tool budgets first.
- If the workload is deterministic, a workflow with fixed costs is a better choice.
- If you need long-context reasoning, plan for a bigger budget and make it explicit.
Copy-paste checklist
- [ ] Budgets: steps, tool calls, seconds, USD
- [ ] Track tokens in/out per run
- [ ] Estimate spend per run and alert on spikes
- [ ] Cap retries (model + tool)
- [ ] Cap untrusted text size (HTML/tool dumps)
- [ ] Summarize or truncate over-budget context
- [ ] Return a stop reason (don’t silently timeout)
Safe default config snippet (YAML)
budgets:
  max_steps: 25
  max_seconds: 60
  max_tool_calls: 12
  max_usd: 1.0
llm:
  retries: { max_attempts: 2 }
context:
  max_prompt_tokens: 2500
  summarize_when_over_budget: true
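If it helps, here is a sketch of loading that config into the Budget dataclass from the implementation example. It assumes PyYAML is installed and a file named agent_budgets.yaml; both are placeholders, not part of the pattern:

import yaml  # PyYAML, assumed to be installed


def load_budget(path: str = "agent_budgets.yaml") -> Budget:
    # Map the budgets section of the YAML above onto the Budget dataclass.
    with open(path) as f:
        cfg = yaml.safe_load(f)["budgets"]
    return Budget(
        max_steps=cfg["max_steps"],
        max_seconds=cfg["max_seconds"],
        max_tool_calls=cfg["max_tool_calls"],
        max_usd=cfg["max_usd"],
    )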
Related pages
- Foundations: How LLM limits affect agents · What makes an agent production-ready
- Failure: Tool spam loops · Infinite loop
- Governance: Tool permissions (allowlists)
- Production stack: Production agent stack