Budget Explosion (When Agents Burn Money) + Fixes + Code

  • Spot the failure early before the bill climbs.
  • Learn what breaks in production and why.
  • Copy guardrails: budgets, stop reasons, validation.
  • Know when this isn’t the real root cause.
Detection signals
  • Tool calls per run spikes (or repeats with same args hash).
  • Spend or tokens per request climbs without better outputs.
  • Retries shift from rare to constant (429/5xx).
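The "repeats with same args hash" signal is cheap to compute. A minimal sketch (helper names are illustrative, not from any specific library):

```python
import hashlib
import json
from collections import Counter


def args_hash(tool: str, args: dict) -> str:
    """Stable hash of a tool call, so exact repeats can be counted."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


def repeated_calls(calls: list[tuple[str, dict]], threshold: int = 3) -> list[str]:
    """Return hashes of tool calls repeated at least `threshold` times in a run."""
    counts = Counter(args_hash(tool, args) for tool, args in calls)
    return [h for h, n in counts.items() if n >= threshold]
```

Alert when this list is non-empty: an agent re-issuing the identical call is usually looping, not making progress.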
Budgets don’t fail all at once. They leak via retries, prompt bloat, and tool spam. Here’s how budget explosions happen in production and how to cap spend per run.
On this page
  1. Problem-first intro
  2. Quick take
  3. Why this fails in production
  4. 1) Tokens scale with context, not with intent
  5. 2) Retries multiply cost
  6. 3) “Planning” is pure overhead
  7. 4) Tool spam makes budgets meaningless
  8. 5) You don’t know spend unless you log it
  9. Implementation example (real code)
  10. Example failure case (incident-style, numbers are illustrative)
  11. Trade-offs
  12. When NOT to use
  13. Copy-paste checklist
  14. Safe default config snippet (JSON/YAML)
  15. FAQ
Problem-first intro

You ship an agent.

It costs “a few cents” in testing.

Then it hits production traffic and someone posts in Slack:

“Why did we spend $900 on the agent yesterday?”

Budget explosions are rarely one big bug. They’re death by a thousand cuts:

  • token usage drifts up
  • retries multiply
  • tool calls become loops
  • prompts get bigger “just this once”

If you don’t measure and cap budgets, you’ll learn about spend from finance. Finance is not a monitoring system.

Quick take

  • Budgets leak via prompt bloat + retries + tool spam, not one big “bug”.
  • Cap time, steps, tool calls, and spend per run, and always return a stop reason.
  • Track tokens + tool calls + estimated cost per run so you can alert before finance does.

Why this fails in production

Costs compound in agent systems.

1) Tokens scale with context, not with intent

Intent: “summarize this”. Implementation: “paste the last 40 messages + 6 tool outputs + 2 runbooks”.

Token costs scale with what you feed the model, not what the user asked.
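One way to enforce this is to cap context at build time instead of hoping it stays small. A sketch, assuming a crude ~4-characters-per-token ratio; use your model's tokenizer for real counts:

```python
def rough_tokens(text: str) -> int:
    # Crude approximation (~4 chars/token); swap in your tokenizer for real counts.
    return len(text) // 4


def build_context(messages: list[str], max_tokens: int = 2500) -> list[str]:
    """Walk backwards from the newest message and stop at the token cap."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):
        cost = rough_tokens(msg)
        if used + cost > max_tokens:
            break  # older messages get dropped (or summarized) here
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

The point is that the cap lives in the prompt builder, so "paste the last 40 messages" physically cannot happen.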

2) Retries multiply cost

If a model call fails and you retry:

  • you pay twice
  • you add latency

If a tool call fails and you retry:

  • you pay in tool costs
  • and you often pay in more model tokens because you explain the failure

Retries are not free. In agent loops they’re multiplicative.
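A bounded-retry sketch (names and defaults are illustrative). Every attempt is paid for, so the cap matters more than the backoff:

```python
import time


def call_with_retries(fn, *, max_attempts: int = 2, base_delay_s: float = 0.2):
    """Retry a flaky call a bounded number of times; every attempt costs money."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this to retryable errors (429/5xx) in real code
            last_error = exc
            time.sleep(base_delay_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Count these attempts against the same per-run budget as everything else; a retry loop inside an agent loop is where "multiplicative" bites.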

3) “Planning” is pure overhead

Planning-heavy agents burn tokens before doing anything useful. That’s fine when it prevents tool spam. It’s not fine when it’s just “more thinking”.

4) Tool spam makes budgets meaningless

If you don’t cap tool calls, the agent can spend $0.01 on model tokens and $5 on tools. Your “token budget” didn’t protect you, because it wasn’t the budget you needed.
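One way to get the budget you actually need: meter tool spend separately from token spend, so either one can stop the run. The per-tool prices below are hypothetical:

```python
# Hypothetical per-tool prices; real costs depend on your providers.
TOOL_COST_USD = {"web_search": 0.01, "geocode": 0.005}


class ToolBudget:
    """Track tool spend separately from token spend; either can blow the budget."""

    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, tool: str) -> None:
        self.spent_usd += TOOL_COST_USD.get(tool, 0.0)
        if self.spent_usd > self.max_usd:
            raise RuntimeError(f"tool budget exceeded after {tool}")
```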

5) You don’t know spend unless you log it

If your logs don’t include:

  • model tokens in/out
  • tool calls count
  • per-run cost estimate
  • stop reason

…you can’t alert on spend drift.
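A minimal per-run record, assuming one JSON line per run shipped to your log pipeline (field names are illustrative):

```python
import json


def run_record(run_id: str, usage: dict, stop_reason: str) -> str:
    """One JSON line per run; alert on drift in tokens, tool calls, or cost."""
    return json.dumps({
        "run_id": run_id,
        "tokens_in": usage["tokens_in"],
        "tokens_out": usage["tokens_out"],
        "tool_calls": usage["tool_calls"],
        "estimated_usd": round(usage["estimated_usd"], 4),
        "stop_reason": stop_reason,
    })
```

With this in place, "tokens/request doubled this week" becomes a dashboard query instead of a finance escalation.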

[Diagram: What budgets must cover]

Implementation example (real code)

This is a minimal per-run budget tracker:

  • stops on time, steps, tool calls
  • estimates cost (roughly) and stops on spend
  • returns a stop reason you can alert on
Python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Budget:
  max_steps: int = 25
  max_seconds: int = 60
  max_tool_calls: int = 12
  max_usd: float = 1.00


@dataclass
class Usage:
  tool_calls: int = 0
  model_tokens_in: int = 0
  model_tokens_out: int = 0
  estimated_usd: float = 0.0


class BudgetExceeded(RuntimeError):
  pass


def estimate_usd(tokens_in: int, tokens_out: int) -> float:
  # Replace with real pricing for your model(s).
  # This is a placeholder to show the pattern; pricing varies by provider and model.
  return (tokens_in + tokens_out) * 0.000002  # $/token (placeholder)


class BudgetGuard:
  def __init__(self, budget: Budget) -> None:
      self.budget = budget
      self.usage = Usage()
      self.started = time.time()
      self.steps = 0

  def check_step(self) -> None:
      self.steps += 1
      if self.steps > self.budget.max_steps:
          raise BudgetExceeded("step budget exceeded")
      if time.time() - self.started > self.budget.max_seconds:
          raise BudgetExceeded("time budget exceeded")

  def on_tool_call(self) -> None:
      self.usage.tool_calls += 1
      if self.usage.tool_calls > self.budget.max_tool_calls:
          raise BudgetExceeded("tool budget exceeded")

  def on_model_call(self, *, tokens_in: int, tokens_out: int) -> None:
      self.usage.model_tokens_in += tokens_in
      self.usage.model_tokens_out += tokens_out
      self.usage.estimated_usd = estimate_usd(
          self.usage.model_tokens_in, self.usage.model_tokens_out
      )
      if self.usage.estimated_usd > self.budget.max_usd:
          raise BudgetExceeded("cost budget exceeded")


def run(task: str, *, budget: Budget) -> str:
  guard = BudgetGuard(budget)

  while True:
      guard.check_step()

      # model call (pseudo)
      action, tokens_in, tokens_out = llm_decide(task)  # (pseudo)
      guard.on_model_call(tokens_in=tokens_in, tokens_out=tokens_out)

      if action.kind == "tool":
          guard.on_tool_call()
          result = call_tool(action.name, action.args)  # (pseudo)
          task = update_state(task, action, result)  # (pseudo)
      else:
          return action.final_answer
JavaScript
export class BudgetExceeded extends Error {}

export class BudgetGuard {
  constructor(budget) {
    this.budget = budget;
    this.started = Date.now();
    this.steps = 0;
    this.usage = { toolCalls: 0, tokensIn: 0, tokensOut: 0, estimatedUsd: 0 };
  }

  estimateUsd(tokensIn, tokensOut) {
    // Replace with real pricing for your provider/model.
    // This is a placeholder to show the pattern.
    return (tokensIn + tokensOut) * 0.000002;
  }

  checkStep() {
    this.steps += 1;
    const elapsedS = (Date.now() - this.started) / 1000;
    if (this.steps > this.budget.maxSteps) throw new BudgetExceeded("step budget exceeded");
    if (elapsedS > this.budget.maxSeconds) throw new BudgetExceeded("time budget exceeded");
  }

  onToolCall() {
    this.usage.toolCalls += 1;
    if (this.usage.toolCalls > this.budget.maxToolCalls) throw new BudgetExceeded("tool budget exceeded");
  }

  onModelCall({ tokensIn, tokensOut }) {
    this.usage.tokensIn += tokensIn;
    this.usage.tokensOut += tokensOut;
    this.usage.estimatedUsd = this.estimateUsd(this.usage.tokensIn, this.usage.tokensOut);
    if (this.usage.estimatedUsd > this.budget.maxUsd) throw new BudgetExceeded("cost budget exceeded");
  }
}

export function run(task, { budget }) {
  const guard = new BudgetGuard(budget);

  while (true) {
    guard.checkStep();

    // model call (pseudo)
    const { action, tokensIn, tokensOut } = llmDecide(task); // (pseudo)
    guard.onModelCall({ tokensIn, tokensOut });

    if (action.kind === "tool") {
      guard.onToolCall();
      const result = callTool(action.name, action.args); // (pseudo)
      task = updateState(task, action, result); // (pseudo)
    } else {
      return action.final_answer;
    }
  }
}

The key detail: budgets are checked continuously, not just at the end. You want to stop before you hit the cliff.

Example failure case (incident-style, numbers are illustrative)

We had an agent that ran fine in dev at ~3k tokens/request.

Then we added “helpful context”:

  • last 20 user messages
  • full tool outputs (including HTML)
  • a runbook snippet

Prompt size drifted. Nobody noticed.

Impact over 48 hours (example numbers):

  • median tokens/request: 3k → 16k
  • p95 latency: 2.4s → 8.9s
  • spend: +$740 vs baseline

Fix:

  1. hard budgets (tokens, tool calls, time, spend)
  2. prompt builder with caps + summarization
  3. alerting on tokens/request and spend/run
  4. safe-mode fallback when budgets hit

This wasn’t “the model got worse”. We fed it more and hoped the bill wouldn’t notice.

Trade-offs

  • Tight budgets increase “stopped early” responses. That’s fine — better than runaway spend.
  • Spend estimation is approximate. It doesn’t need to be perfect to be useful.
  • Summaries save tokens but can lose nuance. Use them where it’s safe.

When NOT to use

  • If you can’t estimate cost at all (multiple models/tools), start with time/tool budgets first.
  • If the workload is deterministic, a workflow with fixed costs is a better choice.
  • If you need long-context reasoning, plan for a bigger budget and make it explicit.

Copy-paste checklist

  • [ ] Budgets: steps, tool calls, seconds, USD
  • [ ] Track tokens in/out per run
  • [ ] Estimate spend per run and alert on spikes
  • [ ] Cap retries (model + tool)
  • [ ] Cap untrusted text size (HTML/tool dumps)
  • [ ] Summarize or truncate over-budget context
  • [ ] Return a stop reason (don’t silently timeout)

Safe default config snippet (JSON/YAML)

YAML
budgets:
  max_steps: 25
  max_seconds: 60
  max_tool_calls: 12
  max_usd: 1.0
llm:
  retries: { max_attempts: 2 }
context:
  max_prompt_tokens: 2500
  summarize_when_over_budget: true
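A sketch of turning that config into the Budget dataclass from the Python example, assuming the YAML has already been parsed into a dict (the dataclass is re-declared here for self-containment):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    # Re-declared for self-containment; same fields as the tracker above.
    max_steps: int = 25
    max_seconds: int = 60
    max_tool_calls: int = 12
    max_usd: float = 1.0


def budget_from_config(config: dict) -> Budget:
    """Build a Budget from the parsed config; unknown keys fail immediately."""
    return Budget(**config.get("budgets", {}))
```

Failing on unknown keys is deliberate: a typo like `max_dollars` should break at startup, not silently leave the default cap in place.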

FAQ

Do I need exact cost accounting for budgets?
No. Guards can be approximate. The goal is to stop runaway runs before they become invoices.
What budget should I start with?
Time + tool calls. Then add token/spend once you can measure them.
How do I handle requests that need more budget?
Escalate: ask for confirmation, switch to a bigger budget tier, or run async with user-visible status.
Can I just set a huge budget and forget it?
You can, but you’re back to learning about failures from finance and on-call.
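The escalation answer above can be sketched as budget tiers (numbers are illustrative); the bigger tier is only reachable after explicit confirmation:

```python
# Hypothetical tiers; tune the numbers for your workload.
TIERS = {
    "default": {"max_tool_calls": 12, "max_usd": 1.0},
    "escalated": {"max_tool_calls": 40, "max_usd": 5.0},
}


def pick_tier(user_confirmed: bool) -> dict:
    """Default tier unless the user explicitly approved a bigger spend."""
    return TIERS["escalated" if user_confirmed else "default"]
```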

Implement in OnceOnly

Guardrails for loops, retries, and spend escalation.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Kill switch & incident stop
  • Audit logs & traceability
  • Idempotency & dedupe
  • Tool permissions (allowlist / blocklist)
OnceOnly is a control layer for production agent systems.
Example policy (concept)
# Example (Python — conceptual)
policy = {
  "budgets": {"steps": 20, "seconds": 60, "usd": 1.0},
  "controls": {"kill_switch": True, "audit": True},
}
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.