Action is proposed as structured data (tool + args).
Your agent "works" in staging.
Then production traffic hits and you learn two things:
- the agent is a loop, and loops don't stop out of kindness
- finance is not a monitoring system (but they will page you anyway)
We've seen the same pattern over and over:
- a flaky tool adds retries
- retries add tool calls
- tool calls add more model tokens ("here's what happened… try again")
- and suddenly your "few cents" agent is doing $8–$20 per run
At scale, that's not "a bug". That's a surprise subscription your CFO didn't sign up for.
Budgets aren't "cost optimization". They're safety controls. They decide what happens when the agent can't finish.
If you don't decide, the agent decides. And the agent's decision is usually: "one more try".
Why this fails in production
Budget failures are boring. That's why they ship.
1) Teams budget one thing and forget the rest
Common mistake: "we have a token budget".
Cool. Your agent just spent $0.04 on tokens and $6 on browser automation.
Production budgets need at least:
- max_steps (control loop length)
- max_seconds (wall clock time)
- max_tool_calls (blast radius)
- max_usd (the "nope" line)
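To see why a single dimension misses real spend, here's a toy comparison. All prices and thresholds are placeholders for illustration, not real vendor rates:

```python
# Toy example: a token-only budget vs. a USD budget. Rates are assumptions.
TOKEN_USD = 0.000002      # assumed flat per-token price
BROWSER_RUN_USD = 0.20    # assumed per-call price for browser automation

tokens_used = 20_000      # ~$0.04 of model spend
browser_runs = 30         # $6.00 of tool spend

model_usd = tokens_used * TOKEN_USD
tool_usd = browser_runs * BROWSER_RUN_USD

token_budget_ok = tokens_used < 100_000          # token-only budget: happily passes
usd_budget_ok = (model_usd + tool_usd) < 1.00    # USD budget: trips

print(round(model_usd, 2), round(tool_usd, 2), token_budget_ok, usd_budget_ok)
```

The token budget sees a cheap run; the USD budget sees the $6 of browser calls.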
2) Retries are multiplicative in loops
One retry isn't the problem. Retries inside an agent loop (plus tool retries) are a cost multiplier.
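The multiplication is easy to underestimate. A sketch with assumed retry settings:

```python
# Assumed retry configuration, for illustration only.
agent_attempts = 3      # agent-level: "try again, slightly different query"
steps_per_attempt = 5   # tool calls the agent makes per attempt
tool_retries = 4        # tool wrapper: each call retried up to 4x on flake

# Worst case when the dependency is down: every layer multiplies.
worst_case_calls = agent_attempts * steps_per_attempt * tool_retries
print(worst_case_calls)  # 60 tool calls for what "should" be 5
```

Each layer's retry policy looks reasonable in isolation; the product does not.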
3) Budgets without stop reasons are invisible
If the run ends with a timeout, users retry. That creates more runs.
You want explicit stop reasons:
- max_seconds
- max_tool_calls
- max_usd
- loop_detected
Stop reasons are observability.
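Once stop reasons are explicit, their distribution becomes a metric you can chart and alert on. A minimal aggregation sketch (the record shape is an assumption, mirroring the run result returned later in this page):

```python
from collections import Counter

# Hypothetical run records, e.g. pulled from your run log.
runs = [
    {"status": "ok"},
    {"status": "stopped", "stop_reason": "max_usd"},
    {"status": "stopped", "stop_reason": "max_seconds"},
    {"status": "stopped", "stop_reason": "max_usd"},
]

# Count how runs ended; successful runs bucket under "none".
reasons = Counter(r.get("stop_reason", "none") for r in runs)
print(reasons.most_common())
```

A sudden shift toward `max_usd` is your early-warning signal, not the invoice.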
4) Budget enforcement scattered across the codebase doesnât work
If budgets are checked:
- sometimes in the agent
- sometimes in the tool wrapper
- sometimes not at all
…you will miss a path.
Put budgets in one choke point: the run loop + tool gateway.
Implementation example (real code)
This is a production-shaped budget guard:
- checks budgets continuously (not "at the end")
- tracks model + tool cost (approx is fine)
- throws a typed stop reason you can log + alert on
```python
from dataclasses import dataclass, field
import time
from typing import Any

# Per-call tool prices. Replace with your real rates.
TOOL_USD = {
    "search.read": 0.00,
    "http.get": 0.00,
    "browser.run": 0.20,  # placeholder
}

@dataclass(frozen=True)
class BudgetPolicy:
    max_steps: int = 25
    max_seconds: int = 60
    max_tool_calls: int = 12
    max_usd: float = 1.00

@dataclass
class BudgetState:
    started_at: float = field(default_factory=time.time)
    steps: int = 0
    tool_calls: int = 0
    tokens_in: int = 0
    tokens_out: int = 0
    tool_usd: float = 0.0

    def elapsed_s(self) -> float:
        return time.time() - self.started_at

def estimate_model_usd(tokens_in: int, tokens_out: int) -> float:
    # Replace with your real pricing model(s). Approximate is fine for guards.
    return (tokens_in + tokens_out) * 0.000002

class BudgetExceeded(RuntimeError):
    def __init__(self, stop_reason: str, *, state: BudgetState):
        super().__init__(stop_reason)
        self.stop_reason = stop_reason
        self.state = state

class BudgetGuard:
    def __init__(self, policy: BudgetPolicy):
        self.policy = policy
        self.state = BudgetState()

    def total_usd(self) -> float:
        return estimate_model_usd(self.state.tokens_in, self.state.tokens_out) + self.state.tool_usd

    def check(self) -> None:
        if self.state.steps > self.policy.max_steps:
            raise BudgetExceeded("max_steps", state=self.state)
        if self.state.elapsed_s() > self.policy.max_seconds:
            raise BudgetExceeded("max_seconds", state=self.state)
        if self.state.tool_calls > self.policy.max_tool_calls:
            raise BudgetExceeded("max_tool_calls", state=self.state)
        if self.total_usd() > self.policy.max_usd:
            raise BudgetExceeded("max_usd", state=self.state)

    def on_step(self) -> None:
        self.state.steps += 1
        self.check()

    def on_model_call(self, *, tokens_in: int, tokens_out: int) -> None:
        self.state.tokens_in += tokens_in
        self.state.tokens_out += tokens_out
        self.check()

    def on_tool_call(self, *, tool: str) -> None:
        self.state.tool_calls += 1
        self.state.tool_usd += float(TOOL_USD.get(tool, 0.0))
        self.check()

def run_agent(task: str, *, policy: BudgetPolicy) -> dict[str, Any]:
    guard = BudgetGuard(policy)
    try:
        while True:
            guard.on_step()
            # model decides next action (pseudo)
            action, tokens_in, tokens_out = llm_decide(task)  # (pseudo)
            guard.on_model_call(tokens_in=tokens_in, tokens_out=tokens_out)
            if action.kind == "tool":
                guard.on_tool_call(tool=action.name)
                obs = call_tool(action.name, action.args)  # (pseudo)
                task = update_state(task, action, obs)  # (pseudo)
                continue
            return {"status": "ok", "answer": action.final_answer, "usage": guard.state.__dict__}
    except BudgetExceeded as e:
        return {
            "status": "stopped",
            "stop_reason": e.stop_reason,
            "usage": e.state.__dict__,
            "partial": "Stopped by budget. Return partial results + a reason users can understand.",
        }
```

The same guard in JavaScript:

```javascript
const TOOL_USD = {
  "search.read": 0.0,
  "http.get": 0.0,
  "browser.run": 0.2, // placeholder
};

export class BudgetExceeded extends Error {
  constructor(stopReason, { state }) {
    super(stopReason);
    this.stopReason = stopReason;
    this.state = state;
  }
}

export class BudgetGuard {
  constructor(policy) {
    this.policy = policy;
    this.state = {
      startedAtMs: Date.now(),
      steps: 0,
      toolCalls: 0,
      tokensIn: 0,
      tokensOut: 0,
      toolUsd: 0,
    };
  }

  elapsedS() {
    return (Date.now() - this.state.startedAtMs) / 1000;
  }

  estimateModelUsd(tokensIn, tokensOut) {
    return (tokensIn + tokensOut) * 0.000002;
  }

  totalUsd() {
    return this.estimateModelUsd(this.state.tokensIn, this.state.tokensOut) + this.state.toolUsd;
  }

  check() {
    if (this.state.steps > this.policy.maxSteps) throw new BudgetExceeded("max_steps", { state: this.state });
    if (this.elapsedS() > this.policy.maxSeconds) throw new BudgetExceeded("max_seconds", { state: this.state });
    if (this.state.toolCalls > this.policy.maxToolCalls) throw new BudgetExceeded("max_tool_calls", { state: this.state });
    if (this.totalUsd() > this.policy.maxUsd) throw new BudgetExceeded("max_usd", { state: this.state });
  }

  onStep() {
    this.state.steps += 1;
    this.check();
  }

  onModelCall({ tokensIn, tokensOut }) {
    this.state.tokensIn += tokensIn;
    this.state.tokensOut += tokensOut;
    this.check();
  }

  onToolCall({ tool }) {
    this.state.toolCalls += 1;
    this.state.toolUsd += Number(TOOL_USD[tool] || 0);
    this.check();
  }
}
```
Real failure case (incident-style, with numbers)
We shipped an internal "support helper" agent. It had a browser tool. No budgets. (Yes, really.)
Then the vendor search endpoint got flaky for ~90 minutes. The agent's strategy became: "try again, slightly different query".
Impact:
- tool calls/run: 4 → 31
- median latency: 6s → 58s
- spend: +$1,120 in one afternoon (mostly browser runs)
- on-call time: ~2.5 hours chasing "why is support slow?"
Fix:
- hard budgets per run (steps/time/tool calls/USD)
- explicit stop reasons returned to the UI
- alerting on tool_calls/run and stop_reason=max_usd
- a degrade mode: "no browser during vendor incidents"
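The degrade mode can be a tiny allowlist check in the tool gateway. A sketch (the names here are assumptions, not part of the guard above; the incident flag would come from a feature flag or incident-management hook in a real system):

```python
# Sketch of incident-aware tool gating. Names are illustrative assumptions.
INCIDENT_MODE = True
DEGRADED_TOOLS = {"browser.run"}  # expensive tools disabled during incidents

def tool_allowed(tool: str) -> bool:
    """Deny expensive tools while the incident flag is set; allow everything else."""
    if INCIDENT_MODE and tool in DEGRADED_TOOLS:
        return False
    return True

print(tool_allowed("http.get"), tool_allowed("browser.run"))
```

Because the check lives in the gateway choke point, no agent code path can bypass it.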
Budgets didn't make the agent smarter. They made it survivable.
Trade-offs
- Tight budgets will stop some legitimate hard cases.
- Cost estimation is approximate (but good enough to stop runaway runs).
- You'll need escalation paths (approve bigger budgets) for real "long" tasks.
When NOT to use
- If the task is deterministic, don't use an agent. Use a workflow with fixed costs.
- If you can't return partial output + stop reasons, budgets will look like random failures.
- If you can't measure anything, start with time/tool-call caps and add cost later.
Copy-paste checklist
- [ ] Enforce budgets in one choke point (loop + tool gateway)
- [ ] Cap: steps, seconds, tool calls, USD
- [ ] Track tokens + tool calls + rough spend
- [ ] Return stop reasons (not silent timeouts)
- [ ] Add budget tiers (default vs approved)
- [ ] Alert on spend spikes + stop_reason distribution changes
- [ ] Define degrade mode behavior during incidents
Safe default config snippet (JSON/YAML)
```yaml
budgets:
  default:
    max_steps: 25
    max_seconds: 60
    max_tool_calls: 12
    max_usd: 1.0
  approved:
    max_steps: 80
    max_seconds: 240
    max_tool_calls: 40
    max_usd: 8.0
stop_reasons:
  return_to_user: true
  log: true
  alert_on: ["max_usd", "max_seconds", "max_tool_calls"]
```
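Loading the tiers into the guard's policy object might look like this. A sketch: the dict literal stands in for your parsed YAML config, and `BudgetPolicy` is the dataclass from the implementation section (redeclared here so the snippet is self-contained):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BudgetPolicy:
    max_steps: int = 25
    max_seconds: int = 60
    max_tool_calls: int = 12
    max_usd: float = 1.00

# Parsed form of the YAML config above (use your config loader of choice).
BUDGET_TIERS = {
    "default": {"max_steps": 25, "max_seconds": 60, "max_tool_calls": 12, "max_usd": 1.0},
    "approved": {"max_steps": 80, "max_seconds": 240, "max_tool_calls": 40, "max_usd": 8.0},
}

def policy_for(tier: str) -> BudgetPolicy:
    """Build a policy for the tier, falling back to default if it is unknown."""
    return BudgetPolicy(**BUDGET_TIERS.get(tier, BUDGET_TIERS["default"]))

print(policy_for("approved").max_usd)
```

Falling back to the default tier on an unknown name means a typo degrades safely to the tighter budget instead of failing the run.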
FAQ
Q: What budget should we start with?
A: Start with time + tool-call caps. Then add USD once you can estimate model/tool costs. The first goal is stopping runaway loops.
Q: Should budgets be hard-fail or degrade?
A: Prefer degrade with a clear stop reason and partial output. Hard-failing trains users to spam retries.
Q: How do we handle tasks that need more budget?
A: Escalate: require approval, run async, or move to a higher budget tier with stricter logging.
Q: Do budgets replace rate limits?
A: No. Rate limits protect dependencies. Budgets protect you from your own loop.
Related pages
- Foundations: What makes an agent production-ready · How agents use tools
- Failure: Budget explosion · Tool spam loops
- Governance: Step limits · Cost limits
- Production stack: Production agent stack