You ship an agent.
It costs “a few cents” in testing.
Then it hits production traffic and someone posts in Slack:
“Why did we spend $900 on the agent yesterday?”
Budget explosions are rarely one big bug. They’re death by a thousand cuts:
- token usage drifts up
- retries multiply
- tool calls become loops
- prompts get bigger “just this once”
If you don’t measure and cap budgets, you’ll learn about spend from finance. Finance is not a monitoring system.
Quick take
- Budgets leak via prompt bloat + retries + tool spam, not one big “bug”.
- Cap time, steps, tool calls, and spend per run, and always return a stop reason.
- Track tokens + tool calls + estimated cost per run so you can alert before finance does.
Why this fails in production
Costs compound in agent systems.
1) Tokens scale with context, not with intent
Intent: “summarize this”. Implementation: “paste the last 40 messages + 6 tool outputs + 2 runbooks”.
Token costs scale with what you feed the model, not what the user asked.
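One guardrail here is to cap the prompt before the model ever sees it. A minimal sketch in Python; the 4-characters-per-token heuristic and the 2,500-token cap are illustrative, so swap in your tokenizer's real counts and your own limit:

def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); swap in your tokenizer's real count.
    return max(1, len(text) // 4)


def build_context(messages: list[str], max_prompt_tokens: int = 2500) -> list[str]:
    # Keep the most recent messages that fit under the cap; drop older ones.
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):
        cost = approx_tokens(msg)
        if used + cost > max_prompt_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))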
2) Retries multiply cost
If a model call fails and you retry:
- you pay twice
- you add latency
If a tool call fails and you retry:
- you pay in tool costs
- and you often pay in more model tokens because you explain the failure
Retries are not free. In agent loops they’re multiplicative.
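A concrete guardrail: cap attempts explicitly, because every attempt is paid for. This is a sketch, not tied to any particular SDK; the helper name and the backoff policy are made up:

import time


def call_with_retries(call, *, max_attempts: int = 2, backoff_s: float = 1.0):
    # Every attempt costs tokens and latency, so the attempt cap is the real budget control.
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:  # narrow to your client's transient errors in real code
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # simple linear backoff
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error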
3) “Planning” is pure overhead
Planning-heavy agents burn tokens before doing anything useful. That’s fine when it prevents tool spam. It’s not fine when it’s just “more thinking”.
4) Tool spam makes budgets meaningless
If you don’t cap tool calls, the agent can spend $0.01 on model tokens and $5 on tools. Your “token budget” didn’t protect you. Because it wasn’t the budget you needed.
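If your tools have real per-call prices, budget them in dollars as well as in call counts. A sketch with hypothetical tool names and prices:

# Hypothetical per-call prices, for illustration only.
TOOL_USD = {"web_search": 0.01, "enrich_record": 0.25}


def charge_tool(spent_usd: float, tool_name: str, max_tool_usd: float = 5.0) -> float:
    # Add the tool's price to the running total and stop the run if it crosses the cap.
    spent_usd += TOOL_USD.get(tool_name, 0.0)
    if spent_usd > max_tool_usd:
        raise RuntimeError("tool spend budget exceeded")
    return spent_usd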
5) You don’t know spend unless you log it
If your logs don’t include:
- model tokens in/out
- tool calls count
- per-run cost estimate
- stop reason
…you can’t alert on spend drift.
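A minimal shape for that per-run record, assuming the Usage object from the implementation example below; the field names and the print-based sink are placeholders for your own logging pipeline:

import json


def log_run_usage(run_id: str, usage, stop_reason: str) -> None:
    # One structured record per run; this is what spend alerts key off.
    record = {
        "run_id": run_id,
        "model_tokens_in": usage.model_tokens_in,
        "model_tokens_out": usage.model_tokens_out,
        "tool_calls": usage.tool_calls,
        "estimated_usd": round(usage.estimated_usd, 4),
        "stop_reason": stop_reason,
    }
    print(json.dumps(record))  # swap for your logging/metrics pipeline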
Implementation example (real code)
Below is a minimal per-run budget tracker, shown first in Python and then in JavaScript:
- stops on time, steps, tool calls
- estimates cost (roughly) and stops on spend
- returns a stop reason you can alert on
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Budget:
    max_steps: int = 25
    max_seconds: int = 60
    max_tool_calls: int = 12
    max_usd: float = 1.00


@dataclass
class Usage:
    tool_calls: int = 0
    model_tokens_in: int = 0
    model_tokens_out: int = 0
    estimated_usd: float = 0.0


class BudgetExceeded(RuntimeError):
    pass


def estimate_usd(tokens_in: int, tokens_out: int) -> float:
    # Replace with real pricing for your model(s).
    # This is a placeholder to show the pattern; pricing varies by provider and model.
    return (tokens_in + tokens_out) * 0.000002  # $/token (placeholder)


class BudgetGuard:
    def __init__(self, budget: Budget) -> None:
        self.budget = budget
        self.usage = Usage()
        self.started = time.time()
        self.steps = 0

    def check_step(self) -> None:
        self.steps += 1
        if self.steps > self.budget.max_steps:
            raise BudgetExceeded("step budget exceeded")
        if time.time() - self.started > self.budget.max_seconds:
            raise BudgetExceeded("time budget exceeded")

    def on_tool_call(self) -> None:
        self.usage.tool_calls += 1
        if self.usage.tool_calls > self.budget.max_tool_calls:
            raise BudgetExceeded("tool budget exceeded")

    def on_model_call(self, *, tokens_in: int, tokens_out: int) -> None:
        self.usage.model_tokens_in += tokens_in
        self.usage.model_tokens_out += tokens_out
        self.usage.estimated_usd = estimate_usd(
            self.usage.model_tokens_in, self.usage.model_tokens_out
        )
        if self.usage.estimated_usd > self.budget.max_usd:
            raise BudgetExceeded("cost budget exceeded")


def run(task: str, *, budget: Budget) -> str:
    guard = BudgetGuard(budget)
    while True:
        guard.check_step()
        # model call (pseudo)
        action, tokens_in, tokens_out = llm_decide(task)  # (pseudo)
        guard.on_model_call(tokens_in=tokens_in, tokens_out=tokens_out)
        if action.kind == "tool":
            guard.on_tool_call()
            result = call_tool(action.name, action.args)  # (pseudo)
            task = update_state(task, action, result)  # (pseudo)
        else:
            return action.final_answer

The same guard in JavaScript:

export class BudgetExceeded extends Error {}
export class BudgetGuard {
  constructor(budget) {
    this.budget = budget;
    this.started = Date.now();
    this.steps = 0;
    this.usage = { toolCalls: 0, tokensIn: 0, tokensOut: 0, estimatedUsd: 0 };
  }

  estimateUsd(tokensIn, tokensOut) {
    // Replace with real pricing for your provider/model.
    // This is a placeholder to show the pattern.
    return (tokensIn + tokensOut) * 0.000002;
  }

  checkStep() {
    this.steps += 1;
    const elapsedS = (Date.now() - this.started) / 1000;
    if (this.steps > this.budget.maxSteps) throw new BudgetExceeded("step budget exceeded");
    if (elapsedS > this.budget.maxSeconds) throw new BudgetExceeded("time budget exceeded");
  }

  onToolCall() {
    this.usage.toolCalls += 1;
    if (this.usage.toolCalls > this.budget.maxToolCalls) throw new BudgetExceeded("tool budget exceeded");
  }

  onModelCall({ tokensIn, tokensOut }) {
    this.usage.tokensIn += tokensIn;
    this.usage.tokensOut += tokensOut;
    this.usage.estimatedUsd = this.estimateUsd(this.usage.tokensIn, this.usage.tokensOut);
    if (this.usage.estimatedUsd > this.budget.maxUsd) throw new BudgetExceeded("cost budget exceeded");
  }
}

export function run(task, { budget }) {
  const guard = new BudgetGuard(budget);
  while (true) {
    guard.checkStep();
    // model call (pseudo)
    const { action, tokensIn, tokensOut } = llmDecide(task); // (pseudo)
    guard.onModelCall({ tokensIn, tokensOut });
    if (action.kind === "tool") {
      guard.onToolCall();
      const result = callTool(action.name, action.args); // (pseudo)
      task = updateState(task, action, result); // (pseudo)
    } else {
      return action.final_answer;
    }
  }
}

The key detail: budgets are checked continuously, not just at the end. You want to stop before you hit the cliff.
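One way to wire this up so callers see a stop reason instead of an unhandled exception, assuming the Python run, Budget, and BudgetExceeded above (the wrapper name and return shape are just a suggestion):

def run_with_stop_reason(task: str) -> dict:
    # Wrap run() so callers always get an answer (or None) plus a stop reason to alert on.
    budget = Budget(max_steps=25, max_seconds=60, max_tool_calls=12, max_usd=1.00)
    try:
        answer = run(task, budget=budget)
        return {"answer": answer, "stop_reason": "completed"}
    except BudgetExceeded as exc:
        return {"answer": None, "stop_reason": str(exc)}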
Example failure case (incident-style, numbers are illustrative)
We had an agent that ran fine in dev at ~3k tokens/request.
Then we added “helpful context”:
- last 20 user messages
- full tool outputs (including HTML)
- a runbook snippet
Prompt size drifted. Nobody noticed.
Impact over 48 hours (example numbers):
- median tokens/request: 3k → 16k
- p95 latency: 2.4s → 8.9s
- spend: +$740 vs baseline
Fix:
- hard budgets (tokens, tool calls, time, spend)
- prompt builder with caps + summarization
- alerting on tokens/request and spend/run
- safe-mode fallback when budgets hit
This wasn’t “the model got worse”. We fed it more and hoped the bill wouldn’t notice.
Trade-offs
- Tight budgets increase “stopped early” responses. That’s fine — better than runaway spend.
- Spend estimation is approximate. It doesn’t need to be perfect to be useful.
- Summaries save tokens but can lose nuance. Use them where it’s safe.
When NOT to use
- If you can’t estimate cost at all (multiple models/tools), start with time/tool budgets first.
- If the workload is deterministic, a workflow with fixed costs is a better choice.
- If you need long-context reasoning, plan for a bigger budget and make it explicit.
Copy-paste checklist
- [ ] Budgets: steps, tool calls, seconds, USD
- [ ] Track tokens in/out per run
- [ ] Estimate spend per run and alert on spikes
- [ ] Cap retries (model + tool)
- [ ] Cap untrusted text size (HTML/tool dumps)
- [ ] Summarize or truncate over-budget context
- [ ] Return a stop reason (don’t silently timeout)
Safe default config snippet (YAML)
budgets:
  max_steps: 25
  max_seconds: 60
  max_tool_calls: 12
  max_usd: 1.0
llm:
  retries: { max_attempts: 2 }
context:
  max_prompt_tokens: 2500
  summarize_when_over_budget: true
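If it helps, here is a sketch of loading that config into the Budget dataclass from the implementation example. It assumes PyYAML is installed and a file named agent_budgets.yaml; both are placeholders, not part of the pattern:

import yaml  # PyYAML, assumed to be installed


def load_budget(path: str = "agent_budgets.yaml") -> Budget:
    # Map the budgets section of the YAML above onto the Budget dataclass.
    with open(path) as f:
        cfg = yaml.safe_load(f)["budgets"]
    return Budget(
        max_steps=cfg["max_steps"],
        max_seconds=cfg["max_seconds"],
        max_tool_calls=cfg["max_tool_calls"],
        max_usd=cfg["max_usd"],
    )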
Related pages
- Foundations: How LLM limits affect agents · What makes an agent production-ready
- Failure: Tool spam loops · Infinite loop
- Governance: Tool permissions (allowlists)
- Production stack: Production agent stack