Normal path: execute → tool → observe.
Quick take
- If you don’t cap context, it grows. Token overuse is the default.
- Prompt bloat causes truncation (policy dies first) and latency/cost spikes.
- Treat tool outputs (HTML/logs) as token bombs: extract → summarize → budget.
- Use a context builder with a hard budget (chars now, tokens soon).
Problem-first intro
Everything is fine.
Then your agent gets “smarter”.
It starts including:
- more history
- more tool output
- more “helpful context”
Latency creeps up. Costs creep up. And then one day policy gets truncated and the agent starts acting weird.
Token overuse isn’t a theoretical problem. It’s a production incident you can graph.
Why this fails in production
1) Context grows by default
If you don’t cap it, context grows:
- chat history grows
- memory grows
- tool outputs grow (especially HTML)
The model doesn’t know your budgets. It will happily eat everything you feed it.
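The cheapest cap is on history itself. A minimal sketch (hypothetical turn format; the point is the `maxlen`, not the message schema):

```python
from collections import deque

# Keep only the last N turns instead of the full history.
MAX_TURNS = 4
history: deque[str] = deque(maxlen=MAX_TURNS)

for turn in range(10):
    history.append(f"turn {turn}: user said something, tool returned something")

# Without the cap, history would hold 10 entries; with it, only the newest 4.
print(len(history))   # 4
print(history[0])     # oldest surviving turn is turn 6
```

This doesn't replace summarization; it just guarantees history stops growing without anyone having to remember to trim it.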
2) Prompt bloat causes truncation
Large prompts increase the chance that:
- the top of your prompt is dropped
- the “system/policy” portion is the first casualty
When policy is missing, the model doesn’t suddenly become evil. It becomes unconstrained.
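One cheap mitigation: never let policy compete with history for the same budget. A sketch, assuming a hypothetical `(role, text)` message format (not any specific chat API):

```python
def trim_history(messages: list[tuple[str, str]], max_chars: int) -> list[tuple[str, str]]:
    """Drop the oldest non-system messages first. The system/policy message
    is budgeted separately, so trimming can never silently delete constraints."""
    system = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]
    budget = max_chars - sum(len(text) for _, text in system)
    kept: list[tuple[str, str]] = []
    used = 0
    for msg in reversed(rest):  # walk newest-first
        if used + len(msg[1]) > budget:
            break
        kept.append(msg)
        used += len(msg[1])
    return system + list(reversed(kept))
```

Old user turns fall off; the policy message always survives.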
3) Tool outputs are token bombs
Raw HTML, logs, and stack traces are huge. If you paste them verbatim:
- tokens explode
- and the model often can’t even use the content effectively
Extract → summarize → include a small slice.
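A minimal "extract" step for HTML, sketched with the standard library (a real pipeline would use a proper readability extractor, but even this removes most of the token bomb):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; drop tags, scripts, and styles."""

    def __init__(self) -> None:
        super().__init__()
        self._skip = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str, max_chars: int = 2_000) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # Budget the slice that actually enters the context.
    return " ".join(parser.parts)[:max_chars]
```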
4) Memory makes it worse
Memory systems that “store everything” tend to “show everything”. That’s not memory. That’s prompt inflation.
5) You can’t fix this with “just use a bigger context model”
Bigger context windows:
- cost more
- are slower
- and still truncate (just later)
The fix is a context budget and a context builder.
Implementation example (real code)
This is a cheap, practical prompt budgeter:
- caps total context
- keeps the newest items
- summarizes (placeholder) when over budget
Use characters here; in production you’ll want token counting.
```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ContextBudget:
    max_chars: int = 12_000


def summarize(text: str) -> str:
    # In real code: a dedicated summarization step with its own budget.
    return text[:2000] + "…"


def build_context(chunks: Iterable[str], *, budget: ContextBudget) -> str:
    ctx = "\n\n".join(chunks)
    if len(ctx) <= budget.max_chars:
        return ctx
    # Over budget: summarize the oldest parts first.
    # This is simplistic, but it prevents unbounded growth.
    over = len(ctx) - budget.max_chars
    head = ctx[: over + 1000]
    tail = ctx[over + 1000:]
    return summarize(head) + "\n\n" + tail
```

The same budgeter in JavaScript:

```javascript
export function summarize(text) {
  // Real code: dedicated summarization with its own budget.
  return text.slice(0, 2000) + "…";
}

export function buildContext(chunks, { maxChars = 12_000 } = {}) {
  const ctx = chunks.join("\n\n");
  if (ctx.length <= maxChars) return ctx;
  // Summarize the oldest part.
  const over = ctx.length - maxChars;
  const head = ctx.slice(0, over + 1000);
  const tail = ctx.slice(over + 1000);
  return summarize(head) + "\n\n" + tail;
}
```

This won’t give you perfect “memory”. It gives you the one thing you need first: a budget that prevents runaway prompts.
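When you graduate from characters to tokens, make the counter pluggable so the builder itself doesn't change. A sketch: the default below is a rough chars/4 heuristic, not a real tokenizer; swap in your model's tokenizer in production.

```python
def rough_token_count(text: str) -> int:
    # Heuristic: ~4 chars per token for English text. Replace with your
    # model's real tokenizer; this only bounds the damage in the meantime.
    return max(1, len(text) // 4)


def fit_newest(chunks: list[str], max_tokens: int,
               count_tokens=rough_token_count) -> list[str]:
    """Keep the newest chunks that fit the token budget; drop oldest first."""
    kept: list[str] = []
    used = 0
    for chunk in reversed(chunks):
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return list(reversed(kept))
```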
Example incident (numbers are illustrative)
Example: an agent that answered “why did this job fail?” questions. It included the full stack trace and logs in the prompt.
It worked, until a customer pasted a 2MB log blob.
Impact:
- tokens/request spiked from ~4k → 45k
- p95 latency spiked from 3.2s → 19s
- spend spiked by ~$520 in one day
- worst part: prompts truncated policy text, leading to unsafe tool suggestions
Fix:
- input caps (max chars) on user-provided logs
- extract structured fields from logs (errors, timestamps) instead of raw dumps
- context budgeter + summarization tier
- metrics + alerts on tokens/request
Logs are useful. Raw logs are not a prompt format.
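"Extract structured fields instead of raw dumps" can be sketched with one regex. The log format below is hypothetical (`timestamp LEVEL message`); adjust the pattern to your logs:

```python
import re

# Hypothetical log line format: "2024-05-01T12:00:00 LEVEL message"
LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def extract_errors(raw_log: str, max_lines: int = 20) -> list[dict]:
    """Pull only ERROR lines as structured records. A 2MB blob collapses
    to at most `max_lines` small dicts instead of a token bomb."""
    records: list[dict] = []
    for line in raw_log.splitlines():
        m = LINE.match(line)
        if m and m.group("level") == "ERROR":
            records.append({"ts": m.group("ts"), "msg": m.group("msg")})
            if len(records) >= max_lines:
                break
    return records
```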
Trade-offs
- Summaries lose details (but details in raw dumps weren’t helping anyway).
- Strict caps can reject some “power user” requests. Offer async uploads instead.
- Token counting adds complexity. It pays for itself quickly at scale.
When NOT to use
- If you need exact reasoning over long documents, an agent loop may be the wrong tool. Use targeted retrieval + workflows.
- If you can’t safely summarize untrusted text, don’t include it wholesale.
- If you can’t measure tokens, start with char budgets today and fix tokens next.
Copy-paste checklist
- [ ] Cap user-provided text size (logs, HTML, PDFs)
- [ ] Cap tool output size before it enters context
- [ ] Context builder with a hard budget (tokens/chars)
- [ ] Summarization tier with its own budget
- [ ] Repeat critical policy constraints every turn (survives truncation)
- [ ] Metrics: tokens/request, latency, spend/run
- [ ] Alerts on spikes and drift
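The last two items can start as something this small: a sketch of an in-process tokens/request tracker with a spike flag. In production you'd emit these samples to your metrics backend and alert there instead.

```python
class TokenMeter:
    """Track tokens/request and flag spikes against a rolling average."""

    def __init__(self, spike_factor: float = 3.0) -> None:
        self.spike_factor = spike_factor
        self.samples: list[int] = []

    def record(self, tokens: int) -> bool:
        """Record one request; return True if it looks like a spike."""
        baseline = (sum(self.samples) / len(self.samples)) if self.samples else None
        self.samples.append(tokens)
        return baseline is not None and tokens > baseline * self.spike_factor
```

The 4k → 45k jump from the incident above would have tripped this on the first bad request.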
Safe default config snippet (YAML)

```yaml
context:
  max_prompt_tokens: 2500
  max_untrusted_chars: 8000
  summarize_when_over_budget: true
policy:
  repeat_critical_constraints_every_turn: true
metrics:
  track: ["tokens_per_request", "latency_p95", "spend_per_run"]
```
Related pages
- Foundations: Agent memory types · How LLM limits affect agents
- Failure: Budget explosion · Hallucinated sources
- Governance: Tool permissions (allowlists)
- Production stack: Production agent stack