Normal path: execute → tool → observe.
Quick take
- If you don’t cap context, it grows. Token overuse is the default.
- Prompt bloat causes truncation (policy dies first) and latency/cost spikes.
- Treat tool outputs (HTML/logs) as token bombs: extract → summarize → budget.
- Use a context builder with a hard budget (chars now, tokens soon).
Problem-first intro
Everything is fine.
Then your agent gets “smarter”.
It starts including:
- more history
- more tool output
- more “helpful context”
Latency creeps up. Costs creep up. And then one day policy gets truncated and the agent starts acting weird.
Token overuse isn’t a theoretical problem. It’s a production incident you can graph.
Why this fails in production
1) Context grows by default
If you don’t cap it, context grows:
- chat history grows
- memory grows
- tool outputs grow (especially HTML)
The model doesn’t know your budgets. It will happily eat everything you feed it.
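The cheapest cap is on history itself. A minimal sketch (hypothetical turn format; the point is the `maxlen`, not the message schema):

```python
from collections import deque

# Keep only the last N turns instead of the full history.
MAX_TURNS = 4
history: deque[str] = deque(maxlen=MAX_TURNS)

for turn in range(10):
    history.append(f"turn {turn}: user said something, tool returned something")

# Without the cap, history would hold 10 entries; with it, only the newest 4.
print(len(history))   # 4
print(history[0])     # oldest surviving turn is turn 6
```

This doesn't replace summarization; it just guarantees history stops growing without anyone having to remember to trim it.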
2) Prompt bloat causes truncation
Large prompts increase the chance that:
- the top of your prompt is dropped
- the “system/policy” portion is the first casualty
When policy is missing, the model doesn’t suddenly become evil. It becomes unconstrained.
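One cheap mitigation: never let policy compete with history for the same budget. A sketch, assuming a hypothetical `(role, text)` message format (not any specific chat API):

```python
def trim_history(messages: list[tuple[str, str]], max_chars: int) -> list[tuple[str, str]]:
    """Drop the oldest non-system messages first. The system/policy message
    is budgeted separately, so trimming can never silently delete constraints."""
    system = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]
    budget = max_chars - sum(len(text) for _, text in system)
    kept: list[tuple[str, str]] = []
    used = 0
    for msg in reversed(rest):  # walk newest-first
        if used + len(msg[1]) > budget:
            break
        kept.append(msg)
        used += len(msg[1])
    return system + list(reversed(kept))
```

Old user turns fall off; the policy message always survives.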
3) Tool outputs are token bombs
Raw HTML, logs, and stack traces are huge. If you paste them verbatim:
- tokens explode
- and the model often can’t even use the content effectively
Extract → summarize → include a small slice.
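A minimal "extract" step for HTML, sketched with the standard library (a real pipeline would use a proper readability extractor, but even this removes most of the token bomb):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; drop tags, scripts, and styles."""

    def __init__(self) -> None:
        super().__init__()
        self._skip = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str, max_chars: int = 2_000) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # Budget the slice that actually enters the context.
    return " ".join(parser.parts)[:max_chars]
```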
4) Memory makes it worse
Memory systems that “store everything” tend to “show everything”. That’s not memory. That’s prompt inflation.
5) You can’t fix this with “just use a bigger context model”
Bigger context windows:
- cost more
- are slower
- and still truncate (just later)
The fix is a context budget and a context builder.
Implementation example (real code)
This is a cheap, practical prompt budgeter:
- caps total context
- keeps the newest items
- summarizes (placeholder) when over budget
Use characters here; in production you’ll want token counting.
```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ContextBudget:
    max_chars: int = 12_000


def summarize(text: str) -> str:
    # In real code: a dedicated summarization step with its own budget.
    return text[:2000] + "…"


def build_context(chunks: Iterable[str], *, budget: ContextBudget) -> str:
    ctx = "\n\n".join(chunks)
    if len(ctx) <= budget.max_chars:
        return ctx
    # Over budget: summarize the oldest parts first.
    # This is simplistic, but it prevents unbounded growth.
    over = len(ctx) - budget.max_chars
    head = ctx[: over + 1000]
    tail = ctx[over + 1000:]
    return summarize(head) + "\n\n" + tail
```

The same budgeter in JavaScript:

```javascript
export function summarize(text) {
  // Real code: dedicated summarization with its own budget.
  return text.slice(0, 2000) + "…";
}

export function buildContext(chunks, { maxChars = 12_000 } = {}) {
  const ctx = chunks.join("\n\n");
  if (ctx.length <= maxChars) return ctx;
  // Summarize the oldest part.
  const over = ctx.length - maxChars;
  const head = ctx.slice(0, over + 1000);
  const tail = ctx.slice(over + 1000);
  return summarize(head) + "\n\n" + tail;
}
```

This won’t give you perfect “memory”. It gives you the one thing you need first: a budget that prevents runaway prompts.
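When you graduate from characters to tokens, make the counter pluggable so the builder itself doesn't change. A sketch: the default below is a rough chars/4 heuristic, not a real tokenizer; swap in your model's tokenizer in production.

```python
def rough_token_count(text: str) -> int:
    # Heuristic: ~4 chars per token for English text. Replace with your
    # model's real tokenizer; this only bounds the damage in the meantime.
    return max(1, len(text) // 4)


def fit_newest(chunks: list[str], max_tokens: int,
               count_tokens=rough_token_count) -> list[str]:
    """Keep the newest chunks that fit the token budget; drop oldest first."""
    kept: list[str] = []
    used = 0
    for chunk in reversed(chunks):
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return list(reversed(kept))
```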
Example incident (numbers are illustrative)
Example: an agent that answered “why did this job fail?” questions. It included the full stack trace and logs in the prompt.
It worked, until a customer pasted a 2MB log blob.
Impact:
- tokens/request spiked from ~4k → 45k
- p95 latency spiked from 3.2s → 19s
- spend spiked by ~$520 in one day
- worst part: prompts truncated policy text, leading to unsafe tool suggestions
Fix:
- input caps (max chars) on user-provided logs
- extract structured fields from logs (errors, timestamps) instead of raw dumps
- context budgeter + summarization tier
- metrics + alerts on tokens/request
Logs are useful. Raw logs are not a prompt format.
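"Extract structured fields instead of raw dumps" can be sketched with one regex. The log format below is hypothetical (`timestamp LEVEL message`); adjust the pattern to your logs:

```python
import re

# Hypothetical log line format: "2024-05-01T12:00:00 LEVEL message"
LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def extract_errors(raw_log: str, max_lines: int = 20) -> list[dict]:
    """Pull only ERROR lines as structured records. A 2MB blob collapses
    to at most `max_lines` small dicts instead of a token bomb."""
    records: list[dict] = []
    for line in raw_log.splitlines():
        m = LINE.match(line)
        if m and m.group("level") == "ERROR":
            records.append({"ts": m.group("ts"), "msg": m.group("msg")})
            if len(records) >= max_lines:
                break
    return records
```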
Trade-offs
- Summaries lose details (but details in raw dumps weren’t helping anyway).
- Strict caps can reject some “power user” requests. Offer async uploads instead.
- Token counting adds complexity. It pays for itself quickly at scale.
When NOT to use
- If you need exact reasoning over long documents, an agent loop may be the wrong tool. Use targeted retrieval + workflows.
- If you can’t safely summarize untrusted text, don’t include it wholesale.
- If you can’t measure tokens, start with char budgets today and fix tokens next.
Copy-paste checklist
- [ ] Cap user-provided text size (logs, HTML, PDFs)
- [ ] Cap tool output size before it enters context
- [ ] Context builder with a hard budget (tokens/chars)
- [ ] Summarization tier with its own budget
- [ ] Repeat critical policy constraints every turn (survives truncation)
- [ ] Metrics: tokens/request, latency, spend/run
- [ ] Alerts on spikes and drift
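The last two items can start as something this small: a sketch of an in-process tokens/request tracker with a spike flag. In production you'd emit these samples to your metrics backend and alert there instead.

```python
class TokenMeter:
    """Track tokens/request and flag spikes against a rolling average."""

    def __init__(self, spike_factor: float = 3.0) -> None:
        self.spike_factor = spike_factor
        self.samples: list[int] = []

    def record(self, tokens: int) -> bool:
        """Record one request; return True if it looks like a spike."""
        baseline = (sum(self.samples) / len(self.samples)) if self.samples else None
        self.samples.append(tokens)
        return baseline is not None and tokens > baseline * self.spike_factor
```

The 4k → 45k jump from the incident above would have tripped this on the first bad request.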
Safe default config snippet (YAML)

```yaml
context:
  max_prompt_tokens: 2500
  max_untrusted_chars: 8000
  summarize_when_over_budget: true
policy:
  repeat_critical_constraints_every_turn: true
metrics:
  track: ["tokens_per_request", "latency_p95", "spend_per_run"]
```
Related pages
- Foundations: Agent memory types · How LLM limits affect agents
- Failure: Budget explosion · Hallucinated sources
- Governance: Tool permissions (allowlists)
- Production stack: Production agent stack