Token Overuse Incidents (Prompt Bloat) + Fixes + Code

  • Spot the failure early before the bill climbs.
  • Learn what breaks in production and why.
  • Copy guardrails: budgets, stop reasons, validation.
  • Know when this isn’t the real root cause.
Detection signals
  • Tool calls per run spike (or the same call repeats with an identical args hash).
  • Spend or tokens per request climbs without better outputs.
  • Retries shift from rare to constant (429/5xx).
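These signals are cheap to compute from tool-call logs. A minimal sketch (function names are illustrative) that flags repeated calls by hashing the tool name plus a canonical form of its arguments:

```python
import hashlib
import json
from collections import Counter


def args_hash(tool: str, args: dict) -> str:
    """Stable hash of a tool call, for detecting repeats without storing raw args."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


def repeated_calls(calls: list[tuple[str, dict]], threshold: int = 3) -> list[str]:
    """Return hashes of calls seen at least `threshold` times in one run."""
    counts = Counter(args_hash(tool, args) for tool, args in calls)
    return [h for h, n in counts.items() if n >= threshold]
```

Alert when this list is non-empty: a loop that re-issues the same call is usually stuck, and every repeat burns tokens.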
Prompt bloat is a production incident: latency spikes, cost spikes, and truncation that drops your policy. Here’s how token overuse happens and how to budget context safely.
On this page
  1. Quick take
  2. Problem-first intro
  3. Why this fails in production
  4. 1) Context grows by default
  5. 2) Prompt bloat causes truncation
  6. 3) Tool outputs are token bombs
  7. 4) Memory makes it worse
  8. 5) You can’t fix this with “just use a bigger context model”
  9. Implementation example (real code)
  10. Example incident (numbers are illustrative)
  11. Trade-offs
  12. When NOT to use
  13. Copy-paste checklist
  14. Safe default config snippet (JSON/YAML)
  15. FAQ
  16. Related pages

Quick take

  • If you don’t cap context, it grows. Token overuse is the default.
  • Prompt bloat causes truncation (policy dies first) and latency/cost spikes.
  • Treat tool outputs (HTML/logs) as token bombs: extract → summarize → budget.
  • Use a context builder with a hard budget (chars now, tokens soon).

Problem-first intro

Everything is fine.

Then your agent gets “smarter”.

It starts including:

  • more history
  • more tool output
  • more “helpful context”

Latency creeps up. Costs creep up. And then one day policy gets truncated and the agent starts acting weird.

Token overuse isn’t a theoretical problem. It’s a production incident you can graph.

Why this fails in production

1) Context grows by default

If you don’t cap it, context grows:

  • chat history grows
  • memory grows
  • tool outputs grow (especially HTML)

The model doesn’t know your budgets. It will happily eat everything you feed it.
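One cheap default, as a sketch: cap history at the runtime level so old turns fall off instead of accumulating. The 20-turn limit here is an arbitrary illustration.

```python
from collections import deque

# A capped history: the runtime enforces the limit, not the model.
MAX_TURNS = 20
history: deque[str] = deque(maxlen=MAX_TURNS)

for i in range(100):
    history.append(f"turn {i}")

# Only the newest MAX_TURNS entries survive. Older turns must be
# summarized or retrieved explicitly, never included by default.
```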

2) Prompt bloat causes truncation

Large prompts increase the chance that:

  • the top of your prompt is dropped
  • the “system/policy” portion is the first casualty

When policy is missing, the model doesn’t suddenly become evil. It becomes unconstrained.
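A sketch of one mitigation, assuming your prompt is assembled from a policy string plus history: always include the policy verbatim and trim only history, oldest first. Names and the character budget are illustrative.

```python
def assemble_prompt(policy: str, history: list[str], max_chars: int = 12_000) -> str:
    """Always include policy verbatim; trim history from the oldest end."""
    budget = max_chars - len(policy) - 2  # reserve room for the policy + separator
    kept: list[str] = []
    used = 0
    for turn in reversed(history):  # walk newest-first
        if used + len(turn) + 2 > budget:
            break
        kept.append(turn)
        used += len(turn) + 2
    # Policy goes first and is never the part that gets dropped.
    return policy + "\n\n" + "\n\n".join(reversed(kept))
```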

3) Tool outputs are token bombs

Raw HTML, logs, and stack traces are huge. If you paste them verbatim:

  • tokens explode
  • and the model often can’t even use the content effectively

Extract → summarize → include a small slice.
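As an illustration of the extract step, a minimal HTML-to-text pass using only the standard library. The 2,000-char cap is an arbitrary example; real extraction would be smarter about which text to keep.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; drop tags, scripts, and styles."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0  # >0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_budgeted_text(html: str, max_chars: int = 2_000) -> str:
    """Extract visible text from raw HTML, then cap it before it enters context."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)[:max_chars]
```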

4) Memory makes it worse

Memory systems that “store everything” tend to “show everything”. That’s not memory. That’s prompt inflation.
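The distinction, sketched: storage can be unbounded, but what gets rendered into the prompt is budgeted. Class and method names here are illustrative.

```python
class Memory:
    """Store every event; render only a budgeted slice into the prompt."""

    def __init__(self) -> None:
        self.events: list[str] = []  # full history, kept for audit/search

    def store(self, event: str) -> None:
        self.events.append(event)  # storage is unbounded (cheap)

    def render(self, max_chars: int = 4_000) -> str:
        # Prompt rendering is bounded (expensive): newest events first.
        out: list[str] = []
        used = 0
        for e in reversed(self.events):
            if used + len(e) + 1 > max_chars:
                break
            out.append(e)
            used += len(e) + 1
        return "\n".join(reversed(out))
```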

5) You can’t fix this with “just use a bigger context model”

Bigger context windows:

  • cost more
  • are slower
  • and still truncate (just later)

The fix is a context budget and a context builder.

Diagram
Store everything, show a budgeted slice

Implementation example (real code)

This is a cheap, practical prompt budgeter:

  • caps total context
  • keeps the newest items
  • summarizes (placeholder) when over budget

Use characters here; in production you’ll want token counting.

PYTHON
from dataclasses import dataclass
from typing import Iterable

SUMMARY_CHARS = 2_000


@dataclass(frozen=True)
class ContextBudget:
    max_chars: int = 12_000  # assumed to comfortably exceed SUMMARY_CHARS


def summarize(text: str) -> str:
    # In real code: a dedicated summarization step with its own budget.
    if len(text) <= SUMMARY_CHARS:
        return text
    return text[:SUMMARY_CHARS] + "…"


def build_context(chunks: Iterable[str], *, budget: ContextBudget) -> str:
    ctx = "\n\n".join(chunks)
    if len(ctx) <= budget.max_chars:
        return ctx

    # Over budget: keep the newest text verbatim and summarize the rest.
    # This is simplistic, but it prevents unbounded growth, and the result
    # is guaranteed to fit the budget.
    keep = budget.max_chars - SUMMARY_CHARS - 3  # summary + ellipsis + join
    head, tail = ctx[:-keep], ctx[-keep:]
    return summarize(head) + "\n\n" + tail
JAVASCRIPT
const SUMMARY_CHARS = 2000;

export function summarize(text) {
  // Real code: dedicated summarization with its own budget.
  if (text.length <= SUMMARY_CHARS) return text;
  return text.slice(0, SUMMARY_CHARS) + "…";
}

export function buildContext(chunks, { maxChars = 12_000 } = {}) {
  const ctx = chunks.join("\n\n");
  if (ctx.length <= maxChars) return ctx;

  // Keep the newest text verbatim; summarize the rest.
  // The result is guaranteed to fit maxChars.
  const keep = maxChars - SUMMARY_CHARS - 3; // summary + ellipsis + join
  const head = ctx.slice(0, ctx.length - keep);
  const tail = ctx.slice(ctx.length - keep);
  return summarize(head) + "\n\n" + tail;
}

This won’t give you perfect “memory”. It gives you the one thing you need first: a budget that prevents runaway prompts.

Example incident (numbers are illustrative)

Example: an agent that answered “why did this job fail?” questions. It included the full stack trace and logs in the prompt.

It worked, until a customer pasted a 2MB log blob.

Impact:

  • tokens/request spiked from ~4k → 45k
  • p95 latency spiked from 3.2s → 19s
  • spend spiked by ~$520 in one day
  • worst part: prompts truncated policy text, leading to unsafe tool suggestions

Fix:

  1. input caps (max chars) on user-provided logs
  2. extract structured fields from logs (errors, timestamps) instead of raw dumps
  3. context budgeter + summarization tier
  4. metrics + alerts on tokens/request

Logs are useful. Raw logs are not a prompt format.
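One way to sketch step 2 of the fix, extracting error lines instead of pasting the dump. The regex and caps are illustrative; tune them for your log format.

```python
import re

# Matches lines that begin with an optional ISO-ish timestamp followed by
# an error marker. Continuation lines (e.g. traceback frames) are dropped.
ERROR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}[T ][\d:.]+)?\s*(ERROR|FATAL|Traceback).*",
    re.M,
)


def extract_log_fields(raw: str, max_lines: int = 20, max_line_chars: int = 300) -> str:
    """Pull error/traceback lines out of a raw log dump instead of pasting it all."""
    hits = [m.group(0)[:max_line_chars] for m in ERROR_RE.finditer(raw)]
    return "\n".join(hits[:max_lines])
```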

Trade-offs

  • Summaries lose details (but details in raw dumps weren’t helping anyway).
  • Strict caps can reject some “power user” requests. Offer async uploads instead.
  • Token counting adds complexity. It pays for itself quickly at scale.

When NOT to use

  • If you need exact reasoning over long documents, an agent loop may be the wrong tool. Use targeted retrieval + workflows.
  • If you can’t safely summarize untrusted text, don’t include it wholesale.
  • If you can’t measure tokens, start with char budgets today and fix tokens next.

Copy-paste checklist

  • [ ] Cap user-provided text size (logs, HTML, PDFs)
  • [ ] Cap tool output size before it enters context
  • [ ] Context builder with a hard budget (tokens/chars)
  • [ ] Summarization tier with its own budget
  • [ ] Repeat critical policy constraints every turn (survives truncation)
  • [ ] Metrics: tokens/request, latency, spend/run
  • [ ] Alerts on spikes and drift

Safe default config snippet (JSON/YAML)

YAML
context:
  max_prompt_tokens: 2500
  max_untrusted_chars: 8000
  summarize_when_over_budget: true
policy:
  repeat_critical_constraints_every_turn: true
metrics:
  track: ["tokens_per_request", "latency_p95", "spend_per_run"]
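A sketch of enforcing this config at request time, assuming it has already been parsed into a dict (function and key names mirror the snippet above; the enforcement API itself is illustrative):

```python
DEFAULTS = {
    "context": {
        "max_prompt_tokens": 2500,
        "max_untrusted_chars": 8000,
        "summarize_when_over_budget": True,
    },
}


def enforce(cfg: dict, untrusted_text: str, est_prompt_tokens: int) -> list[str]:
    """Return a list of budget violations (empty list means within budget)."""
    ctx = cfg["context"]
    violations: list[str] = []
    if len(untrusted_text) > ctx["max_untrusted_chars"]:
        violations.append("untrusted text over max_untrusted_chars")
    if est_prompt_tokens > ctx["max_prompt_tokens"]:
        violations.append("prompt over max_prompt_tokens")
    return violations
```

Reject, truncate, or summarize on violation; the point is that the decision happens before the model call, not after the bill.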

FAQ

Can’t I just buy a bigger context model?
You can, but you’ll pay in latency and cost, and you’ll still truncate eventually. Budgeting is cheaper.
How do I count tokens accurately?
Use your provider’s tokenizer. If you can’t, start with char caps and add token counting as a follow-up.
Should I store everything in memory?
Store events, yes. Show everything to the model, no. Memory ≠ prompt size.
Why repeat policy constraints?
Because truncation kills the top of the prompt. Repetition keeps constraints alive.
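If you have no tokenizer yet, a rough character-based estimate is better than nothing. The 4-chars-per-token ratio below is a crude heuristic for English text, not an exact count; swap in your provider's tokenizer when you can.

```python
def rough_token_estimate(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose.
    Replace with your provider's tokenizer for accurate counts."""
    return max(1, len(text) // 4)


def within_token_budget(text: str, max_tokens: int = 2500) -> bool:
    return rough_token_estimate(text) <= max_tokens
```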

Not sure this is your use case?

Design your agent ->
⏱️ 6 min read · Updated Mar 2026 · Difficulty: ★★☆
Implement in OnceOnly
Guardrails for loops, retries, and spend escalation.
Use in OnceOnly
YAML
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }
Integrated: production control · OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Kill switch & incident stop
  • Audit logs & traceability
  • Idempotency & dedupe
  • Tool permissions (allowlist / blocklist)
Integrated mention: OnceOnly is a control layer for production agent systems.
Example policy (concept)
# Example (Python — conceptual)
policy = {
  "budgets": {"steps": 20, "seconds": 60, "usd": 1.0},
  "controls": {"kill_switch": True, "audit": True},
}
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.