Failures & Fixes
How agents fail in the real world, and how to stop the bleeding.
- Silent Agent Drift (Quality Regression) + Detection + Code ★★☆ Agents don’t fail all at once. They drift via model/tool/prompt changes until you ship a regression to production. Canary, golden tasks, replay, and metrics catch drift early.
- Budget Explosion (When Agents Burn Money) + Fixes + Code ★★☆ Budgets don’t fail all at once. They leak via retries, prompt bloat, and tool spam. Here’s how budget explosions happen in production and how to cap spend per run.
- Cascading Tool Failures (How Agents Amplify Outages) + Code ★★☆ When tools degrade, naive retries and agent loops amplify outages. Use circuit breakers, bulkheads, and safe-mode fallbacks so your agent doesn’t DDoS your own dependencies.
- Deadlocks in Multi-Agent Systems (Failure Mode + Fixes + Code) ★★☆ Agents waiting on agents is distributed deadlock with nicer logs. Here’s how deadlocks happen in production and how leases, timeouts, and orchestration prevent them.
- Hallucinated Sources in AI Agents (Failure Mode + Fixes + Code) ★★☆ Agents will confidently cite URLs they never fetched. Here’s why it happens in production and how to enforce evidence-backed citations.
- AI Agent Infinite Loop (How to Detect + Fix, With Code) ★★☆ Your agent is looping. It’s 03:00. The bill is climbing. Here’s what causes loops, what breaks, and the kill-switches we actually use.
- Partial Outage Handling (Agent Failure + Degrade Mode + Code) ★★☆ Some tools are down, some are up. Agents that keep trying will thrash and burn budgets. Here’s how to degrade safely with partial results and clear stop reasons.
- Prompt Injection Attacks on Agents (Failure + Defenses + Code) ★★☆ Prompt injection isn’t a jailbreak. It’s untrusted text coming from tools. Here’s how agents get tricked in production and how to put policy in code.
- Tool Response Corruption (Schema Drift + Truncation) + Code ★★☆ Corrupted or drifting tool outputs turn into wrong actions. Validate outputs, enforce size limits, and fail closed so your agent doesn’t act on garbage.
- Token Overuse Incidents (Prompt Bloat) + Fixes + Code ★★☆ Prompt bloat is a production incident: latency spikes, cost spikes, and truncation that drops your policy. Here’s how token overuse happens and how to budget context safely.
- Tool Spam Loops (Agent Failure Mode + Fixes + Code) ★★☆ When an agent keeps calling the same tool over and over, you pay for it. Here’s how tool spam happens in production and how to stop it.
- Why Agents Fail in Production (And How to Prevent It) ★★☆ Most agent failures aren't mysterious. They're missing budgets, missing policy enforcement, flaky tools, and zero observability. Here's the failure taxonomy we use in production.
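
Several of the entries above (budget explosions, infinite loops, tool spam loops) point at the same kind of control: a hard per-run limit that returns a clear stop reason. As a rough illustration only, here is a minimal Python sketch of such a guard. The `RunGuard` class, its field names, and the default limits are all hypothetical and are not taken from the linked articles; real limits would depend on your workload.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunGuard:
    """Hypothetical per-run limits; the defaults are placeholders, not recommendations."""
    max_steps: int = 20
    max_cost_usd: float = 1.00
    max_repeat_calls: int = 3
    steps: int = 0
    cost_usd: float = 0.0
    call_counts: dict = field(default_factory=dict)

    def check(self, tool: str, args: str, cost_usd: float) -> Optional[str]:
        """Record one tool call; return a stop reason, or None to keep going."""
        self.steps += 1
        self.cost_usd += cost_usd
        key = (tool, args)  # identical tool + identical args counts as a repeat
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.steps > self.max_steps:
            return "max_steps"
        if self.cost_usd > self.max_cost_usd:
            return "max_cost"
        if self.call_counts[key] > self.max_repeat_calls:
            return "repeated_tool_call"
        return None


# Usage: the same call three times trips the repeat limit on the third attempt.
guard = RunGuard(max_steps=10, max_cost_usd=0.05, max_repeat_calls=2)
reasons = [guard.check("search", "q=agent drift", 0.01) for _ in range(3)]
print(reasons)
```

The point of the design is that the agent loop asks the guard before each tool call and stops with an explicit reason, rather than relying on the model to decide it is done.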