Idea In 30 Seconds
Tool mocking and fault injection let you reproduce API failures in a controlled way and verify how an agent handles them, without real network calls and without non-deterministic noise.
The core value is controlled reproduction of timeout, 5xx, or broken responses, with explicit checks for retry, fallback, and stop reason.
Problem
Without mocks and fault injection, teams usually see only happy-path behavior:
- tool responds fast;
- response is valid;
- agent finishes run without errors.
In production, this is rare. Tools can return timeouts, partial failures, empty fields, or unstable latency.
Without dedicated failure tests, this usually leads to:
- unpredictable failures in critical scenarios;
- infinite call retries;
- expensive and noisy incidents that are hard to reproduce.
When To Use
This approach is needed when an agent relies on external tools:
- payments API, CRM, search, backend services;
- tools with retry/backoff logic;
- scenarios where correct stop_reason is critical;
- fallback scenarios (for example, a backup tool or safe response).
If a tool failure can be modeled locally, it is a good candidate for a fault-injection test.
Implementation
In practice, this follows a simple rule: one failure type, one test, controlled conditions. Examples below are schematic and not tied to a specific framework.
How It Works In One Test
Short fault-injection test cycle
- Test case - one behavior to validate.
- Mock tool - lock input/output contract.
- Inject fault - apply a specific failure (timeout, 5xx, bad_payload).
- Run - execute a concrete agent step.
- Assertions - verify retry, fallback, stop_reason, and error format.
1. Lock mock-tool contract
```python
class FakePaymentsAPI:
    def __init__(self, mode: str = "ok"):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "ok":
            return {"status": "approved", "order_id": order_id}
        if self.mode == "timeout":
            raise TimeoutError("payments_timeout")
        if self.mode == "http_500":
            raise RuntimeError("payments_500")
        return {"status": None}
```
The mock should reproduce the real tool's contract as closely as possible; otherwise, tests create false confidence.
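One way to keep a fake honest is a shared contract check that runs against both the fake (in unit tests) and the real client (in a separate integration suite). A minimal sketch, with illustrative names:

```python
# Contract check shared by the fake and the real client (illustrative sketch).
# Any client passed in must satisfy the same refund() contract on success.
def check_refund_contract(client):
    result = client.refund("order-1")
    assert set(result) >= {"status", "order_id"}, "missing contract fields"
    assert result["order_id"] == "order-1"
    return True

class FakePaymentsAPI:
    def __init__(self, mode: str = "ok"):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "ok":
            return {"status": "approved", "order_id": order_id}
        raise TimeoutError("payments_timeout")

# Run against the fake here; the same check can run against the real
# client in the integration layer, so the two cannot silently diverge.
assert check_refund_contract(FakePaymentsAPI())
```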
2. Inject failures in a controlled way
```python
def test_timeout_fault_is_injected():
    payments = FakePaymentsAPI(mode="timeout")
    agent = Agent(payments_api=payments)
    result = agent.handle_refund("order-8472")
    assert result.stop_reason in {"tool_error_handled", "fallback_used"}
```
The failure profile must be explicit and repeatable: the same test should always reproduce the same failure shape.
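The Agent in these tests is assumed rather than defined. A minimal sketch of the error-handling path it would need might look like this (all class and field names are illustrative):

```python
# Minimal sketch of the assumed Agent error-handling path: catch the tool
# error, record a structured error, and set an explicit stop_reason.
from dataclasses import dataclass, field

@dataclass
class Result:
    stop_reason: str
    error: dict = field(default_factory=dict)

class Agent:
    def __init__(self, payments_api):
        self.payments_api = payments_api

    def handle_refund(self, order_id: str) -> Result:
        try:
            self.payments_api.refund(order_id)
            return Result(stop_reason="completed")
        except TimeoutError:
            # No backup tool in this sketch, so the error is handled explicitly.
            return Result(
                stop_reason="tool_error_handled",
                error={"code": "tool_error", "tool": "payments_api", "retryable": True},
            )

class TimeoutPayments:
    def refund(self, order_id):
        raise TimeoutError("payments_timeout")

result = Agent(TimeoutPayments()).handle_refund("order-8472")
assert result.stop_reason == "tool_error_handled"
```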
3. Verify retry and fallback
```python
def test_retry_then_fallback():
    payments = FlakyPaymentsAPI(fail_times=2, then="timeout")
    backup = FakeBackupTool()
    agent = Agent(payments_api=payments, backup_tool=backup, max_retries=2)
    result = agent.handle_refund("order-9001")
    assert payments.calls == 2
    assert result.selected_tool == "backup_tool"
    assert result.stop_reason == "fallback_used"
```
Verify not only that an error happened, but also the recovery policy after the error.
For retry flows, check not only the retry count, but also the conditions under which the system stops retrying and moves to fallback or failure.
For tools with side effects, verify that retries do not create duplicate operations.
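FlakyPaymentsAPI above is also assumed rather than defined. A deterministic sketch could look like this (names are illustrative; the point is that failure behavior is fixed, not random):

```python
# Deterministic flaky fake: fails a fixed number of times, then either
# recovers or keeps failing, depending on `then`. Illustrative sketch.
class FlakyPaymentsAPI:
    def __init__(self, fail_times: int, then: str = "ok"):
        self.fail_times = fail_times
        self.then = then
        self.calls = 0  # lets tests assert on retry counts

    def refund(self, order_id: str):
        self.calls += 1
        if self.calls <= self.fail_times:
            raise TimeoutError("payments_timeout")
        if self.then == "timeout":
            raise TimeoutError("payments_timeout")
        return {"status": "approved", "order_id": order_id}

# Fails twice, then recovers: the third call succeeds.
api = FlakyPaymentsAPI(fail_times=2, then="ok")
for _ in range(2):
    try:
        api.refund("order-9001")
    except TimeoutError:
        pass
assert api.refund("order-9001")["status"] == "approved"
assert api.calls == 3
```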
4. Lock error structure
```python
def test_error_envelope_is_stable():
    payments = FakePaymentsAPI(mode="http_500")
    agent = Agent(payments_api=payments)
    result = agent.handle_refund("order-1122")
    assert result.error["code"] == "tool_error"
    assert result.error["tool"] == "payments_api"
    assert result.error["retryable"] is True
```
Stable error format simplifies debugging, alerting, and regression checks.
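One common way to keep the envelope stable is to build it in a single helper, so the shape asserted in tests cannot drift between call sites. A sketch with illustrative field and constant names:

```python
# Sketch: one helper builds every tool-error envelope, so the structure
# asserted in tests is the same structure produced at every call site.
RETRYABLE_FAILURES = {"timeout", "http_500"}

def tool_error_envelope(tool: str, failure: str) -> dict:
    return {
        "code": "tool_error",
        "tool": tool,
        "failure": failure,
        "retryable": failure in RETRYABLE_FAILURES,
    }

env = tool_error_envelope("payments_api", "http_500")
assert env["code"] == "tool_error"
assert env["retryable"] is True
# A contract-breaking payload is not worth retrying in this sketch.
assert tool_error_envelope("payments_api", "bad_payload")["retryable"] is False
```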
5. Run these tests in CI
These tests should run in every PR via the standard pytest step in CI when changes touch tool logic, retries, or fallback rules.
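Marker-based selection is one way to wire this into CI; the marker name below is an assumption, not a pytest convention:

```python
# conftest.py sketch: register a custom marker so CI can select
# fault-injection tests with `pytest -m fault_injection` on every PR.
# The marker name "fault_injection" is illustrative, not a standard.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "fault_injection: deterministic tool-failure tests, run on every PR",
    )
```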
Typical Mistakes
Mock does not match real contract
The test passes, but in production the agent fails because the field structure or error code is different.
Typical cause: the mock returns an oversimplified payload that does not resemble the real API.
Testing only happy-path
Tests include only "successful response", without timeout, 5xx, and invalid payload.
Typical cause: no required-failure-profile list for each critical tool.
Random fault injection
The same test sometimes passes and sometimes fails.
Typical cause: random failures without fixed seed or unstable timeouts.
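If failures must vary across calls, derive the schedule from a fixed seed so every run reproduces the same sequence. A minimal sketch:

```python
# Sketch: a seeded, instance-local RNG makes a "random" fault schedule
# fully reproducible, so the same test always sees the same failures.
import random

def make_fault_schedule(n_calls: int, fail_rate: float, seed: int = 42):
    rng = random.Random(seed)  # local RNG: no global-state leakage
    return [rng.random() < fail_rate for _ in range(n_calls)]

# The same seed always yields the same schedule.
assert make_fault_schedule(10, 0.3) == make_fault_schedule(10, 0.3)
```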
No checks for stop_reason and error shape
Team checks only final response text, while recovery behavior remains untested.
Typical cause: missing structural assertions for stop_reason, error.code, selected_tool.
No side-effect checks during retry
Retry formally handles the failure, but creates a duplicate operation or duplicate write.
Typical cause: tests cover only stop_reason and fallback, but not idempotency of tool layer.
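An idempotency check can be sketched by having the fake record side effects per idempotency key, so a test can assert that retries do not duplicate the operation (all names are illustrative):

```python
# Sketch: the fake records side effects keyed by idempotency key, so a
# test can assert that a retry does not create a duplicate refund.
class RecordingPaymentsAPI:
    def __init__(self, fail_first: bool = True):
        self.fail_first = fail_first
        self.refunds: dict[str, dict] = {}  # side effects, by idempotency key
        self.calls = 0

    def refund(self, order_id: str, idempotency_key: str):
        self.calls += 1
        if idempotency_key in self.refunds:
            return self.refunds[idempotency_key]  # replay, no new side effect
        if self.fail_first and self.calls == 1:
            raise TimeoutError("payments_timeout")
        result = {"status": "approved", "order_id": order_id}
        self.refunds[idempotency_key] = result
        return result

api = RecordingPaymentsAPI()
key = "refund-order-9001"
try:
    api.refund("order-9001", key)  # first attempt times out
except TimeoutError:
    pass
api.refund("order-9001", key)  # retry succeeds
api.refund("order-9001", key)  # extra retry is replayed
assert len(api.refunds) == 1  # exactly one side effect despite retries
```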
Mixing unit and integration checks
A test is labeled as a unit test, but it calls a real API.
Typical cause: no boundary between local tests (mocks/fault injection) and integration layer.
Summary
- Tool mocking and fault injection validate how an agent handles tool failures.
- One failure type should be covered by one deterministic test.
- Validate not only text, but also retry, fallback, stop_reason, and error format.
- Critical fault tests should run in every PR.
FAQ
Q: Can we test failures without real API?
A: Yes. At unit level this is standard: fakes and mocks provide stable, reproducible signal.
Q: What is more important: retry or fallback?
A: Both. Retry covers short failures; fallback protects scenario when primary tool remains unavailable.
Q: How many failure profiles per tool should we have?
A: Minimum three: timeout, server error (5xx), and invalid payload.
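The three minimum profiles can be exercised in one deterministic loop. A self-contained sketch with illustrative names (a real test would use the agent under test rather than the inline handler):

```python
# Sketch: cover the three minimum failure profiles (timeout, 5xx,
# invalid payload) for one tool in a single deterministic pass.
class FakePaymentsAPI:
    def __init__(self, mode: str):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "timeout":
            raise TimeoutError("payments_timeout")
        if self.mode == "http_500":
            raise RuntimeError("payments_500")
        return {"status": None}  # bad_payload: contract-breaking response

def handle(mode: str) -> str:
    api = FakePaymentsAPI(mode)
    try:
        response = api.refund("order-1")
        if response.get("status") is None:
            return "tool_error_handled"  # invalid payload detected
        return "completed"
    except (TimeoutError, RuntimeError):
        return "tool_error_handled"

for mode in ["timeout", "http_500", "bad_payload"]:
    assert handle(mode) == "tool_error_handled"
```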
Q: Does this replace eval harness and regression?
A: No. These tests cover local tool-layer behavior. System behavior on complete scenarios is checked via eval harness and regression.
What Next
Add fault cases to Eval Harness and lock them in Golden Datasets. For version-to-version control, add Regression Testing, and analyze incidents with Replay and Debugging.