Tool Mocking and Fault Injection for AI Agents

Mock tools and inject failures to test how AI agents behave when APIs fail with errors, latency spikes, or outages.
On this page
  • Idea In 30 Seconds
  • Problem
  • When To Use
  • Implementation
  • How It Works In One Test
      1. Lock mock-tool contract
      2. Inject failures in a controlled way
      3. Verify retry and fallback
      4. Lock error structure
      5. Run these tests in CI
  • Typical Mistakes
      • Mock does not match real contract
      • Testing only happy-path
      • Random fault injection
      • No checks for stop_reason and error shape
      • No side-effect checks during retry
      • Mixing unit and integration checks
  • Summary
  • FAQ
  • What Next

Idea In 30 Seconds

Tool mocking and fault injection let you reproduce API failures in a controlled way and verify how an agent handles them, without real network calls and without non-deterministic noise.

The core value is controlled reproduction of timeouts, 5xx errors, and broken responses, with explicit checks for retry, fallback, and stop reason.

Problem

Without mocks and fault injection, teams usually see only happy-path behavior:

  • the tool responds fast;
  • the response is valid;
  • the agent finishes the run without errors.

In production, this is rare. Tools can return timeouts, partial failures, empty fields, or unstable latency.

Without dedicated failure tests, this usually leads to:

  • unpredictable failures in critical scenarios;
  • infinite retry loops;
  • expensive, noisy incidents that are hard to reproduce.

When To Use

This approach is needed when an agent relies on external tools:

  • payments API, CRM, search, backend services;
  • tools with retry/backoff logic;
  • scenarios where correct stop_reason is critical;
  • fallback scenarios (for example, backup tool or safe response).

If a tool failure can be modeled locally, it is a good candidate for a fault-injection test.

Implementation

In practice, this follows a simple rule: one failure type, one test, controlled conditions. Examples below are schematic and not tied to a specific framework.

How It Works In One Test

Short fault-injection test cycle
  • Test case - one behavior to validate.
  • Mock tool - lock input/output contract.
  • Inject fault - apply a specific failure (timeout, 5xx, bad_payload).
  • Run - execute a concrete agent step.
  • Assertions - verify retry, fallback, stop_reason, and error format.

1. Lock mock-tool contract

PYTHON
class FakePaymentsAPI:
    def __init__(self, mode: str = "ok"):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "ok":
            return {"status": "approved", "order_id": order_id}
        if self.mode == "timeout":
            raise TimeoutError("payments_timeout")
        if self.mode == "http_500":
            raise RuntimeError("payments_500")
        return {"status": None}

The mock should reproduce the real tool's contract as closely as possible; otherwise, tests create false confidence.
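One lightweight guard against contract drift is to assert the mock's success payload against a field set recorded from the real API. A minimal sketch, assuming `REAL_REFUND_FIELDS` is a fixture captured from a real response; this is not full schema validation:

```python
# Assumed fixture: field names captured from a real refund response.
REAL_REFUND_FIELDS = {"status", "order_id"}

class FakePaymentsAPI:
    def __init__(self, mode: str = "ok"):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "ok":
            return {"status": "approved", "order_id": order_id}
        if self.mode == "timeout":
            raise TimeoutError("payments_timeout")
        if self.mode == "http_500":
            raise RuntimeError("payments_500")
        return {"status": None}

def test_mock_matches_real_contract():
    payload = FakePaymentsAPI(mode="ok").refund("order-1")
    # If the real API adds or renames fields, this forces the mock to follow.
    assert set(payload) == REAL_REFUND_FIELDS
```

When the real API changes, the recorded field set is updated first, and this test fails until the mock catches up.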

2. Inject failures in a controlled way

PYTHON
def test_timeout_fault_is_injected():
    payments = FakePaymentsAPI(mode="timeout")
    agent = Agent(payments_api=payments)

    result = agent.handle_refund("order-8472")

    assert result.stop_reason in {"tool_error_handled", "fallback_used"}

The failure profile must be explicit and repeatable: the same test should always reproduce the same failure shape.
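The same idea extends to a fixed list of failure profiles per tool, so no profile is forgotten. The `Agent` below is a minimal hypothetical stand-in for the article's schematic agent, written only so the loop is runnable; a real agent will differ:

```python
# Explicit failure profiles every critical tool must be tested against.
FAILURE_PROFILES = ["timeout", "http_500", "bad_payload"]

class FakePaymentsAPI:
    def __init__(self, mode: str):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "timeout":
            raise TimeoutError("payments_timeout")
        if self.mode == "http_500":
            raise RuntimeError("payments_500")
        return {"status": None}  # bad_payload: contract violated

class Result:
    def __init__(self, stop_reason: str):
        self.stop_reason = stop_reason

class Agent:
    """Hypothetical stand-in: catches tool errors and rejects bad payloads."""
    def __init__(self, payments_api):
        self.payments_api = payments_api

    def handle_refund(self, order_id: str) -> Result:
        try:
            payload = self.payments_api.refund(order_id)
        except (TimeoutError, RuntimeError):
            return Result("tool_error_handled")
        if payload.get("status") not in {"approved", "declined"}:
            return Result("tool_error_handled")
        return Result("ok")

def test_every_failure_profile_is_handled():
    for mode in FAILURE_PROFILES:
        result = Agent(FakePaymentsAPI(mode)).handle_refund("order-8472")
        assert result.stop_reason == "tool_error_handled", mode
```

With `pytest`, the loop would typically become `@pytest.mark.parametrize` so each profile reports as its own test case.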

3. Verify retry and fallback

PYTHON
def test_retry_then_fallback():
    payments = FlakyPaymentsAPI(fail_times=2, then="timeout")
    backup = FakeBackupTool()
    agent = Agent(payments_api=payments, backup_tool=backup, max_retries=2)

    result = agent.handle_refund("order-9001")

    assert payments.calls == 2
    assert result.selected_tool == "backup_tool"
    assert result.stop_reason == "fallback_used"

You should verify not only that an error happened, but also the recovery policy that follows it.

For retry flows, verify not only the retry count, but also the conditions under which the system stops retrying and moves to fallback or failure.

For tools with side effects, verify that retries do not create duplicate operations.
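`FlakyPaymentsAPI` is not defined in this article; one possible sketch is shown below, with a call counter and a side-effect log so the duplicate-operation check is concrete. The retry loop stands in for the agent's retry policy:

```python
class FlakyPaymentsAPI:
    """Fails the first `fail_times` calls, counting every attempt and
    every side effect, so tests can assert retries AND idempotency."""
    def __init__(self, fail_times: int, then: str = "timeout"):
        self.fail_times = fail_times
        self.failure = then
        self.calls = 0
        self.refunds_issued = []  # side-effect log

    def refund(self, order_id: str):
        self.calls += 1
        if self.calls <= self.fail_times:
            if self.failure == "timeout":
                raise TimeoutError("payments_timeout")
            raise RuntimeError("payments_500")
        self.refunds_issued.append(order_id)  # the real side effect
        return {"status": "approved", "order_id": order_id}

def test_retry_does_not_duplicate_side_effects():
    payments = FlakyPaymentsAPI(fail_times=1)
    # Schematic retry loop standing in for the agent's retry policy.
    for _ in range(3):
        try:
            payments.refund("order-9001")
            break
        except (TimeoutError, RuntimeError):
            continue
    assert payments.calls == 2
    assert payments.refunds_issued == ["order-9001"]  # exactly one refund
```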

4. Lock error structure

PYTHON
def test_error_envelope_is_stable():
    payments = FakePaymentsAPI(mode="http_500")
    agent = Agent(payments_api=payments)
    result = agent.handle_refund("order-1122")

    assert result.error["code"] == "tool_error"
    assert result.error["tool"] == "payments_api"
    assert result.error["retryable"] is True

Stable error format simplifies debugging, alerting, and regression checks.
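One way to keep the envelope stable is to build it in exactly one place. `to_error_envelope` and the `RETRYABLE` set below are hypothetical names; the point is that every tool exception is normalized into the same shape before assertions, alerting, or logging see it:

```python
# Assumption for illustration: which error reasons are safe to retry.
RETRYABLE = {"payments_timeout", "payments_500"}

def to_error_envelope(tool: str, exc: Exception) -> dict:
    """Convert any tool exception into one stable error shape, so tests
    and alerting never depend on raw exception classes."""
    reason = str(exc)
    return {
        "code": "tool_error",
        "tool": tool,
        "reason": reason,
        "retryable": reason in RETRYABLE,
    }

envelope = to_error_envelope("payments_api", RuntimeError("payments_500"))
assert envelope["code"] == "tool_error"
assert envelope["tool"] == "payments_api"
assert envelope["retryable"] is True
```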

5. Run these tests in CI

Run these tests on every PR, via the standard pytest step in CI, whenever changes touch tool logic, retries, or fallback rules.

Typical Mistakes

Mock does not match real contract

The test passes, but in production the agent fails because the field structure or error code differs.

Typical cause: the mock returns an oversimplified payload that does not resemble the real API.

Testing only happy-path

Tests cover only the "successful response" case, without timeout, 5xx, or invalid payload.

Typical cause: no required list of failure profiles for each critical tool.

Random fault injection

The same test sometimes passes and sometimes fails.

Typical cause: random failures without a fixed seed, or unstable timeouts.
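A deterministic alternative to ad-hoc randomness is a per-instance RNG with a fixed seed: the fault pattern still looks random, but the same seed always produces the same sequence. The `ChaosPaymentsAPI` name is illustrative:

```python
import random

class ChaosPaymentsAPI:
    """Injects faults pseudo-randomly but deterministically: the same
    seed always yields the same fault sequence, so tests never flake."""
    def __init__(self, fail_rate: float, seed: int):
        self.rng = random.Random(seed)  # per-instance RNG, no global state
        self.fail_rate = fail_rate

    def refund(self, order_id: str):
        if self.rng.random() < self.fail_rate:
            raise TimeoutError("payments_timeout")
        return {"status": "approved", "order_id": order_id}

def fault_sequence(seed: int, n: int = 10) -> list:
    api = ChaosPaymentsAPI(fail_rate=0.5, seed=seed)
    outcomes = []
    for i in range(n):
        try:
            api.refund(f"order-{i}")
            outcomes.append("ok")
        except TimeoutError:
            outcomes.append("timeout")
    return outcomes

# Same seed, same failure pattern, run after run.
assert fault_sequence(seed=42) == fault_sequence(seed=42)
```

Note the seeded `random.Random` instance instead of the module-level `random` functions: shared global state is exactly what makes such tests flaky.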

No checks for stop_reason and error shape

The team checks only the final response text, while recovery behavior remains untested.

Typical cause: missing structural assertions for stop_reason, error.code, and selected_tool.

No side-effect checks during retry

The retry formally handles the failure, but creates a duplicate operation or duplicate write.

Typical cause: tests cover only stop_reason and fallback, not the idempotency of the tool layer.

Mixing unit and integration checks

The test is labeled a unit test, but it calls a real API.

Typical cause: no boundary between local tests (mocks and fault injection) and the integration layer.
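One way to enforce that boundary is to make any real network call fail loudly at the unit layer. The context manager below is a sketch of the idea (plugins such as pytest-socket implement it more completely, typically as an autouse fixture):

```python
import socket
from contextlib import contextmanager

@contextmanager
def no_network():
    """Block outgoing connections while active, so any real API call
    inside a unit test fails immediately instead of hitting the network."""
    original = socket.socket.connect

    def blocked(self, *args, **kwargs):
        raise RuntimeError("network call attempted in unit test")

    socket.socket.connect = blocked
    try:
        yield
    finally:
        socket.socket.connect = original

def test_unit_layer_cannot_reach_real_api():
    with no_network():
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect(("127.0.0.1", 9))  # any address: connect is blocked
            reached = True
        except RuntimeError:
            reached = False
        finally:
            sock.close()
    assert reached is False
```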

Summary

Quick take
  • Tool mocking and fault injection validate how an agent handles tool failures.
  • One failure type should be covered by one deterministic test.
  • Validate not only the response text, but also retry, fallback, stop_reason, and error format.
  • Critical fault tests should run in every PR.

FAQ

Q: Can we test failures without real API?
A: Yes. At the unit level this is standard: fakes and mocks provide a stable, reproducible signal.

Q: What is more important: retry or fallback?
A: Both. Retry covers short transient failures; fallback covers the case where the primary tool stays unavailable.

Q: How many failure profiles per tool should we have?
A: Minimum three: timeout, server error (5xx), and invalid payload.

Q: Does this replace eval harness and regression?
A: No. These tests cover local tool-layer behavior; end-to-end behavior on complete scenarios is checked via the eval harness and regression tests.

What Next

Add fault cases to Eval Harness and lock them in Golden Datasets. For version-to-version control, add Regression Testing, and analyze incidents with Replay and Debugging.

⏱️ 5 min read • Updated March 13, 2026 • Difficulty: ★★☆
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.