Tool Mocking and Fault Injection for AI Agents

Mock tools and inject failures to test how AI agents behave when APIs fail with errors, latency spikes, or outages.
On this page
  • Idea In 30 Seconds
  • Problem
  • When To Use
  • Implementation
  • How It Works In One Test
      1. Lock mock-tool contract
      2. Inject failures in a controlled way
      3. Verify retry and fallback
      4. Lock error structure
      5. Run these tests in CI
  • Typical Mistakes
      • Mock does not match real contract
      • Testing only happy-path
      • Random fault injection
      • No checks for stop_reason and error shape
      • No side-effect checks during retry
      • Mixing unit and integration checks
  • Summary
  • FAQ
  • What Next

Idea In 30 Seconds

Tool mocking and fault injection let you reproduce API failures in a controlled way and verify how an agent handles them, without real network calls and without non-deterministic noise.

The core value is controlled reproduction of timeouts, 5xx errors, and broken responses, with explicit checks for retry, fallback, and stop reason.

Problem

Without mocks and fault injection, teams usually see only happy-path behavior:

  • the tool responds fast;
  • the response is valid;
  • the agent finishes the run without errors.

In production, this is rare. Tools can return timeouts, partial failures, empty fields, or unstable latency.

Without dedicated failure tests, this usually leads to:

  • unpredictable failures in critical scenarios;
  • infinite retry loops;
  • expensive, noisy incidents that are hard to reproduce.

When To Use

This approach is needed when an agent relies on external tools:

  • payments API, CRM, search, backend services;
  • tools with retry/backoff logic;
  • scenarios where correct stop_reason is critical;
  • fallback scenarios (for example, backup tool or safe response).

If a tool failure can be modeled locally, it is a good candidate for a fault-injection test.

Implementation

In practice, this follows a simple rule: one failure type, one test, controlled conditions. Examples below are schematic and not tied to a specific framework.

How It Works In One Test

Short fault-injection test cycle
  • Test case - one behavior to validate.
  • Mock tool - lock input/output contract.
  • Inject fault - apply a specific failure (timeout, 5xx, bad_payload).
  • Run - execute a concrete agent step.
  • Assertions - verify retry, fallback, stop_reason, and error format.

1. Lock mock-tool contract

PYTHON
class FakePaymentsAPI:
    def __init__(self, mode: str = "ok"):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "ok":
            return {"status": "approved", "order_id": order_id}
        if self.mode == "timeout":
            raise TimeoutError("payments_timeout")
        if self.mode == "http_500":
            raise RuntimeError("payments_500")
        return {"status": None}

The mock should reproduce the real tool's contract as closely as possible; otherwise, tests create false confidence.
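One lightweight guard against contract drift is to assert the mock's success payload against a field set recorded from the real API. A minimal sketch, assuming `REAL_REFUND_FIELDS` is a fixture captured from a real response; this is not full schema validation:

```python
# Assumed fixture: field names captured from a real refund response.
REAL_REFUND_FIELDS = {"status", "order_id"}

class FakePaymentsAPI:
    def __init__(self, mode: str = "ok"):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "ok":
            return {"status": "approved", "order_id": order_id}
        if self.mode == "timeout":
            raise TimeoutError("payments_timeout")
        if self.mode == "http_500":
            raise RuntimeError("payments_500")
        return {"status": None}

def test_mock_matches_real_contract():
    payload = FakePaymentsAPI(mode="ok").refund("order-1")
    # If the real API adds or renames fields, this forces the mock to follow.
    assert set(payload) == REAL_REFUND_FIELDS
```

When the real API changes, the recorded field set is updated first, and this test fails until the mock catches up.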

2. Inject failures in a controlled way

PYTHON
def test_timeout_fault_is_injected():
    payments = FakePaymentsAPI(mode="timeout")
    agent = Agent(payments_api=payments)

    result = agent.handle_refund("order-8472")

    assert result.stop_reason in {"tool_error_handled", "fallback_used"}

The failure profile must be explicit and repeatable: the same test should always reproduce the same failure shape.
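The same idea extends to a fixed list of failure profiles per tool, so no profile is forgotten. The `Agent` below is a minimal hypothetical stand-in for the article's schematic agent, written only so the loop is runnable; a real agent will differ:

```python
# Explicit failure profiles every critical tool must be tested against.
FAILURE_PROFILES = ["timeout", "http_500", "bad_payload"]

class FakePaymentsAPI:
    def __init__(self, mode: str):
        self.mode = mode

    def refund(self, order_id: str):
        if self.mode == "timeout":
            raise TimeoutError("payments_timeout")
        if self.mode == "http_500":
            raise RuntimeError("payments_500")
        return {"status": None}  # bad_payload: contract violated

class Result:
    def __init__(self, stop_reason: str):
        self.stop_reason = stop_reason

class Agent:
    """Hypothetical stand-in: catches tool errors and rejects bad payloads."""
    def __init__(self, payments_api):
        self.payments_api = payments_api

    def handle_refund(self, order_id: str) -> Result:
        try:
            payload = self.payments_api.refund(order_id)
        except (TimeoutError, RuntimeError):
            return Result("tool_error_handled")
        if payload.get("status") not in {"approved", "declined"}:
            return Result("tool_error_handled")
        return Result("ok")

def test_every_failure_profile_is_handled():
    for mode in FAILURE_PROFILES:
        result = Agent(FakePaymentsAPI(mode)).handle_refund("order-8472")
        assert result.stop_reason == "tool_error_handled", mode
```

With `pytest`, the loop would typically become `@pytest.mark.parametrize` so each profile reports as its own test case.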

3. Verify retry and fallback

PYTHON
def test_retry_then_fallback():
    payments = FlakyPaymentsAPI(fail_times=2, then="timeout")
    backup = FakeBackupTool()
    agent = Agent(payments_api=payments, backup_tool=backup, max_retries=2)

    result = agent.handle_refund("order-9001")

    assert payments.calls == 2
    assert result.selected_tool == "backup_tool"
    assert result.stop_reason == "fallback_used"

You should verify not only that an error happened, but also the recovery policy that follows it.

For retry flows, verify not only the retry count, but also the conditions under which the system stops retrying and moves to fallback or failure.

For tools with side effects, verify that retries do not create duplicate operations.
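`FlakyPaymentsAPI` is not defined in this article; one possible sketch is shown below, with a call counter and a side-effect log so the duplicate-operation check is concrete. The retry loop stands in for the agent's retry policy:

```python
class FlakyPaymentsAPI:
    """Fails the first `fail_times` calls, counting every attempt and
    every side effect, so tests can assert retries AND idempotency."""
    def __init__(self, fail_times: int, then: str = "timeout"):
        self.fail_times = fail_times
        self.failure = then
        self.calls = 0
        self.refunds_issued = []  # side-effect log

    def refund(self, order_id: str):
        self.calls += 1
        if self.calls <= self.fail_times:
            if self.failure == "timeout":
                raise TimeoutError("payments_timeout")
            raise RuntimeError("payments_500")
        self.refunds_issued.append(order_id)  # the real side effect
        return {"status": "approved", "order_id": order_id}

def test_retry_does_not_duplicate_side_effects():
    payments = FlakyPaymentsAPI(fail_times=1)
    # Schematic retry loop standing in for the agent's retry policy.
    for _ in range(3):
        try:
            payments.refund("order-9001")
            break
        except (TimeoutError, RuntimeError):
            continue
    assert payments.calls == 2
    assert payments.refunds_issued == ["order-9001"]  # exactly one refund
```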

4. Lock error structure

PYTHON
def test_error_envelope_is_stable():
    payments = FakePaymentsAPI(mode="http_500")
    agent = Agent(payments_api=payments)
    result = agent.handle_refund("order-1122")

    assert result.error["code"] == "tool_error"
    assert result.error["tool"] == "payments_api"
    assert result.error["retryable"] is True

Stable error format simplifies debugging, alerting, and regression checks.
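One way to keep the envelope stable is to build it in exactly one place. `to_error_envelope` and the `RETRYABLE` set below are hypothetical names; the point is that every tool exception is normalized into the same shape before assertions, alerting, or logging see it:

```python
# Assumption for illustration: which error reasons are safe to retry.
RETRYABLE = {"payments_timeout", "payments_500"}

def to_error_envelope(tool: str, exc: Exception) -> dict:
    """Convert any tool exception into one stable error shape, so tests
    and alerting never depend on raw exception classes."""
    reason = str(exc)
    return {
        "code": "tool_error",
        "tool": tool,
        "reason": reason,
        "retryable": reason in RETRYABLE,
    }

envelope = to_error_envelope("payments_api", RuntimeError("payments_500"))
assert envelope["code"] == "tool_error"
assert envelope["tool"] == "payments_api"
assert envelope["retryable"] is True
```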

5. Run these tests in CI

Run these tests on every PR, via the standard pytest step in CI, whenever changes touch tool logic, retries, or fallback rules.

Typical Mistakes

Mock does not match real contract

The test passes, but in production the agent fails because the field structure or error code differs.

Typical cause: the mock returns an oversimplified payload that does not resemble the real API.

Testing only happy-path

Tests cover only the "successful response" case, without timeout, 5xx, or invalid payload.

Typical cause: no required list of failure profiles for each critical tool.

Random fault injection

The same test sometimes passes and sometimes fails.

Typical cause: random failures without a fixed seed, or unstable timeouts.
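A deterministic alternative to ad-hoc randomness is a per-instance RNG with a fixed seed: the fault pattern still looks random, but the same seed always produces the same sequence. The `ChaosPaymentsAPI` name is illustrative:

```python
import random

class ChaosPaymentsAPI:
    """Injects faults pseudo-randomly but deterministically: the same
    seed always yields the same fault sequence, so tests never flake."""
    def __init__(self, fail_rate: float, seed: int):
        self.rng = random.Random(seed)  # per-instance RNG, no global state
        self.fail_rate = fail_rate

    def refund(self, order_id: str):
        if self.rng.random() < self.fail_rate:
            raise TimeoutError("payments_timeout")
        return {"status": "approved", "order_id": order_id}

def fault_sequence(seed: int, n: int = 10) -> list:
    api = ChaosPaymentsAPI(fail_rate=0.5, seed=seed)
    outcomes = []
    for i in range(n):
        try:
            api.refund(f"order-{i}")
            outcomes.append("ok")
        except TimeoutError:
            outcomes.append("timeout")
    return outcomes

# Same seed, same failure pattern, run after run.
assert fault_sequence(seed=42) == fault_sequence(seed=42)
```

Note the seeded `random.Random` instance instead of the module-level `random` functions: shared global state is exactly what makes such tests flaky.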

No checks for stop_reason and error shape

The team checks only the final response text, while recovery behavior remains untested.

Typical cause: missing structural assertions for stop_reason, error.code, and selected_tool.

No side-effect checks during retry

The retry formally handles the failure, but creates a duplicate operation or duplicate write.

Typical cause: tests cover only stop_reason and fallback, not the idempotency of the tool layer.

Mixing unit and integration checks

The test is labeled a unit test, but it calls a real API.

Typical cause: no boundary between local tests (mocks and fault injection) and the integration layer.
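One way to enforce that boundary is to make any real network call fail loudly at the unit layer. The context manager below is a sketch of the idea (plugins such as pytest-socket implement it more completely, typically as an autouse fixture):

```python
import socket
from contextlib import contextmanager

@contextmanager
def no_network():
    """Block outgoing connections while active, so any real API call
    inside a unit test fails immediately instead of hitting the network."""
    original = socket.socket.connect

    def blocked(self, *args, **kwargs):
        raise RuntimeError("network call attempted in unit test")

    socket.socket.connect = blocked
    try:
        yield
    finally:
        socket.socket.connect = original

def test_unit_layer_cannot_reach_real_api():
    with no_network():
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect(("127.0.0.1", 9))  # any address: connect is blocked
            reached = True
        except RuntimeError:
            reached = False
        finally:
            sock.close()
    assert reached is False
```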

Summary

Quick take
  • Tool mocking and fault injection validate how an agent handles tool failures.
  • One failure type should be covered by one deterministic test.
  • Validate not only the response text, but also retry, fallback, stop_reason, and error format.
  • Critical fault tests should run in every PR.

FAQ

Q: Can we test failures without real API?
A: Yes. At the unit level this is standard: fakes and mocks provide a stable, reproducible signal.

Q: What is more important: retry or fallback?
A: Both. Retry covers short transient failures; fallback covers the case where the primary tool stays unavailable.

Q: How many failure profiles per tool should we have?
A: Minimum three: timeout, server error (5xx), and invalid payload.

Q: Does this replace eval harness and regression?
A: No. These tests cover local tool-layer behavior; end-to-end behavior on complete scenarios is checked via the eval harness and regression tests.

What Next

Add fault cases to Eval Harness and lock them in Golden Datasets. For version-to-version control, add Regression Testing, and analyze incidents with Replay and Debugging.

⏱️ 5 min read • Updated March 13, 2026 • Difficulty: ★★☆
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.