Unit Testing AI Agents (Deterministic, Cheap, and Actually Useful)

Unit testing AI agents is how you catch tool spam, budget bugs, and stop-reason regressions before production. Includes Python + JS examples.
On this page
  1. Problem (what breaks first)
  2. Why this fails in production
  3. Diagram: what a unit-testable agent looks like
  4. Real code: a unit-testable loop (Python + JS)
  5. Real failure (the one that hurts)
  6. Trade-offs
  7. When NOT to use this
  8. Copy-paste checklist
  9. Safe default config snippet (YAML)
  10. FAQ

Problem (what breaks first)

Your agent “works” in dev.

Then you change something boring:

  • a tool schema key
  • retry/backoff defaults
  • the stop condition
  • the model version

And suddenly production looks like:

  • 3× more tool calls
  • 2× cost overnight
  • runs that never reach a "finish" action and just stop on budget

If you can’t reproduce a run deterministically, you don’t have a bug — you have archaeology.

Why this fails in production

Agents fail differently than normal code because they’re driven by:

  • a probabilistic planner (the model)
  • a runtime loop (your orchestration)
  • side effects (tools)
  • external instability (429/5xx, partial responses, timeouts)

Most teams “test” the prompt. That’s not enough. You need to test the loop contract:

  • inputs → actions → tool calls → trace → stop_reason

If stop reasons and traces aren’t stable, nothing else is.
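That contract can be pinned down in code. A minimal sketch: the stop-reason names match the loop in this article, and `assert_loop_contract` is an illustrative helper, not a library API.

```python
# A minimal sketch of the loop contract as executable assertions.
from typing import Any, Dict

# Pin the stop_reason taxonomy so it cannot drift silently.
STOP_REASONS = frozenset({"finish", "invalid_action", "max_tool_calls", "max_steps"})


def assert_loop_contract(result: Dict[str, Any]) -> None:
    """Assert the invariant parts of any run: stop_reason and trace shape."""
    assert result["stop_reason"] in STOP_REASONS, result["stop_reason"]
    assert isinstance(result["trace"], list)
    for entry in result["trace"]:
        # every trace entry must be attributable to a loop step
        assert "step" in entry
```

Call this helper at the end of every agent test, so a new or renamed stop reason fails loudly instead of drifting.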

Diagram: what a unit-testable agent looks like

  task ──▶ [ run_agent loop ] ──▶ { output, trace, stop_reason }
                │
                ├──▶ llm.next_action(state)   ← injected (faked in unit tests)
                └──▶ tools.call(name, args)   ← injected (faked in unit tests)

Real code: a unit-testable loop (Python + JS)

The trick is boring: dependency injection. Your loop accepts two things it can’t control:

  • llm.next_action(...)
  • tools.call(...)

Everything else should be deterministic and asserted.

PYTHON
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


@dataclass(frozen=True)
class Budget:
    max_steps: int = 10
    max_tool_calls: int = 10


class LLM(Protocol):
    def next_action(self, state: Dict[str, Any]) -> Dict[str, Any]: ...


class Tools(Protocol):
    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]: ...


def run_agent(task: str, *, llm: LLM, tools: Tools, budget: Budget) -> Dict[str, Any]:
    trace: List[Dict[str, Any]] = []
    tool_calls = 0
    state: Dict[str, Any] = {"task": task, "notes": []}

    for step in range(budget.max_steps):
        action = llm.next_action(state)
        trace.append({"step": step, "action": action})

        if action.get("type") == "finish":
            return {"output": action.get("answer", ""), "trace": trace, "stop_reason": "finish"}

        if action.get("type") != "tool":
            return {"output": "", "trace": trace, "stop_reason": "invalid_action"}

        tool_calls += 1
        if tool_calls > budget.max_tool_calls:
            return {"output": "", "trace": trace, "stop_reason": "max_tool_calls"}

        obs = tools.call(action["tool"], action.get("args", {}))
        trace.append({"step": step, "observation": obs, "tool": action["tool"]})
        state["notes"].append(obs)

    return {"output": "", "trace": trace, "stop_reason": "max_steps"}


# --- unit test (pytest style) ---
class FakeLLM:
    def __init__(self):
        self.n = 0

    def next_action(self, state):
        self.n += 1
        if self.n == 1:
            return {"type": "tool", "tool": "http.get", "args": {"url": "https://example.com"}}
        return {"type": "finish", "answer": "ok"}


class FakeTools:
    def __init__(self):
        self.calls = []

    def call(self, name, args):
        self.calls.append((name, args))
        return {"ok": True, "status": 200, "body": "hello"}


def test_unit_loop_contract():
    out = run_agent(
        "fetch once and finish",
        llm=FakeLLM(),
        tools=FakeTools(),
        budget=Budget(max_steps=5, max_tool_calls=3),
    )

    assert out["stop_reason"] == "finish"
    assert len(out["trace"]) >= 2
JAVASCRIPT
export function runAgent(task, { llm, tools, budget }) {
  const trace = [];
  let toolCalls = 0;
  const state = { task, notes: [] };

  for (let step = 0; step < budget.maxSteps; step++) {
    const action = llm.nextAction(state);
    trace.push({ step, action });

    if (action?.type === "finish") {
      return { output: action.answer ?? "", trace, stop_reason: "finish" };
    }

    if (action?.type !== "tool") {
      return { output: "", trace, stop_reason: "invalid_action" };
    }

    toolCalls += 1;
    if (toolCalls > budget.maxToolCalls) {
      return { output: "", trace, stop_reason: "max_tool_calls" };
    }

    const obs = tools.call(action.tool, action.args || {});
    trace.push({ step, tool: action.tool, observation: obs });
    state.notes.push(obs);
  }

  return { output: "", trace, stop_reason: "max_steps" };
}

// --- unit test (jest style) ---
test("unit loop contract", () => {
  const llm = {
    n: 0,
    nextAction() {
      this.n += 1;
      if (this.n === 1) return { type: "tool", tool: "http.get", args: { url: "https://example.com" } };
      return { type: "finish", answer: "ok" };
    },
  };
  const tools = {
    calls: [],
    call(name, args) {
      this.calls.push([name, args]);
      return { ok: true, status: 200 };
    },
  };

  const out = runAgent("fetch once and finish", {
    llm,
    tools,
    budget: { maxSteps: 5, maxToolCalls: 3 },
  });

  expect(out.stop_reason).toBe("finish");
  expect(tools.calls.length).toBe(1);
});

Real failure (the one that hurts)

We once changed a retry default in a shared tool wrapper. Nothing “crashed”.

But tool calls doubled on a busy route:

  • average tool calls/run: 8 → 16
  • cost impact: +~$900/day (tokens + tool credits)
  • on-call time: ~3 hours to prove it wasn’t “the model being weird”

The fix wasn’t a better prompt. The fix was a unit test that asserted:

  • max tool calls per run stays within a bound
  • stop_reason taxonomy doesn’t drift
  • the tool gateway doesn’t retry twice (agent retries + tool retries = storm)
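The third invariant is testable without any real network. A hedged sketch, assuming a generic retry wrapper; `retrying_call` is an illustrative name, not a specific library's API:

```python
# Regression-test sketch for the retry storm: agent-level retries and
# gateway-level retries multiply, so pin the product with a fake tool.

def retrying_call(fn, *, retries: int):
    """Call fn, retrying on exception up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise


def test_no_retry_amplification():
    attempts = {"n": 0}

    def flaky_tool():
        attempts["n"] += 1
        raise TimeoutError("simulated 5xx")

    agent_retries, gateway_retries = 1, 1
    for _ in range(agent_retries + 1):
        try:
            retrying_call(flaky_tool, retries=gateway_retries)
        except TimeoutError:
            pass

    # 2 agent attempts x 2 gateway attempts = 4 total calls.
    # If a shared default changes, this assertion fails before production does.
    assert attempts["n"] == 4
```

The test encodes the multiplication explicitly, so anyone touching a retry default sees the blast radius in the diff.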

Trade-offs

  • Unit tests won’t prove the model is “smart”. They prove your loop is safe.
  • Deterministic stubs can hide real-world tool flakiness (that’s what replay tests are for).
  • You’ll write more boring code. You’ll also page less.

When NOT to use this

Don’t unit test “prompt quality” as if it’s a deterministic function. If the goal is style/tone, use sampling + evals.

Do unit test:

  • budgets
  • tool allowlists
  • stop reasons
  • action schema validation
  • idempotency behavior
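A sketch of what one of those "bad path" tests can look like: an allowlist gate in front of the tools interface. `AllowlistTools` and `ToolNotAllowed` are illustrative names, not a specific framework's API.

```python
# An allowlist gate wrapped around any tools object.

class ToolNotAllowed(Exception):
    pass


class AllowlistTools:
    """Rejects tool calls outside the configured allowlist."""

    def __init__(self, inner, allow):
        self.inner = inner
        self.allow = set(allow)

    def call(self, name, args):
        if name not in self.allow:
            raise ToolNotAllowed(name)
        return self.inner.call(name, args)


class EchoTools:
    def call(self, name, args):
        return {"ok": True, "name": name}


def test_allowlist_blocks_unknown_tool():
    tools = AllowlistTools(EchoTools(), allow={"http.get"})
    assert tools.call("http.get", {})["ok"] is True
    try:
        tools.call("shell.exec", {"cmd": "whoami"})
        assert False, "expected ToolNotAllowed"
    except ToolNotAllowed:
        pass
```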

Copy-paste checklist

  • [ ] Inject llm and tools as interfaces (no globals).
  • [ ] Assert on stop_reason for every test.
  • [ ] Assert tool calls: count + sequence + args hash (not raw args if sensitive).
  • [ ] Test “bad paths”: invalid action, tool error, budget stop.
  • [ ] One golden test per production incident (yes, really).
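For the "args hash" item, one minimal approach is to hash the canonical JSON of the tool args, so tests can pin the call shape without storing sensitive values verbatim. A sketch:

```python
# Hash tool args canonically so assertions survive dict-ordering changes
# but fail on any real change to the call shape.
import hashlib
import json


def args_hash(args: dict) -> str:
    # sort_keys makes the hash independent of dict insertion order
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

In a test, pin the hash once from a known-good run and assert it on every subsequent run.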

Safe default config snippet (YAML)

YAML
agent_tests:
  budgets:
    max_steps: 25
    max_tool_calls: 12
  invariants:
    stop_reason_required: true
    action_schema_strict: true
    tool_allowlist_required: true
  golden_tasks:
    - id: "fetch_once"
      task: "Fetch https://example.com and summarize in 3 bullets."
      expect_stop_reason: "finish"
      max_tool_calls: 2
  replay:
    enabled: true
    mode: "record_then_replay"
    store: ".agent-replays/"
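The `record_then_replay` mode above can be approximated with a small wrapper around the tools interface. This is an illustrative in-memory sketch, not a specific library's API; a real implementation would persist fixtures under the configured store directory.

```python
# Record mode captures real observations keyed by (tool, canonical args);
# replay mode returns them deterministically with no network access.
import json


class ReplayTools:
    def __init__(self, inner=None, fixtures=None):
        self.inner = inner                    # real tools (record mode)
        self.fixtures = dict(fixtures or {})  # key -> recorded observation

    @staticmethod
    def _key(name, args):
        return name + ":" + json.dumps(args, sort_keys=True)

    def call(self, name, args):
        key = self._key(name, args)
        if key in self.fixtures:
            return self.fixtures[key]         # replay: deterministic
        if self.inner is None:
            raise KeyError(f"no recorded fixture for {key}")
        obs = self.inner.call(name, args)     # record: capture real output
        self.fixtures[key] = obs
        return obs
```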

FAQ

Q: Isn’t this just mocking the model?
A: Yes — on purpose. Unit tests are for your loop contract: budgets, tool gateway behavior, stop reasons, and trace shape.

Q: What should I assert on?
A: Stop reason, tool call count, tool allowlist decisions, and trace shape. Don’t assert on the exact prose output.

Q: How do I test tool flakiness?
A: Record/replay fixtures (or sandboxed integration tests). Unit tests should stay deterministic.

Q: Do I need evals if I have unit tests?
A: Yes. Unit tests stop incidents. Evals catch quality drift. They’re different failure classes.

⏱ 6 min read · Updated March 2026 · Difficulty: ★★☆
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.