Unit Testing AI Agents (Deterministic, Cheap, and Actually Useful)

Unit testing AI agents is how you catch tool spam, budget bugs, and stop-reason regressions before production. Includes Python + JS examples.
On this page
  1. Problem (what breaks first)
  2. Why this fails in production
  3. Diagram: what a unit-testable agent looks like
  4. Real code: a unit-testable loop (Python + JS)
  5. Real failure (the one that hurts)
  6. Trade-offs
  7. When NOT to use this
  8. Copy-paste checklist
  9. Safe default config snippet (YAML)
  10. FAQ

Problem (what breaks first)

Your agent “works” in dev.

Then you change something boring:

  • a tool schema key
  • retry/backoff defaults
  • the stop condition
  • the model version

And suddenly production looks like:

  • 3× more tool calls
  • 2× cost overnight
  • runs that never reach a "finish" action and just stop on budget

If you can’t reproduce a run deterministically, you don’t have a bug — you have archaeology.

Why this fails in production

Agents fail differently than normal code because they’re driven by:

  • a probabilistic planner (the model)
  • a runtime loop (your orchestration)
  • side effects (tools)
  • external instability (429/5xx, partial responses, timeouts)

Most teams “test” the prompt. That’s not enough. You need to test the loop contract:

  • inputs → actions → tool calls → trace → stop_reason

If stop reasons and traces aren’t stable, nothing else is.
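That contract can be pinned down in code. A minimal sketch: the stop-reason names match the loop in this article, and `assert_loop_contract` is an illustrative helper, not a library API.

```python
# A minimal sketch of the loop contract as executable assertions.
from typing import Any, Dict

# Pin the stop_reason taxonomy so it cannot drift silently.
STOP_REASONS = frozenset({"finish", "invalid_action", "max_tool_calls", "max_steps"})


def assert_loop_contract(result: Dict[str, Any]) -> None:
    """Assert the invariant parts of any run: stop_reason and trace shape."""
    assert result["stop_reason"] in STOP_REASONS, result["stop_reason"]
    assert isinstance(result["trace"], list)
    for entry in result["trace"]:
        # every trace entry must be attributable to a loop step
        assert "step" in entry
```

Call this helper at the end of every agent test, so a new or renamed stop reason fails loudly instead of drifting.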

Diagram: what a unit-testable agent looks like

  task ──▶ [ run_agent loop ] ──▶ { output, trace, stop_reason }
                │
                ├──▶ llm.next_action(state)   ← injected (faked in unit tests)
                └──▶ tools.call(name, args)   ← injected (faked in unit tests)

Real code: a unit-testable loop (Python + JS)

The trick is boring: dependency injection. Your loop accepts two things it can’t control:

  • llm.next_action(...)
  • tools.call(...)

Everything else should be deterministic and asserted.

PYTHON
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


@dataclass(frozen=True)
class Budget:
    max_steps: int = 10
    max_tool_calls: int = 10


class LLM(Protocol):
    def next_action(self, state: Dict[str, Any]) -> Dict[str, Any]: ...


class Tools(Protocol):
    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]: ...


def run_agent(task: str, *, llm: LLM, tools: Tools, budget: Budget) -> Dict[str, Any]:
    trace: List[Dict[str, Any]] = []
    tool_calls = 0
    state: Dict[str, Any] = {"task": task, "notes": []}

    for step in range(budget.max_steps):
        action = llm.next_action(state)
        trace.append({"step": step, "action": action})

        if action.get("type") == "finish":
            return {"output": action.get("answer", ""), "trace": trace, "stop_reason": "finish"}

        if action.get("type") != "tool":
            return {"output": "", "trace": trace, "stop_reason": "invalid_action"}

        tool_calls += 1
        if tool_calls > budget.max_tool_calls:
            return {"output": "", "trace": trace, "stop_reason": "max_tool_calls"}

        obs = tools.call(action["tool"], action.get("args", {}))
        trace.append({"step": step, "observation": obs, "tool": action["tool"]})
        state["notes"].append(obs)

    return {"output": "", "trace": trace, "stop_reason": "max_steps"}


# --- unit test (pytest style) ---
class FakeLLM:
    def __init__(self):
        self.n = 0

    def next_action(self, state):
        self.n += 1
        if self.n == 1:
            return {"type": "tool", "tool": "http.get", "args": {"url": "https://example.com"}}
        return {"type": "finish", "answer": "ok"}


class FakeTools:
    def __init__(self):
        self.calls = []

    def call(self, name, args):
        self.calls.append((name, args))
        return {"ok": True, "status": 200, "body": "hello"}


def test_unit_loop_contract():
    out = run_agent(
        "fetch once and finish",
        llm=FakeLLM(),
        tools=FakeTools(),
        budget=Budget(max_steps=5, max_tool_calls=3),
    )

    assert out["stop_reason"] == "finish"
    assert len(out["trace"]) >= 2
JAVASCRIPT
export function runAgent(task, { llm, tools, budget }) {
  const trace = [];
  let toolCalls = 0;
  const state = { task, notes: [] };

  for (let step = 0; step < budget.maxSteps; step++) {
    const action = llm.nextAction(state);
    trace.push({ step, action });

    if (action?.type === "finish") {
      return { output: action.answer ?? "", trace, stop_reason: "finish" };
    }

    if (action?.type !== "tool") {
      return { output: "", trace, stop_reason: "invalid_action" };
    }

    toolCalls += 1;
    if (toolCalls > budget.maxToolCalls) {
      return { output: "", trace, stop_reason: "max_tool_calls" };
    }

    const obs = tools.call(action.tool, action.args || {});
    trace.push({ step, tool: action.tool, observation: obs });
    state.notes.push(obs);
  }

  return { output: "", trace, stop_reason: "max_steps" };
}

// --- unit test (jest style) ---
test("unit loop contract", () => {
  const llm = {
    n: 0,
    nextAction() {
      this.n += 1;
      if (this.n === 1) return { type: "tool", tool: "http.get", args: { url: "https://example.com" } };
      return { type: "finish", answer: "ok" };
    },
  };
  const tools = {
    calls: [],
    call(name, args) {
      this.calls.push([name, args]);
      return { ok: true, status: 200 };
    },
  };

  const out = runAgent("fetch once and finish", {
    llm,
    tools,
    budget: { maxSteps: 5, maxToolCalls: 3 },
  });

  expect(out.stop_reason).toBe("finish");
  expect(tools.calls.length).toBe(1);
});

Real failure (the one that hurts)

We once changed a retry default in a shared tool wrapper. Nothing “crashed”.

But tool calls doubled on a busy route:

  • average tool calls/run: 8 → 16
  • cost impact: +~$900/day (tokens + tool credits)
  • on-call time: ~3 hours to prove it wasn’t “the model being weird”

The fix wasn’t a better prompt. The fix was a unit test that asserted:

  • max tool calls per run stays within a bound
  • stop_reason taxonomy doesn’t drift
  • the tool gateway doesn’t retry twice (agent retries + tool retries = storm)
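The third invariant is testable without any real network. A hedged sketch, assuming a generic retry wrapper; `retrying_call` is an illustrative name, not a specific library's API:

```python
# Regression-test sketch for the retry storm: agent-level retries and
# gateway-level retries multiply, so pin the product with a fake tool.

def retrying_call(fn, *, retries: int):
    """Call fn, retrying on exception up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise


def test_no_retry_amplification():
    attempts = {"n": 0}

    def flaky_tool():
        attempts["n"] += 1
        raise TimeoutError("simulated 5xx")

    agent_retries, gateway_retries = 1, 1
    for _ in range(agent_retries + 1):
        try:
            retrying_call(flaky_tool, retries=gateway_retries)
        except TimeoutError:
            pass

    # 2 agent attempts x 2 gateway attempts = 4 total calls.
    # If a shared default changes, this assertion fails before production does.
    assert attempts["n"] == 4
```

The test encodes the multiplication explicitly, so anyone touching a retry default sees the blast radius in the diff.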

Trade-offs

  • Unit tests won’t prove the model is “smart”. They prove your loop is safe.
  • Deterministic stubs can hide real-world tool flakiness (that’s what replay tests are for).
  • You’ll write more boring code. You’ll also page less.

When NOT to use this

Don’t unit test “prompt quality” as if it’s a deterministic function. If the goal is style/tone, use sampling + evals.

Do unit test:

  • budgets
  • tool allowlists
  • stop reasons
  • action schema validation
  • idempotency behavior
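A sketch of what one of those "bad path" tests can look like: an allowlist gate in front of the tools interface. `AllowlistTools` and `ToolNotAllowed` are illustrative names, not a specific framework's API.

```python
# An allowlist gate wrapped around any tools object.

class ToolNotAllowed(Exception):
    pass


class AllowlistTools:
    """Rejects tool calls outside the configured allowlist."""

    def __init__(self, inner, allow):
        self.inner = inner
        self.allow = set(allow)

    def call(self, name, args):
        if name not in self.allow:
            raise ToolNotAllowed(name)
        return self.inner.call(name, args)


class EchoTools:
    def call(self, name, args):
        return {"ok": True, "name": name}


def test_allowlist_blocks_unknown_tool():
    tools = AllowlistTools(EchoTools(), allow={"http.get"})
    assert tools.call("http.get", {})["ok"] is True
    try:
        tools.call("shell.exec", {"cmd": "whoami"})
        assert False, "expected ToolNotAllowed"
    except ToolNotAllowed:
        pass
```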

Copy-paste checklist

  • [ ] Inject llm and tools as interfaces (no globals).
  • [ ] Assert on stop_reason for every test.
  • [ ] Assert tool calls: count + sequence + args hash (not raw args if sensitive).
  • [ ] Test “bad paths”: invalid action, tool error, budget stop.
  • [ ] One golden test per production incident (yes, really).
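For the "args hash" item, one minimal approach is to hash the canonical JSON of the tool args, so tests can pin the call shape without storing sensitive values verbatim. A sketch:

```python
# Hash tool args canonically so assertions survive dict-ordering changes
# but fail on any real change to the call shape.
import hashlib
import json


def args_hash(args: dict) -> str:
    # sort_keys makes the hash independent of dict insertion order
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

In a test, pin the hash once from a known-good run and assert it on every subsequent run.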

Safe default config snippet (YAML)

YAML
agent_tests:
  budgets:
    max_steps: 25
    max_tool_calls: 12
  invariants:
    stop_reason_required: true
    action_schema_strict: true
    tool_allowlist_required: true
  golden_tasks:
    - id: "fetch_once"
      task: "Fetch https://example.com and summarize in 3 bullets."
      expect_stop_reason: "finish"
      max_tool_calls: 2
  replay:
    enabled: true
    mode: "record_then_replay"
    store: ".agent-replays/"
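The `record_then_replay` mode above can be approximated with a small wrapper around the tools interface. This is an illustrative in-memory sketch, not a specific library's API; a real implementation would persist fixtures under the configured store directory.

```python
# Record mode captures real observations keyed by (tool, canonical args);
# replay mode returns them deterministically with no network access.
import json


class ReplayTools:
    def __init__(self, inner=None, fixtures=None):
        self.inner = inner                    # real tools (record mode)
        self.fixtures = dict(fixtures or {})  # key -> recorded observation

    @staticmethod
    def _key(name, args):
        return name + ":" + json.dumps(args, sort_keys=True)

    def call(self, name, args):
        key = self._key(name, args)
        if key in self.fixtures:
            return self.fixtures[key]         # replay: deterministic
        if self.inner is None:
            raise KeyError(f"no recorded fixture for {key}")
        obs = self.inner.call(name, args)     # record: capture real output
        self.fixtures[key] = obs
        return obs
```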

FAQ

Q: Isn’t this just mocking the model?
A: Yes — on purpose. Unit tests are for your loop contract: budgets, tool gateway behavior, stop reasons, and trace shape.

Q: What should I assert on?
A: Stop reason, tool call count, tool allowlist decisions, and trace shape. Don’t assert on the exact prose output.

Q: How do I test tool flakiness?
A: Record/replay fixtures (or sandboxed integration tests). Unit tests should stay deterministic.

Q: Do I need evals if I have unit tests?
A: Yes. Unit tests stop incidents. Evals catch quality drift. They’re different failure classes.

⏱ 6 min read · Updated March 2026 · Difficulty: ★★☆
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.