Unit Tests for AI Agents (deterministic, cheap, actually useful)

Problem (what breaks first)

In dev, your agent "works".

Then you change something boring:

a tool schema key
retry/backoff defaults
the stop condition
the model version

And suddenly production looks like this:

3× more tool calls
2× cost overnight
runs that never reach finish and simply end with "stop: budget"

If you can't reproduce a run deterministically, you don't have a bug, you have archaeology.

Why this fails in prod

Agents fail differently from "normal" software because they are driven by:

a probabilistic planner (the model)
a loop runtime (your orchestration)
side effects (tools)
external instability (429/5xx, partial responses, timeouts)

Many teams "test" the prompt. That is not enough. You have to test the loop contract:

Input → Actions → Tool Calls → Trace → stop_reason

If stop reasons and traces aren't stable, nothing else is either.
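One way to make that contract assertable is to reduce a run to a compact "shape" and compare it in tests. This is a minimal sketch, assuming the trace layout used by the loop further down (entries carrying either an `action` or an `observation` key). `trace_shape` is a hypothetical helper name, not part of any library.

```python
from typing import Any, Dict, List, Tuple


def trace_shape(result: Dict[str, Any]) -> Tuple[List[str], str]:
    """Reduce a run result to (ordered event labels, stop_reason)."""
    labels: List[str] = []
    for entry in result["trace"]:
        if "action" in entry:
            action = entry["action"]
            kind = action.get("type", "unknown")
            # Tool actions are labeled with the tool name; others only with their type.
            labels.append(f"tool:{action.get('tool', '?')}" if kind == "tool" else kind)
        elif "observation" in entry:
            labels.append(f"obs:{entry.get('tool', '?')}")
    return labels, result["stop_reason"]


# Hand-written result in the same layout the loop produces.
example = {
    "output": "ok",
    "stop_reason": "finish",
    "trace": [
        {"step": 0, "action": {"type": "tool", "tool": "http.get", "args": {}}},
        {"step": 0, "observation": {"ok": True}, "tool": "http.get"},
        {"step": 1, "action": {"type": "finish", "answer": "ok"}},
    ],
}

assert trace_shape(example) == (["tool:http.get", "obs:http.get", "finish"], "finish")
```

Comparing shapes instead of full traces keeps the assertion stable when observation bodies change but still fails when the sequence of actions or the stop reason drifts.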

Diagram: what a unit-testable agent looks like

Real code: a unit-testable loop (Python + JS)

The trick is boring: dependency injection. Your loop accepts the two things it cannot control as injected dependencies:

llm.next_action(...)
tools.call(...)

Everything else should be deterministic and asserted in tests.


PYTHON

from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


@dataclass(frozen=True)
class Budget:
    max_steps: int = 10
    max_tool_calls: int = 10


class LLM(Protocol):
    def next_action(self, state: Dict[str, Any]) -> Dict[str, Any]: ...


class Tools(Protocol):
    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]: ...


def run_agent(task: str, *, llm: LLM, tools: Tools, budget: Budget) -> Dict[str, Any]:
    trace: List[Dict[str, Any]] = []
    tool_calls = 0
    state: Dict[str, Any] = {"task": task, "notes": []}

    for step in range(budget.max_steps):
        # The model proposes the next action; every proposal is traced.
        action = llm.next_action(state)
        trace.append({"step": step, "action": action})

        if action.get("type") == "finish":
            return {"output": action.get("answer", ""), "trace": trace, "stop_reason": "finish"}

        # Anything that is neither "finish" nor "tool" ends the run explicitly.
        if action.get("type") != "tool":
            return {"output": "", "trace": trace, "stop_reason": "invalid_action"}

        # Enforce the tool-call budget before touching the tool.
        tool_calls += 1
        if tool_calls > budget.max_tool_calls:
            return {"output": "", "trace": trace, "stop_reason": "max_tool_calls"}

        obs = tools.call(action["tool"], action.get("args", {}))
        trace.append({"step": step, "observation": obs, "tool": action["tool"]})
        state["notes"].append(obs)

    # Step budget exhausted without a finish action.
    return {"output": "", "trace": trace, "stop_reason": "max_steps"}


# --- unit test (pytest style) ---
class FakeLLM:
    """Scripted model: one tool call, then finish."""

    def __init__(self):
        self.n = 0

    def next_action(self, state):
        self.n += 1
        if self.n == 1:
            return {"type": "tool", "tool": "http.get", "args": {"url": "https://example.com"}}
        return {"type": "finish", "answer": "ok"}


class FakeTools:
    """Records every call and returns a canned response."""

    def __init__(self):
        self.calls = []

    def call(self, name, args):
        self.calls.append((name, args))
        return {"ok": True, "status": 200, "body": "hello"}


def test_unit_loop_contract():
    tools = FakeTools()
    out = run_agent(
        "fetch once and finish",
        llm=FakeLLM(),
        tools=tools,
        budget=Budget(max_steps=5, max_tool_calls=3),
    )

    assert out["stop_reason"] == "finish"
    assert len(out["trace"]) >= 2
    # Exactly one tool call, with the expected name and args.
    assert tools.calls == [("http.get", {"url": "https://example.com"})]

JAVASCRIPT

export function runAgent(task, { llm, tools, budget }) {
  const trace = [];
  let toolCalls = 0;
  const state = { task, notes: [] };

  for (let step = 0; step < budget.maxSteps; step++) {
    const action = llm.nextAction(state);
    trace.push({ step, action });

    if (action?.type === "finish") {
      return { output: action.answer ?? "", trace, stop_reason: "finish" };
    }

    if (action?.type !== "tool") {
      return { output: "", trace, stop_reason: "invalid_action" };
    }

    toolCalls += 1;
    if (toolCalls > budget.maxToolCalls) {
      return { output: "", trace, stop_reason: "max_tool_calls" };
    }

    const obs = tools.call(action.tool, action.args || {});
    trace.push({ step, tool: action.tool, observation: obs });
    state.notes.push(obs);
  }

  return { output: "", trace, stop_reason: "max_steps" };
}

// --- unit test (jest style) ---
test("unit loop contract", () => {
  const llm = {
    n: 0,
    nextAction() {
      this.n += 1;
      if (this.n === 1) {
        return { type: "tool", tool: "http.get", args: { url: "https://example.com" } };
      }
      return { type: "finish", answer: "ok" };
    },
  };
  const tools = {
    calls: [],
    call(name, args) {
      this.calls.push([name, args]);
      return { ok: true, status: 200 };
    },
  };

  const out = runAgent("fetch once and finish", {
    llm,
    tools,
    budget: { maxSteps: 5, maxToolCalls: 3 },
  });

  expect(out.stop_reason).toBe("finish");
  expect(tools.calls.length).toBe(1);
});

Real outage (the kind that hurts)

We once changed a retry default in a shared tool wrapper. Nothing "crashed".

But tool calls doubled on a busy route:

Avg tool calls/run: 8 → 16
Cost impact: roughly +$900/day (tokens + tool credits)
On-call: ~3 hours to prove it wasn't "the model"

The fix was not a better prompt. The fix was a unit test that asserts:

max tool calls/run stays within a bound
the stop-reason taxonomy does not drift
the tool gateway does not retry twice (agent retries + tool retries = retry storm)
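The double-retry case in the last point is easy to show with a counting fake. The sketch below is illustrative only: `AlwaysTimesOut` and `call_with_retries` are hypothetical names, and the retry policy (3 attempts, no backoff) is an assumption chosen to keep the arithmetic visible.

```python
from typing import Any, Callable, Dict


class AlwaysTimesOut:
    """Fake upstream that fails on every call and counts attempts."""

    def __init__(self) -> None:
        self.attempts = 0

    def __call__(self, args: Dict[str, Any]) -> Dict[str, Any]:
        self.attempts += 1
        raise TimeoutError("simulated timeout")


def call_with_retries(
    fn: Callable[[Dict[str, Any]], Dict[str, Any]],
    args: Dict[str, Any],
    *,
    max_attempts: int,
) -> Dict[str, Any]:
    """One retry layer: try up to max_attempts (>= 1) times, then re-raise."""
    last_exc: Exception = TimeoutError("no attempt made")
    for _ in range(max_attempts):
        try:
            return fn(args)
        except TimeoutError as exc:
            last_exc = exc
    raise last_exc


def test_single_layer_is_bounded():
    upstream = AlwaysTimesOut()
    try:
        call_with_retries(upstream, {}, max_attempts=3)
    except TimeoutError:
        pass
    assert upstream.attempts == 3


def test_stacked_layers_multiply():
    upstream = AlwaysTimesOut()
    # Tool-level retry inside, agent-level retry on top of it.
    gateway = lambda args: call_with_retries(upstream, args, max_attempts=3)
    try:
        call_with_retries(gateway, {}, max_attempts=3)
    except TimeoutError:
        pass
    assert upstream.attempts == 9  # 3 x 3: this is the storm
```

The regression test worth keeping is the bounded one: count attempts at the outermost fake and assert they stay at or below the limit you intend, so a second retry layer sneaking in fails CI instead of doubling the bill.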

Trade-offs

Unit tests don't prove the model is "smart". They prove your loop is safe.
Deterministic stubs hide real tool flakiness (that is what replay tests are for).
You write more boring code. You also get paged less.

When NOT to do this

Don't test "prompt quality" like a deterministic function. If the goal is style/tone: sampling + evals.

What you should unit test:

budgets
tool allowlists
stop reasons
action schema validation
idempotency behavior
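Two of these, action schema validation and tool allowlists, fit into one pure function that is trivial to test. This is a rough sketch under assumptions: `validate_action` and `ALLOWED_TOOLS` are hypothetical, and `"tool_not_allowed"` is an extra stop reason that the loop shown earlier does not define.

```python
from typing import Any, FrozenSet, Optional

ALLOWED_TOOLS: FrozenSet[str] = frozenset({"http.get"})  # hypothetical allowlist


def validate_action(action: Any, allowed_tools: FrozenSet[str]) -> Optional[str]:
    """Return None if the action is acceptable, otherwise a stop_reason string."""
    if not isinstance(action, dict):
        return "invalid_action"
    kind = action.get("type")
    if kind == "finish":
        # A finish action may omit the answer, but if present it must be a string.
        return None if isinstance(action.get("answer", ""), str) else "invalid_action"
    if kind != "tool":
        return "invalid_action"
    if not isinstance(action.get("tool"), str):
        return "invalid_action"
    if not isinstance(action.get("args", {}), dict):
        return "invalid_action"
    if action["tool"] not in allowed_tools:
        return "tool_not_allowed"  # assumed stop reason, not in the loop above
    return None


# Table-driven cases: (action, expected stop_reason or None).
CASES = [
    ({"type": "finish", "answer": "ok"}, None),
    ({"type": "tool", "tool": "http.get", "args": {}}, None),
    ({"type": "tool", "tool": "shell.exec", "args": {}}, "tool_not_allowed"),
    ({"type": "tool", "args": {}}, "invalid_action"),
    ({"type": "tool", "tool": "http.get", "args": "oops"}, "invalid_action"),
    ({"type": "think"}, "invalid_action"),
    ("finish", "invalid_action"),
]

for action, expected in CASES:
    assert validate_action(action, ALLOWED_TOOLS) == expected
```

Because the function has no side effects, every new malformed action seen in production can become one more row in the table.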

Copy/paste checklist

[ ] Inject llm and tools as interfaces (no globals).
[ ] Assert on stop_reason in every test.
[ ] Assert tool calls: count + sequence + args hash (not raw args if they are sensitive).
[ ] Test the "bad paths": invalid action, tool error, budget stop.
[ ] One golden test per prod incident (yes, really).
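For the args-hash item, a canonical JSON dump plus SHA-256 is usually enough. A minimal sketch, assuming tool args are JSON-serializable and that calls are recorded as (name, args) tuples like `FakeTools.calls` in the Python example; `args_hash` and `call_fingerprints` are hypothetical helper names.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def args_hash(args: Dict[str, Any]) -> str:
    """Short, stable fingerprint of tool arguments (key order does not matter)."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def call_fingerprints(calls: List[Tuple[str, Dict[str, Any]]]) -> List[Tuple[str, str]]:
    """Turn recorded (name, args) calls into (name, args_hash) pairs."""
    return [(name, args_hash(args)) for name, args in calls]


# Same arguments in a different key order produce the same fingerprint.
a = args_hash({"url": "https://example.com", "timeout": 5})
b = args_hash({"timeout": 5, "url": "https://example.com"})
assert a == b

# Count + sequence + args hash in a single assertion.
recorded = [("http.get", {"url": "https://example.com"})]
expected = [("http.get", args_hash({"url": "https://example.com"}))]
assert call_fingerprints(recorded) == expected
```

A plain hash keeps raw values out of test output and snapshots, but it does not protect low-entropy secrets from guessing, so treat it as hygiene rather than as a security control.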

Safe default config snippet (YAML)

YAML

agent_tests:
  budgets:
    max_steps: 25
    max_tool_calls: 12
  invariants:
    stop_reason_required: true
    action_schema_strict: true
    tool_allowlist_required: true
  golden_tasks:
    - id: "fetch_once"
      task: "Fetch https://example.com and summarize in 3 bullets."
      expect_stop_reason: "finish"
      max_tool_calls: 2
  replay:
    enabled: true
    mode: "record_then_replay"
    store: ".agent-replays/"
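How a golden_tasks entry might be enforced against a run result, as a hedged sketch: the config keys mirror the YAML above, but `golden_violations` is a hypothetical helper, the config is written as a plain dict (loading the YAML is left out), and tool calls are counted as trace entries with an `observation` key, which matches the loop shown earlier.

```python
from typing import Any, Dict, List

# Mirrors the "fetch_once" entry under golden_tasks in the YAML above.
GOLDEN = {
    "id": "fetch_once",
    "expect_stop_reason": "finish",
    "max_tool_calls": 2,
}


def golden_violations(result: Dict[str, Any], golden: Dict[str, Any]) -> List[str]:
    """Return human-readable violations; an empty list means the run passes."""
    problems: List[str] = []

    stop_reason = result.get("stop_reason")
    if stop_reason is None:
        problems.append("missing stop_reason")
    elif stop_reason != golden["expect_stop_reason"]:
        problems.append(
            f"stop_reason {stop_reason!r} != {golden['expect_stop_reason']!r}"
        )

    # Every executed tool call leaves exactly one observation entry in the trace.
    tool_calls = sum(1 for entry in result.get("trace", []) if "observation" in entry)
    if tool_calls > golden["max_tool_calls"]:
        problems.append(f"{tool_calls} tool calls > limit {golden['max_tool_calls']}")

    return problems


good_run = {
    "stop_reason": "finish",
    "trace": [{"step": 0, "observation": {"ok": True}, "tool": "http.get"}],
}
assert golden_violations(good_run, GOLDEN) == []
```

Returning a list of violations instead of raising on the first one makes the failure message useful when a change breaks both the stop reason and the call budget at once.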

FAQ

Isn't this just "model mocking"?

Yes, deliberately. Unit tests are for your loop contract: budgets, tool gateway behavior, stop reasons, and trace shape.

What should I assert on?

Stop reason, tool call count, allowlist decisions, and trace shape. Not on exact prose output.

How do I test tool flakiness?

Record/replay fixtures (or sandboxed integration tests). Unit tests should stay deterministic.
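A record/replay fixture can be as small as two wrappers around the `tools.call(name, args)` interface. The following is a sketch under assumptions, not a reference implementation: `RecordingTools` and `ReplayTools` are hypothetical names, observations are assumed to be JSON-serializable, and replay is strict (any divergence from the recording fails the test).

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class RecordingTools:
    """Wraps a real tools object and records every call with its observation."""

    def __init__(self, inner: Any) -> None:
        self.inner = inner
        self.events: List[Dict[str, Any]] = []

    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        obs = self.inner.call(name, args)
        self.events.append({"name": name, "args": args, "observation": obs})
        return obs

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(self.events, indent=2), encoding="utf-8")


class ReplayTools:
    """Replays recorded observations in order and fails on any divergence."""

    def __init__(self, events: List[Dict[str, Any]]) -> None:
        self.events = list(events)
        self.cursor = 0

    @classmethod
    def load(cls, path: Path) -> "ReplayTools":
        return cls(json.loads(path.read_text(encoding="utf-8")))

    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        if self.cursor >= len(self.events):
            raise AssertionError(f"unexpected extra call: {name}")
        expected = self.events[self.cursor]
        if (expected["name"], expected["args"]) != (name, args):
            raise AssertionError(
                f"call {self.cursor} diverged: expected {expected['name']}, got {name}"
            )
        self.cursor += 1
        return expected["observation"]
```

Record once against the real tool (including its 429s and timeouts), commit the file, and replay it in CI: the flaky behavior becomes a deterministic input instead of a source of flaky tests.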

Do I need evals if I have unit tests?

Yes. Unit tests prevent incidents. Evals catch quality drift. They cover different failure classes.
