Unit tests for AI agents (deterministic, cheap, actually useful)

The problem (what breaks first)

In dev, the agent “works”.

Then you change something boring:

a key in a tool's schema
retry/backoff defaults
a stop condition
the model version

And suddenly prod looks like this:

3× more tool calls
2× the cost overnight
runs that never reach finish and end with “stop: budget”

If you can't reproduce a run deterministically, you don't have a bug. You have archaeology.

Why this breaks in prod

Agents break differently from ordinary code because they are driven by:

a probabilistic planner (the model)
a runtime loop (your orchestration)
side effects (tools)
an unstable environment (429/5xx, partial responses, timeouts)

Many teams “test” the prompt. That is not enough. You need to test the contract of the loop:

input → actions → tool calls → trace → stop_reason

If stop reasons and traces aren't stable, nothing is stable.
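
One way to keep stop reasons stable is to treat them as a closed set and to check every run result against it. The sketch below is only an illustration: `StopReason` and `check_result_contract` are hypothetical names, and the four values are assumed to mirror the stop reasons returned by the loop shown in this article.

```python
from enum import Enum
from typing import Any, Dict


class StopReason(str, Enum):
    # Assumed taxonomy: the four stop reasons the article's loop can return.
    FINISH = "finish"
    INVALID_ACTION = "invalid_action"
    MAX_TOOL_CALLS = "max_tool_calls"
    MAX_STEPS = "max_steps"


def check_result_contract(result: Dict[str, Any]) -> StopReason:
    """Raise ValueError if a run result breaks the contract; return the stop reason."""
    for key in ("output", "trace", "stop_reason"):
        if key not in result:
            raise ValueError(f"missing key: {key}")
    if not isinstance(result["trace"], list):
        raise ValueError("trace must be a list")
    # An unknown string raises ValueError here, which is how taxonomy drift shows up.
    return StopReason(result["stop_reason"])
```

A test can call `check_result_contract(out)` on every result it produces, so a new or renamed stop reason fails loudly instead of drifting into dashboards unnoticed.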

Diagram: what a unit-tested agent looks like

Real code: a unit-testable loop (Python + JS)

The trick is boring: dependency injection. The loop takes two things it doesn't control:

llm.next_action(...)
tools.call(...)

Everything else must be deterministic and verifiable.

PYTHON

from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


@dataclass(frozen=True)
class Budget:
    max_steps: int = 10
    max_tool_calls: int = 10


class LLM(Protocol):
    def next_action(self, state: Dict[str, Any]) -> Dict[str, Any]: ...


class Tools(Protocol):
    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]: ...


def run_agent(task: str, *, llm: LLM, tools: Tools, budget: Budget) -> Dict[str, Any]:
    trace: List[Dict[str, Any]] = []
    tool_calls = 0
    state: Dict[str, Any] = {"task": task, "notes": []}

    for step in range(budget.max_steps):
        action = llm.next_action(state)
        trace.append({"step": step, "action": action})

        if action.get("type") == "finish":
            return {"output": action.get("answer", ""), "trace": trace, "stop_reason": "finish"}

        # Anything that is neither "finish" nor "tool" ends the run explicitly.
        if action.get("type") != "tool":
            return {"output": "", "trace": trace, "stop_reason": "invalid_action"}

        # Enforce the tool budget before the side effect happens.
        tool_calls += 1
        if tool_calls > budget.max_tool_calls:
            return {"output": "", "trace": trace, "stop_reason": "max_tool_calls"}

        obs = tools.call(action["tool"], action.get("args", {}))
        trace.append({"step": step, "observation": obs, "tool": action["tool"]})
        state["notes"].append(obs)

    return {"output": "", "trace": trace, "stop_reason": "max_steps"}


# --- unit test (pytest style) ---
class FakeLLM:
    """Scripted planner: one tool call, then finish."""

    def __init__(self):
        self.n = 0

    def next_action(self, state):
        self.n += 1
        if self.n == 1:
            return {"type": "tool", "tool": "http.get", "args": {"url": "https://example.com"}}
        return {"type": "finish", "answer": "ok"}


class FakeTools:
    """Records every call and returns a canned observation."""

    def __init__(self):
        self.calls = []

    def call(self, name, args):
        self.calls.append((name, args))
        return {"ok": True, "status": 200, "body": "hello"}


def test_unit_loop_contract():
    tools = FakeTools()
    out = run_agent(
        "fetch once and finish",
        llm=FakeLLM(),
        tools=tools,
        budget=Budget(max_steps=5, max_tool_calls=3),
    )

    assert out["stop_reason"] == "finish"
    assert out["output"] == "ok"
    # Exactly one tool call, with the scripted name and args.
    assert tools.calls == [("http.get", {"url": "https://example.com"})]
    # Trace shape: tool action, observation, finish action.
    assert len(out["trace"]) == 3

JAVASCRIPT

export function runAgent(task, { llm, tools, budget }) {
  const trace = [];
  let toolCalls = 0;
  const state = { task, notes: [] };

  for (let step = 0; step < budget.maxSteps; step++) {
    const action = llm.nextAction(state);
    trace.push({ step, action });

    if (action?.type === "finish") {
      return { output: action.answer ?? "", trace, stop_reason: "finish" };
    }

    if (action?.type !== "tool") {
      return { output: "", trace, stop_reason: "invalid_action" };
    }

    toolCalls += 1;
    if (toolCalls > budget.maxToolCalls) {
      return { output: "", trace, stop_reason: "max_tool_calls" };
    }

    const obs = tools.call(action.tool, action.args || {});
    trace.push({ step, tool: action.tool, observation: obs });
    state.notes.push(obs);
  }

  return { output: "", trace, stop_reason: "max_steps" };
}

// --- unit test (jest style) ---
test("unit loop contract", () => {
  const llm = {
    n: 0,
    nextAction() {
      this.n += 1;
      if (this.n === 1) {
        return { type: "tool", tool: "http.get", args: { url: "https://example.com" } };
      }
      return { type: "finish", answer: "ok" };
    },
  };
  const tools = {
    calls: [],
    call(name, args) {
      this.calls.push([name, args]);
      return { ok: true, status: 200 };
    },
  };

  const out = runAgent("fetch once and finish", {
    llm,
    tools,
    budget: { maxSteps: 5, maxToolCalls: 3 },
  });

  expect(out.stop_reason).toBe("finish");
  expect(tools.calls.length).toBe(1);
});

A real failure (the one that hurts)

We once changed a retry default in a shared tool wrapper. Nothing “broke”.

But on a busy route, tool calls doubled:

average tool calls/run: 8 → 16
cost impact: +~$900/day (tokens + tool credits)
on-call: ~3 hours to prove it wasn't “the model got weird”

The fix was not “a better prompt”. The fix was a unit test that checks:

max tool calls/run stays within its bound
the stop-reason taxonomy doesn't drift
the tool gateway doesn't retry twice (agent retries + tool retries = storm)
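
The retry storm is easy to reproduce with fakes. The sketch below is not the wrapper from the incident; `RetryLayer`, `FailingTransport`, and `attempts_for_one_call` are hypothetical names, and the assumption is that each layer retries a failed call up to `max_attempts` times. It shows why two stacked retry layers multiply the attempts rather than add them.

```python
from typing import Any, Callable, Dict

ToolFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


class RetryLayer:
    """Hypothetical retry wrapper: calls `inner` up to `max_attempts` times."""

    def __init__(self, inner: ToolFn, max_attempts: int) -> None:
        self.inner = inner
        self.max_attempts = max_attempts

    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        result: Dict[str, Any] = {"ok": False, "error": "no_attempts"}
        for _ in range(self.max_attempts):
            result = self.inner(name, args)
            if result.get("ok"):
                break
        return result


class FailingTransport:
    """Fake transport that always fails and counts how often it is hit."""

    def __init__(self) -> None:
        self.attempts = 0

    def __call__(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        self.attempts += 1
        return {"ok": False, "status": 503}


def attempts_for_one_call(layers: int, max_attempts: int = 2) -> int:
    """Stack `layers` retry wrappers and count transport hits for one logical call."""
    transport = FailingTransport()
    fn: ToolFn = transport
    for _ in range(layers):
        fn = RetryLayer(fn, max_attempts).call
    fn("http.get", {"url": "https://example.com"})
    return transport.attempts


def test_retry_layers_multiply_attempts():
    # One retry layer (the gateway only): attempts stay at the configured bound.
    assert attempts_for_one_call(layers=1) == 2
    # Agent retries stacked on gateway retries: 2 x 2 = 4 transport hits.
    assert attempts_for_one_call(layers=2) == 4
```

In a real suite the useful assertion is the first one, applied to the production wiring: one logical tool call must never hit the transport more than the gateway's configured bound.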

Trade-offs

Unit tests don't prove the model is “smart”. They prove your loop is safe.
Deterministic stubs hide the real flakiness of tools (that's what replay tests are for).
There will be more boring code. There will be fewer pages.

When NOT to do this

Don't “unit test” prompt quality as if it were a deterministic function. If the goal is style/tone: sampling + evals.

Unit tests should cover:

budgets
tool allowlists
stop reasons
action schema validation
idempotency behavior
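
As an illustration of the allowlist and schema items, here is a minimal validator that a loop could call before touching any tool. Everything in it is an assumption made for the sketch: the `validate_action` name, the `ALLOWED_TOOLS` set, the extra `tool_not_allowed` stop reason, and the stricter rule that a finish action must carry a string answer.

```python
from typing import Any, FrozenSet, Optional

# Hypothetical allowlist for the sketch; a real one would come from config.
ALLOWED_TOOLS: FrozenSet[str] = frozenset({"http.get"})


def validate_action(action: Any, allowlist: FrozenSet[str] = ALLOWED_TOOLS) -> Optional[str]:
    """Return None if the action is acceptable, otherwise a stop reason string."""
    if not isinstance(action, dict):
        return "invalid_action"
    kind = action.get("type")
    if kind == "finish":
        # Assumed strict rule: a finish action must carry a string answer.
        return None if isinstance(action.get("answer"), str) else "invalid_action"
    if kind != "tool":
        return "invalid_action"
    if not isinstance(action.get("tool"), str):
        return "invalid_action"
    if not isinstance(action.get("args", {}), dict):
        return "invalid_action"
    if action["tool"] not in allowlist:
        return "tool_not_allowed"
    return None


def test_action_validation():
    assert validate_action({"type": "finish", "answer": "ok"}) is None
    assert validate_action({"type": "tool", "tool": "http.get", "args": {}}) is None
    assert validate_action({"type": "tool", "tool": "shell.exec"}) == "tool_not_allowed"
    assert validate_action({"type": "tool"}) == "invalid_action"
    assert validate_action("finish") == "invalid_action"
```

Because the validator is a pure function, each rule gets its own one-line assertion and no model or tool is involved.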

Checklist (copy-paste friendly)

[ ] Inject llm and tools as interfaces (no global state).
[ ] Assert on stop_reason in every test.
[ ] Assert tool calls: count + order + args hash (not raw args, if they are sensitive).
[ ] Test the “bad paths”: invalid action, tool error, budget stop.
[ ] One golden test per prod incident (yes, really).

Safe default config (YAML)
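
The “count + order + args hash” item can be sketched with the standard library only. `args_hash` and `call_fingerprint` are hypothetical helper names, and the sketch assumes tool args are JSON-serializable; truncating the digest to 12 hex characters is an arbitrary choice for readable test output.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def args_hash(args: Dict[str, Any]) -> str:
    """Stable short hash of tool args; key order does not affect the result."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def call_fingerprint(calls: List[Tuple[str, Dict[str, Any]]]) -> List[Tuple[str, str]]:
    """Turn recorded (name, args) pairs into (name, hash) pairs, preserving order."""
    return [(name, args_hash(args)) for name, args in calls]


def test_call_fingerprint():
    recorded = [
        ("http.get", {"url": "https://example.com", "timeout": 5}),
        ("db.write", {"id": 1}),
    ]
    expected = [
        ("http.get", args_hash({"timeout": 5, "url": "https://example.com"})),
        ("db.write", args_hash({"id": 1})),
    ]
    # Count, order, and args are all covered by one comparison.
    assert call_fingerprint(recorded) == expected
    # Swapping the order of calls must be detected.
    assert call_fingerprint(list(reversed(recorded))) != expected
```

Comparing fingerprints keeps raw argument values out of assertion messages and CI logs, which is the point of hashing when args are sensitive.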

YAML

agent_tests:
  budgets:
    max_steps: 25
    max_tool_calls: 12
  invariants:
    stop_reason_required: true
    action_schema_strict: true
    tool_allowlist_required: true
  golden_tasks:
    - id: "fetch_once"
      task: "Fetch https://example.com and summarize in 3 bullets."
      expect_stop_reason: "finish"
      max_tool_calls: 2
  replay:
    enabled: true
    mode: "record_then_replay"
    store: ".agent-replays/"
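
A golden task entry from the config above could be checked against a run result roughly like this. This is a sketch under assumptions: the entry is written as a plain dict instead of being parsed from YAML, `golden_violations` is a hypothetical helper, and tool calls are counted as trace entries that carry an `observation` key, which matches the trace shape of the loop in this article.

```python
from typing import Any, Dict, List

# One golden_tasks entry, mirrored by hand from the YAML config.
GOLDEN_FETCH_ONCE: Dict[str, Any] = {
    "id": "fetch_once",
    "task": "Fetch https://example.com and summarize in 3 bullets.",
    "expect_stop_reason": "finish",
    "max_tool_calls": 2,
}


def golden_violations(golden: Dict[str, Any], result: Dict[str, Any]) -> List[str]:
    """Return human-readable violations; an empty list means the run passed."""
    problems: List[str] = []
    stop_reason = result.get("stop_reason")
    if stop_reason != golden["expect_stop_reason"]:
        problems.append(
            f"{golden['id']}: stop_reason {stop_reason!r} != {golden['expect_stop_reason']!r}"
        )
    # Assumption: every executed tool call leaves one trace entry with "observation".
    tool_calls = sum(1 for entry in result.get("trace", []) if "observation" in entry)
    if tool_calls > golden["max_tool_calls"]:
        problems.append(f"{golden['id']}: {tool_calls} tool calls > {golden['max_tool_calls']}")
    return problems
```

Returning a list instead of raising on the first problem means a single failing golden run reports every violated invariant at once.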

FAQ

Isn't this just mocking the model?

Yes, on purpose. Unit tests are for the contract of the loop: budgets, tool gateway behavior, stop reasons, and trace shape.

What exactly should I assert?

Stop reason, number of tool calls, allowlist decisions, and trace shape. Don't assert the exact answer text.

How do I test tool flakiness?

Record/replay fixtures (or integration tests in a sandbox). Unit tests must stay deterministic.

Are unit tests enough, or do I need evals?

You need evals too. Unit tests stop incidents. Evals catch quality drift. These are different classes of failure.
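
A record/replay fixture can be as small as two wrappers around the tools interface. The sketch below is an assumption-heavy illustration, not a specific library: `RecordingTools` and `ReplayTools` are hypothetical classes, observations are assumed to be JSON-serializable, and replay is strict about call order and arguments.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class RecordingTools:
    """Wraps real tools and records every (name, args, observation) triple."""

    def __init__(self, real_tools: Any, path: Path) -> None:
        self.real_tools = real_tools
        self.path = Path(path)
        self.records: List[Dict[str, Any]] = []

    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        observation = self.real_tools.call(name, args)
        self.records.append({"name": name, "args": args, "observation": observation})
        return observation

    def save(self) -> None:
        self.path.write_text(json.dumps(self.records, indent=2), encoding="utf-8")


class ReplayTools:
    """Replays a recording in order; any deviation fails the test immediately."""

    def __init__(self, path: Path) -> None:
        self.records = json.loads(Path(path).read_text(encoding="utf-8"))
        self.position = 0

    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        if self.position >= len(self.records):
            raise AssertionError(f"unexpected extra tool call: {name}")
        record = self.records[self.position]
        if record["name"] != name or record["args"] != args:
            raise AssertionError(f"call {self.position} differs from recording: {name}")
        self.position += 1
        return record["observation"]
```

Record once against a sandbox, commit the JSON file, and replay it in CI; the recorded 429s and timeouts then become deterministic inputs to the loop.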

Related pages

Basics: Tool calling by an AI agent (with code) · What makes an agent production-ready (guardrails + code)
Failures: Tool spam loops (failure mode + fixes + code) · Budget explosion (when an agent burns money) + fixes + code
Governance: Budget Controls for AI agents (steps, time, $) + code
Production stack: Production stack for an AI agent