Idea In 30 Seconds
Unit tests for AI agents validate local logic: tool selection, response handling, stop reason, and output format.
Their main value is speed, determinism, and isolation, which lets you see immediately which part of the system broke.
Problem
Without unit tests, teams often validate agents only through manual runs or heavy end-to-end tests.
This creates common problems:
- errors in local logic are found too late;
- it is hard to tell whether code broke or an external dependency failed;
- small regressions accumulate and reach production.
As a result, even a simple change can trigger a chain of opaque failures in production.
When To Use
You should write unit tests whenever you have local and verifiable logic:
- tool selection by request type;
- output schema validation;
- tool error handling;
- run completion conditions (stop_reason);
- safety rules at step or function level.
If behavior can be validated without network calls and without full agent runtime, it is a good unit-test candidate.
Implementation
In practice, unit testing for agents follows a simple rule: one behavior, one test, controlled conditions. The examples below are schematic and not tied to a specific framework.
The unit level is not suitable for evaluating overall response quality, usefulness of the final output, or the agent's general "smartness". For that, use an eval harness and golden datasets.
How It Works In One Test
Short unit-test cycle
- Test case - one behavior to validate.
- Setup - fakes, mocks, and fixed conditions.
- Run - execute a specific function or step.
- Assertions - validate tool choice, schema, and stop reason.
1. Isolate agent decision logic
```python
def choose_tool(intent: str, tools_allowed: list[str]) -> str:
    if intent == "price_lookup" and "crypto_price_api" in tools_allowed:
        return "crypto_price_api"
    return "web_search"
```
The fewer side dependencies a function has, the more stable the test.
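Because `choose_tool` is a pure function, its tests need no agent runtime, network, or mocks at all. A minimal pytest-style sketch (the function is reproduced so the snippet is self-contained):

```python
# choose_tool as defined above
def choose_tool(intent: str, tools_allowed: list[str]) -> str:
    if intent == "price_lookup" and "crypto_price_api" in tools_allowed:
        return "crypto_price_api"
    return "web_search"


def test_choose_tool_prefers_price_api():
    # The dedicated tool is selected when the intent matches and the tool is allowed.
    assert choose_tool("price_lookup", ["crypto_price_api"]) == "crypto_price_api"


def test_choose_tool_falls_back_when_tool_missing():
    # If the tool is not in the allowlist, the function degrades to search.
    assert choose_tool("price_lookup", []) == "web_search"


def test_choose_tool_defaults_for_other_intents():
    # Any other intent routes to the default tool.
    assert choose_tool("news", ["crypto_price_api"]) == "web_search"
```

Each test locks exactly one branch of the decision, so a failure points directly at the broken case.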
2. Replace external tools
```python
class FakeTools:
    def crypto_price_api(self, symbol: str):
        return {"symbol": symbol, "price": 65000}
```
A unit test should validate agent logic, not availability of external APIs.
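The negative-scenario test in step 4 uses a failing fake. One hypothetical way to define it is a counterpart to `FakeTools` that always raises, simulating a tool outage:

```python
class FailingTools:
    """Fake that simulates a tool outage (hypothetical counterpart to FakeTools)."""

    def crypto_price_api(self, symbol: str):
        # Always fail, so error-handling paths are exercised deterministically.
        raise ConnectionError(f"crypto_price_api unavailable for {symbol}")
```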
3. Validate more than final response text
```python
def test_tool_selection_and_schema():
    tools = FakeTools()
    agent = Agent(tools=tools)
    result = agent.run("What is the price of BTC?")
    assert result.selected_tool == "crypto_price_api"
    assert isinstance(result.output, dict)
    assert result.output["symbol"] == "BTC"
```
It is better to lock structural invariants (selected_tool, schema, stop reason) than to check only the final text.
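If the framework does not validate output schemas for you, a small standalone checker can lock the schema invariant. This is a sketch; the required keys and types are assumptions based on the example output above:

```python
# Expected output shape; keys and types are illustrative assumptions.
REQUIRED_KEYS = {"symbol": str, "price": (int, float)}


def schema_errors(output: dict) -> list[str]:
    """Return human-readable schema violations; an empty list means valid."""
    errors = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in output:
            errors.append(f"missing key: {key}")
        elif not isinstance(output[key], expected_type):
            errors.append(f"wrong type for {key}: {type(output[key]).__name__}")
    return errors
```

In a test, `assert schema_errors(result.output) == []` then fails with a precise message instead of a bare `KeyError`.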
4. Test negative scenarios
```python
def test_tool_error_is_handled():
    tools = FailingTools()
    agent = Agent(tools=tools)
    result = agent.run("Find BTC price")
    assert result.stop_reason == "tool_error_handled"
    assert result.error is not None
```
Tool failures should have predictable and testable behavior.
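One way to make failures predictable is to route every tool call through a wrapper that converts exceptions into an explicit stop reason. The field names below are assumptions that mirror the test above:

```python
def call_tool_safely(tool_fn, *args, **kwargs) -> dict:
    """Run a tool and normalize the outcome into a uniform result dict.

    The 'tool_error_handled' stop reason mirrors the assertion in the
    negative-scenario test; all names here are illustrative.
    """
    try:
        output = tool_fn(*args, **kwargs)
        return {"output": output, "stop_reason": "completed", "error": None}
    except Exception as exc:
        # The error is captured, not re-raised, so the agent run ends cleanly.
        return {"output": None, "stop_reason": "tool_error_handled", "error": str(exc)}
```

Because both success and failure produce the same shape, tests can assert on `stop_reason` without branching.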
5. Integrate unit tests into CI
```yaml
name: unit-tests
on:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/unit -q
```
If a test is slow or unstable, move it to eval harness or integration layer.
Typical Mistakes
Dependence on real APIs
The test fails not because of agent logic, but because of network or external service availability.
Typical cause: missing fakes or mocks for tools.
Testing only final text
The test is green, but it does not guarantee correct tool selection or output format.
Typical cause: no checks for selected_tool, schema, and stop reason.
Too much logic in one test
One test validates several scenarios at once, and when it fails it is unclear what exactly broke.
Typical cause: no "one test, one behavior" rule.
Unstable test environment
Even correct unit tests become noisy if dependencies, configuration, or tool replacements drift between runs.
Typical cause: unit tests still partially depend on real runtime or external calls.
Trying to cover everything via e2e runs
The team writes only large scenarios and skips basic local validation.
Typical cause: no clear split between unit, eval, and regression levels.
Summary
- Unit tests for agents validate local and deterministic logic.
- Replace tools via fakes or mocks to remove network noise.
- Lock structural checks: tool choice, schema, stop reason.
- Fast unit tests should run in every PR.
FAQ
Q: Can unit tests replace eval harness?
A: No. Unit tests catch local failures, while an eval harness validates full agent behavior on complete scenarios.
Q: Should we connect a real LLM in unit tests?
A: Prefer minimal usage. For unit level, deterministic logic with fakes or mocks and controlled conditions works better.
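If the unit under test does call a model client, a deterministic fake keeps the test at unit level. The interface below is hypothetical; adapt it to whatever client your agent actually uses:

```python
class FakeLLM:
    """Deterministic stand-in for an LLM client; records prompts for assertions."""

    def __init__(self, canned_response: str):
        self.canned_response = canned_response
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        # Record the prompt so tests can assert on what the agent sent.
        self.prompts.append(prompt)
        return self.canned_response
```

A test can then assert both on the output and on the exact prompt the agent produced, with zero network noise.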
Q: What must every agent unit test verify?
A: Tool selection, output structure, error handling, and stop reason in negative scenarios.
Q: When should a test move from unit level to eval level?
A: When it depends on full scenario behavior, response-quality metrics, or baseline comparisons.
What Next
After unit level, add scenario validation through Eval Harness, and maintain a stable case set through Golden Datasets.
For version-to-version control, add Regression Testing. For production-incident analysis, use Replay and Debugging. Keep the full picture in Testing Strategy.