Eval Harness for AI Agents: Repeatable Evaluations

An eval harness lets you run repeatable tests for AI agents and compare results across versions.
On this page
  1. Idea In 30 Seconds
  2. Problem
  3. Core Concept / Model
  4. How It Works
  5. Implementation
     • 1. Test case structure
     • 2. Runner for case execution
     • 3. Scoring and baseline comparison
     • 4. Report and CI gate
     • 5. Release gate in overall strategy
  6. Typical Mistakes
     • Unstable dataset
     • Unpinned model version
     • Manual runs instead of automation
     • No comparison with baseline
     • Mixed deterministic and non-deterministic checks
     • Missing run artifacts
     • Unstable eval runs
  7. Summary
  8. FAQ
  9. What Next

Idea In 30 Seconds

An eval harness runs the same scenario set against an agent, scores the results with the same rules, and compares the candidate against a baseline.


Problem

Without an eval harness, teams often test agents manually:

  • run several chat requests;
  • review a few sample answers;
  • conclude that a change looks safe.

This does not provide a stable signal: a change can look fine on random examples, yet break critical production scenarios.

Most common outcomes:

  • impossible to compare candidate and baseline fairly;
  • difficult to reproduce regressions;
  • CI has no clear rule for when to block release.

Core Concept / Model

An eval harness is not a single test. It is a validation pipeline: a fixed dataset, controlled run conditions, scoring, comparison against a baseline, and reporting.

Component           | What it does
Dataset             | Stores stable scenarios and expected behavior
Runner              | Runs the agent on each scenario under the same conditions and collects run outputs
Evaluators          | Apply deterministic checks, LLM-as-a-judge scoring, and quality metrics
Baseline comparator | Compares the candidate against the baseline
Report + CI gate    | Builds the summary and decides pass/fail for release

The more stable these components are, the lower the chance that a diff between candidate and baseline is caused by run conditions rather than a real change in behavior.

How It Works

In practice, the eval harness runs as part of the release pipeline. Every change goes through the same scenario set.

How one eval harness run works
  • Dataset - fixed set of cases is loaded.
  • Runner - agent runs each case in identical conditions.
  • Evaluators - deterministic checks and, when needed, LLM-as-a-judge scoring are applied.
  • Baseline comparison - candidate is compared with baseline on the same cases.
  • Report - case-level report and overall summary are saved.
  • Gate - CI passes or blocks release based on thresholds.
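The steps above can be sketched as a single loop. This is a schematic, self-contained example with a toy agent and illustrative thresholds, not a real framework API:

```python
# Schematic sketch of one harness run: dataset -> runner -> evaluator ->
# baseline comparison -> gate. Names and thresholds are illustrative.

def run_harness(agent_fn, dataset, baseline_rate, threshold):
    passed = []
    for case in dataset:                           # Runner: same conditions per case
        output = agent_fn(case["input"])
        passed.append(output == case["expected"])  # Evaluator: deterministic check
    success_rate = sum(passed) / len(passed)
    return {
        "success_rate": success_rate,
        "diff_vs_baseline": success_rate - baseline_rate,  # Baseline comparison
        "gate_passed": success_rate >= threshold,          # CI gate decision
    }

# Toy agent and fixed dataset to show the flow end to end.
dataset = [
    {"id": "upper_1", "input": "btc", "expected": "BTC"},
    {"id": "upper_2", "input": "eth", "expected": "ETH"},
]
report = run_harness(lambda text: text.upper(), dataset,
                     baseline_rate=1.0, threshold=0.9)
print(report["gate_passed"])  # True: success rate 1.0 clears the 0.9 gate
```

A real harness replaces the toy agent and the equality check with the runner and evaluators described below, but the shape of the loop stays the same.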

An eval harness does not replace unit tests. Unit tests validate local components, while the harness validates full-system behavior on end-to-end scenarios.

Implementation

In practice, an eval harness relies on a few simple rules. The examples below are schematic and not tied to any specific framework.

1. Test case structure

PYTHON
case = {
    "id": "price_btc_basic",
    "input": "What is the price of BTC?",
    "expected_tool": "crypto_price_api",
    "checks": ["tool_selection", "valid_output_schema"],
}

Clear cases make regression analysis easier and reduce ambiguity during run review.

2. Runner for case execution

PYTHON
def run_case(agent, case):
    result = agent.run(case["input"])
    return {
        "case_id": case["id"],
        "selected_tool": result.selected_tool,
        "output": result.output,
        "stop_reason": result.stop_reason,
    }

The new version and the baseline must run under identical conditions: the same timeouts, tool mocks, limits, and runtime environment settings.
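One way to enforce this is a single frozen config object that is passed to both runs and recorded in every run result. Field names here are illustrative assumptions:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunConfig:
    model: str = "gpt-4o-2024-08-06"  # pinned snapshot, not a floating alias
    timeout_s: int = 30
    max_steps: int = 8
    use_tool_mocks: bool = True

def run_case(agent_fn, case, config):
    # Recording the exact config in every result lets a later diff be
    # traced back to run conditions instead of guessed at.
    output = agent_fn(case["input"], config)
    return {"case_id": case["id"], "output": output, "config": asdict(config)}

config = RunConfig()  # the same frozen object for candidate and baseline runs
result = run_case(lambda text, cfg: text.upper(),
                  {"id": "demo", "input": "btc"}, config)
print(result["config"]["model"])  # gpt-4o-2024-08-06
```

Because the dataclass is frozen, neither run can mutate the conditions mid-suite.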

3. Scoring and baseline comparison

PYTHON
def evaluate_case(run_result, case):
    checks = {
        "tool_selection": run_result["selected_tool"] == case["expected_tool"],
        "valid_output_schema": isinstance(run_result["output"], dict),
    }
    return {"passed": all(checks.values()), "checks": checks}

candidate = run_eval_suite(agent=candidate_agent, dataset=dataset)
baseline = load_baseline_report("reports/baseline.json")
diff = compare(candidate, baseline)

For open-ended tasks, deterministic checks are usually extended with LLM-as-a-judge as a separate scoring layer. The baseline should also be versioned and tied to the exact model, prompt, and runtime config.
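A minimal comparison step might produce a per-case diff and refuse to compare reports produced under different pinned conditions. The meta fields and report shape below are illustrative assumptions:

```python
def compare(candidate, baseline):
    # Refuse to diff reports produced under different pinned conditions:
    # a metric change between them would be meaningless.
    if candidate["meta"] != baseline["meta"]:
        raise ValueError("runs not comparable: model/prompt config differs")
    diff = {}
    for case_id, passed in candidate["cases"].items():
        before = baseline["cases"].get(case_id)
        if passed != before:
            diff[case_id] = {"baseline": before, "candidate": passed}
    return diff

candidate = {
    "meta": {"model": "gpt-4o-2024-08-06", "prompt_version": "v12"},
    "cases": {"price_btc_basic": True, "price_eth_basic": False},
}
baseline = {
    "meta": {"model": "gpt-4o-2024-08-06", "prompt_version": "v12"},
    "cases": {"price_btc_basic": True, "price_eth_basic": True},
}
print(compare(candidate, baseline))
# {'price_eth_basic': {'baseline': True, 'candidate': False}}
```

A case-level diff like this surfaces the one regressed case directly, instead of burying it inside an aggregate success rate.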

4. Report and CI gate

PYTHON
summary = build_summary(candidate, diff)

if summary["task_success_rate"] < 0.92:
    fail("gate_failed:task_success_rate")
if summary["hallucination_rate"] > 0.03:
    fail("gate_failed:hallucination_rate")

write_json("reports/eval-summary.json", summary)

Good eval harness always stores artifacts: case-level outputs, failure reasons, diff against baseline, and final summary.
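One lightweight way to persist case-level artifacts is one JSON file per case. This is a sketch using only the standard library; the result fields are illustrative:

```python
import json
import tempfile
from pathlib import Path

def save_artifacts(run_results, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for result in run_results:
        # One file per case keeps failed-case review and cross-run diffs cheap.
        path = out / f"{result['case_id']}.json"
        path.write_text(json.dumps(result, indent=2))
    return sorted(p.name for p in out.iterdir())

artifact_dir = tempfile.mkdtemp()
files = save_artifacts(
    [{"case_id": "price_btc_basic", "selected_tool": "crypto_price_api",
      "passed": True, "failure_reason": None}],
    artifact_dir,
)
print(files)  # ['price_btc_basic.json']
```

Keying files by stable case ID means any past run can be diffed against any other without re-executing the agent.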

5. Release gate in overall strategy

Release-blocking criteria and CI gate thresholds are described in Testing Strategy, so they are not duplicated in every article.

Typical Mistakes

Unstable dataset

Scenarios keep changing "on the fly", so results from different runs cannot be compared fairly.

Typical cause: dataset is not versioned and does not have fixed case IDs.
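A lightweight guard is to hash the dataset content into a version string and enforce unique case IDs. A sketch, not tied to any framework:

```python
import hashlib
import json

def dataset_version(cases):
    # Content hash: editing any case yields a new version string, so each
    # report can record exactly which dataset it ran against.
    canonical = json.dumps(cases, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def validate_ids(cases):
    ids = [case["id"] for case in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case IDs break cross-run comparison")

cases = [{"id": "price_btc_basic", "input": "What is the price of BTC?"}]
validate_ids(cases)
version = dataset_version(cases)  # store this string in every eval report
```

If two reports carry different dataset version strings, their metrics are simply not comparable, and the harness can refuse the diff up front.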

Unpinned model version

LLM providers sometimes update models behind a generic name. If the version is not pinned, results can change between runs; pin a dated snapshot instead (for example, gpt-4o-2024-08-06).

Typical cause: a model alias (gpt-4o, sonnet) is used without version pinning.

Production systems usually pin a concrete model snapshot.

Manual runs instead of automation

Harness is executed only when "there is time", not on every meaningful change.

Typical cause: no CI integration and no clear pass/fail gate.

No comparison with baseline

Team looks only at absolute candidate metrics and misses subtle regressions.

Typical cause: report does not include diff between candidate and baseline.

Mixed deterministic and non-deterministic checks

Deterministic checks and LLM-as-a-judge are merged into one "total score", so it is hard to understand what failed.

Typical cause: no separate scoring sections for different check types.
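One way to avoid this is to keep deterministic checks and judge scores in separate report sections. Here judge_fn is a hypothetical stub standing in for an LLM judge call:

```python
# Sketch: deterministic checks and judge scores live in separate sections,
# so a report never folds them into one opaque "total score".
def score_case(run_result, case, judge_fn=None):
    deterministic = {
        "tool_selection": run_result["selected_tool"] == case["expected_tool"],
        "valid_output_schema": isinstance(run_result["output"], dict),
    }
    report = {
        "deterministic": deterministic,
        "deterministic_passed": all(deterministic.values()),
    }
    if judge_fn is not None:
        # Stored separately: a high judge score must never hide a hard failure.
        report["judge_score"] = judge_fn(run_result["output"])
    return report

report = score_case(
    {"selected_tool": "crypto_price_api", "output": {"price": 64000}},
    {"expected_tool": "crypto_price_api"},
    judge_fn=lambda output: 0.9,  # stub judge for illustration
)
```

With this shape, a reviewer can see at a glance whether a failure came from a hard check or from subjective judge scoring.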

Missing run artifacts

Only final success percentage is stored, without traces and case-level checks.

Typical cause: harness does not persist detailed outputs into report files.

Unstable eval runs

The same case passes and fails randomly, so teams stop trusting reports.

Typical cause: unstable external environment, missing mocks, floating timeouts, or inconsistent run conditions.
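A simple stability check is to run each case several times and flag cases whose result flips. This is a self-contained sketch with a toy run function standing in for the real runner:

```python
import itertools

# Flag cases whose pass/fail outcome is not consistent across repeats.
def find_flaky_cases(run_fn, cases, repeats=3):
    flaky = []
    for case in cases:
        outcomes = {run_fn(case) for _ in range(repeats)}
        if len(outcomes) > 1:  # both True and False were observed
            flaky.append(case["id"])
    return flaky

# Toy run function: "stable" always passes, "flaky" alternates.
toggle = itertools.cycle([True, False])
def run_fn(case):
    return True if case["id"] == "stable" else next(toggle)

print(find_flaky_cases(run_fn, [{"id": "stable"}, {"id": "flaky"}]))
# ['flaky']
```

Flaky cases are then fixed (mocks, timeouts, stricter run conditions) or quarantined, rather than left to erode trust in the whole report.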

Summary

Quick take
  • Eval harness makes agent testing repeatable and comparable.
  • Release decision should rely on candidate vs baseline diff, not manual examples.
  • Case-level artifacts matter as much as top-level metrics.
  • Without CI gate, eval harness becomes "report for report's sake".

FAQ

Q: Is eval harness just a test suite?
A: No. It is a managed process: dataset, runner, evaluators, baseline comparison, and CI gate.

Q: Can we skip LLM-as-a-judge?
A: Yes, if tasks are well covered by deterministic checks. For open tasks, LLM-as-a-judge is usually added as a separate scoring layer.

Q: How often should eval harness run?
A: At minimum on every change that can affect agent behavior: model, prompts, tools, or runtime rules.

Q: What is most important in first harness version?
A: Stable dataset, saved baseline, clear pass/fail thresholds, and run artifacts.

What Next

For the full picture, start with Testing Strategy. Then cover critical logic with Unit Testing, build stable Golden Datasets, and add Regression Testing for cross-version changes.

When first real incidents appear, add Replay and Debugging and include those cases in your eval harness dataset.

⏱️ 6 min read • Updated March 13, 2026 • Difficulty: ★★☆
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.