Eval Harness for AI Agents: Repeatable Evaluations

An eval harness lets you run repeatable tests for AI agents and compare results across versions.
On this page
  1. Idea In 30 Seconds
  2. Problem
  3. Core Concept / Model
  4. How It Works
  5. Implementation
     • 1. Test case structure
     • 2. Runner for case execution
     • 3. Scoring and baseline comparison
     • 4. Report and CI gate
     • 5. Release gate in overall strategy
  6. Typical Mistakes
     • Unstable dataset
     • Unpinned model version
     • Manual runs instead of automation
     • No comparison with baseline
     • Mixed deterministic and non-deterministic checks
     • Missing run artifacts
     • Unstable eval runs
  7. Summary
  8. FAQ
  9. What Next

Idea In 30 Seconds

An eval harness runs the same scenario set against an agent, scores the results with the same rules, and compares the candidate against a baseline.


Problem

Without an eval harness, teams often test agents manually:

  • run several chat requests;
  • review a few sample answers;
  • conclude that a change looks safe.

This does not provide a stable signal: a change can look fine on random examples, yet break critical production scenarios.

Most common outcomes:

  • impossible to compare candidate and baseline fairly;
  • difficult to reproduce regressions;
  • CI has no clear rule for when to block release.

Core Concept / Model

An eval harness is not a single test. It is a validation pipeline: a fixed dataset, controlled run conditions, scoring, comparison against a baseline, and reporting.

Component           | What it does
Dataset             | Stores stable scenarios and expected behavior
Runner              | Runs the agent on each scenario under the same conditions and collects run outputs
Evaluators          | Apply deterministic checks, LLM-as-a-judge scoring, and quality metrics
Baseline comparator | Compares the candidate against the baseline
Report + CI gate    | Builds the summary and decides pass/fail for release

The more stable these components are, the lower the chance that a diff between candidate and baseline is caused by run conditions rather than a real change in behavior.

How It Works

In practice, the eval harness runs as part of the release pipeline. Every change goes through the same scenario set.

How one eval harness run works
  • Dataset - fixed set of cases is loaded.
  • Runner - agent runs each case in identical conditions.
  • Evaluators - deterministic checks and, when needed, LLM-as-a-judge scoring are applied.
  • Baseline comparison - candidate is compared with baseline on the same cases.
  • Report - case-level report and overall summary are saved.
  • Gate - CI passes or blocks release based on thresholds.
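The steps above can be sketched as a single loop. This is a schematic, self-contained example with a toy agent and illustrative thresholds, not a real framework API:

```python
# Schematic sketch of one harness run: dataset -> runner -> evaluator ->
# baseline comparison -> gate. Names and thresholds are illustrative.

def run_harness(agent_fn, dataset, baseline_rate, threshold):
    passed = []
    for case in dataset:                           # Runner: same conditions per case
        output = agent_fn(case["input"])
        passed.append(output == case["expected"])  # Evaluator: deterministic check
    success_rate = sum(passed) / len(passed)
    return {
        "success_rate": success_rate,
        "diff_vs_baseline": success_rate - baseline_rate,  # Baseline comparison
        "gate_passed": success_rate >= threshold,          # CI gate decision
    }

# Toy agent and fixed dataset to show the flow end to end.
dataset = [
    {"id": "upper_1", "input": "btc", "expected": "BTC"},
    {"id": "upper_2", "input": "eth", "expected": "ETH"},
]
report = run_harness(lambda text: text.upper(), dataset,
                     baseline_rate=1.0, threshold=0.9)
print(report["gate_passed"])  # True: success rate 1.0 clears the 0.9 gate
```

A real harness replaces the toy agent and the equality check with the runner and evaluators described below, but the shape of the loop stays the same.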

An eval harness does not replace unit tests. Unit tests validate local components, while the harness validates full-system behavior on end-to-end scenarios.

Implementation

In practice, an eval harness relies on a few simple rules. The examples below are schematic and not tied to any specific framework.

1. Test case structure

PYTHON
case = {
    "id": "price_btc_basic",
    "input": "What is the price of BTC?",
    "expected_tool": "crypto_price_api",
    "checks": ["tool_selection", "valid_output_schema"],
}

Clear cases make regression analysis easier and reduce ambiguity during run review.

2. Runner for case execution

PYTHON
def run_case(agent, case):
    result = agent.run(case["input"])
    return {
        "case_id": case["id"],
        "selected_tool": result.selected_tool,
        "output": result.output,
        "stop_reason": result.stop_reason,
    }

The new version and the baseline must run under identical conditions: the same timeouts, tool mocks, limits, and runtime environment settings.
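One way to enforce this is a single frozen config object that is passed to both runs and recorded in every run result. Field names here are illustrative assumptions:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunConfig:
    model: str = "gpt-4o-2024-08-06"  # pinned snapshot, not a floating alias
    timeout_s: int = 30
    max_steps: int = 8
    use_tool_mocks: bool = True

def run_case(agent_fn, case, config):
    # Recording the exact config in every result lets a later diff be
    # traced back to run conditions instead of guessed at.
    output = agent_fn(case["input"], config)
    return {"case_id": case["id"], "output": output, "config": asdict(config)}

config = RunConfig()  # the same frozen object for candidate and baseline runs
result = run_case(lambda text, cfg: text.upper(),
                  {"id": "demo", "input": "btc"}, config)
print(result["config"]["model"])  # gpt-4o-2024-08-06
```

Because the dataclass is frozen, neither run can mutate the conditions mid-suite.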

3. Scoring and baseline comparison

PYTHON
def evaluate_case(run_result, case):
    checks = {
        "tool_selection": run_result["selected_tool"] == case["expected_tool"],
        "valid_output_schema": isinstance(run_result["output"], dict),
    }
    return {"passed": all(checks.values()), "checks": checks}

candidate = run_eval_suite(agent=candidate_agent, dataset=dataset)
baseline = load_baseline_report("reports/baseline.json")
diff = compare(candidate, baseline)

For open-ended tasks, deterministic checks are usually extended with LLM-as-a-judge as a separate scoring layer. The baseline should also be versioned and tied to the exact model, prompt, and runtime config.
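A minimal comparison step might produce a per-case diff and refuse to compare reports produced under different pinned conditions. The meta fields and report shape below are illustrative assumptions:

```python
def compare(candidate, baseline):
    # Refuse to diff reports produced under different pinned conditions:
    # a metric change between them would be meaningless.
    if candidate["meta"] != baseline["meta"]:
        raise ValueError("runs not comparable: model/prompt config differs")
    diff = {}
    for case_id, passed in candidate["cases"].items():
        before = baseline["cases"].get(case_id)
        if passed != before:
            diff[case_id] = {"baseline": before, "candidate": passed}
    return diff

candidate = {
    "meta": {"model": "gpt-4o-2024-08-06", "prompt_version": "v12"},
    "cases": {"price_btc_basic": True, "price_eth_basic": False},
}
baseline = {
    "meta": {"model": "gpt-4o-2024-08-06", "prompt_version": "v12"},
    "cases": {"price_btc_basic": True, "price_eth_basic": True},
}
print(compare(candidate, baseline))
# {'price_eth_basic': {'baseline': True, 'candidate': False}}
```

A case-level diff like this surfaces the one regressed case directly, instead of burying it inside an aggregate success rate.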

4. Report and CI gate

PYTHON
summary = build_summary(candidate, diff)

if summary["task_success_rate"] < 0.92:
    fail("gate_failed:task_success_rate")
if summary["hallucination_rate"] > 0.03:
    fail("gate_failed:hallucination_rate")

write_json("reports/eval-summary.json", summary)

Good eval harness always stores artifacts: case-level outputs, failure reasons, diff against baseline, and final summary.
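One lightweight way to persist case-level artifacts is one JSON file per case. This is a sketch using only the standard library; the result fields are illustrative:

```python
import json
import tempfile
from pathlib import Path

def save_artifacts(run_results, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for result in run_results:
        # One file per case keeps failed-case review and cross-run diffs cheap.
        path = out / f"{result['case_id']}.json"
        path.write_text(json.dumps(result, indent=2))
    return sorted(p.name for p in out.iterdir())

artifact_dir = tempfile.mkdtemp()
files = save_artifacts(
    [{"case_id": "price_btc_basic", "selected_tool": "crypto_price_api",
      "passed": True, "failure_reason": None}],
    artifact_dir,
)
print(files)  # ['price_btc_basic.json']
```

Keying files by stable case ID means any past run can be diffed against any other without re-executing the agent.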

5. Release gate in overall strategy

Release-blocking criteria and CI gate thresholds are described in Testing Strategy, so they are not duplicated in every article.

Typical Mistakes

Unstable dataset

Scenarios keep changing "on the fly", so results from different runs cannot be compared fairly.

Typical cause: dataset is not versioned and does not have fixed case IDs.
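A lightweight guard is to hash the dataset content into a version string and enforce unique case IDs. A sketch, not tied to any framework:

```python
import hashlib
import json

def dataset_version(cases):
    # Content hash: editing any case yields a new version string, so each
    # report can record exactly which dataset it ran against.
    canonical = json.dumps(cases, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def validate_ids(cases):
    ids = [case["id"] for case in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case IDs break cross-run comparison")

cases = [{"id": "price_btc_basic", "input": "What is the price of BTC?"}]
validate_ids(cases)
version = dataset_version(cases)  # store this string in every eval report
```

If two reports carry different dataset version strings, their metrics are simply not comparable, and the harness can refuse the diff up front.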

Unpinned model version

LLM providers sometimes update models behind a generic name. If the version is not pinned, results can change between runs; pin a dated snapshot instead (for example, gpt-4o-2024-08-06).

Typical cause: a model alias (gpt-4o, sonnet) is used without version pinning.

Production systems usually pin a concrete model snapshot.

Manual runs instead of automation

Harness is executed only when "there is time", not on every meaningful change.

Typical cause: no CI integration and no clear pass/fail gate.

No comparison with baseline

Team looks only at absolute candidate metrics and misses subtle regressions.

Typical cause: report does not include diff between candidate and baseline.

Mixed deterministic and non-deterministic checks

Deterministic checks and LLM-as-a-judge are merged into one "total score", so it is hard to understand what failed.

Typical cause: no separate scoring sections for different check types.
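One way to avoid this is to keep deterministic checks and judge scores in separate report sections. Here judge_fn is a hypothetical stub standing in for an LLM judge call:

```python
# Sketch: deterministic checks and judge scores live in separate sections,
# so a report never folds them into one opaque "total score".
def score_case(run_result, case, judge_fn=None):
    deterministic = {
        "tool_selection": run_result["selected_tool"] == case["expected_tool"],
        "valid_output_schema": isinstance(run_result["output"], dict),
    }
    report = {
        "deterministic": deterministic,
        "deterministic_passed": all(deterministic.values()),
    }
    if judge_fn is not None:
        # Stored separately: a high judge score must never hide a hard failure.
        report["judge_score"] = judge_fn(run_result["output"])
    return report

report = score_case(
    {"selected_tool": "crypto_price_api", "output": {"price": 64000}},
    {"expected_tool": "crypto_price_api"},
    judge_fn=lambda output: 0.9,  # stub judge for illustration
)
```

With this shape, a reviewer can see at a glance whether a failure came from a hard check or from subjective judge scoring.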

Missing run artifacts

Only final success percentage is stored, without traces and case-level checks.

Typical cause: harness does not persist detailed outputs into report files.

Unstable eval runs

The same case passes and fails randomly, so teams stop trusting reports.

Typical cause: unstable external environment, missing mocks, floating timeouts, or inconsistent run conditions.
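A simple stability check is to run each case several times and flag cases whose result flips. This is a self-contained sketch with a toy run function standing in for the real runner:

```python
import itertools

# Flag cases whose pass/fail outcome is not consistent across repeats.
def find_flaky_cases(run_fn, cases, repeats=3):
    flaky = []
    for case in cases:
        outcomes = {run_fn(case) for _ in range(repeats)}
        if len(outcomes) > 1:  # both True and False were observed
            flaky.append(case["id"])
    return flaky

# Toy run function: "stable" always passes, "flaky" alternates.
toggle = itertools.cycle([True, False])
def run_fn(case):
    return True if case["id"] == "stable" else next(toggle)

print(find_flaky_cases(run_fn, [{"id": "stable"}, {"id": "flaky"}]))
# ['flaky']
```

Flaky cases are then fixed (mocks, timeouts, stricter run conditions) or quarantined, rather than left to erode trust in the whole report.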

Summary

Quick take
  • Eval harness makes agent testing repeatable and comparable.
  • Release decision should rely on candidate vs baseline diff, not manual examples.
  • Case-level artifacts matter as much as top-level metrics.
  • Without CI gate, eval harness becomes "report for report's sake".

FAQ

Q: Is eval harness just a test suite?
A: No. It is a managed process: dataset, runner, evaluators, baseline comparison, and CI gate.

Q: Can we skip LLM-as-a-judge?
A: Yes, if tasks are well covered by deterministic checks. For open tasks, LLM-as-a-judge is usually added as a separate scoring layer.

Q: How often should eval harness run?
A: At minimum on every change that can affect agent behavior: model, prompts, tools, or runtime rules.

Q: What is most important in first harness version?
A: Stable dataset, saved baseline, clear pass/fail thresholds, and run artifacts.

What Next

For the full picture, start with Testing Strategy. Then cover critical logic with Unit Testing, build stable Golden Datasets, and add Regression Testing for cross-version changes.

When first real incidents appear, add Replay and Debugging and include those cases in your eval harness dataset.

⏱️ 6 min read • Updated March 13, 2026 • Difficulty: ★★☆
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.