Replay and Debugging for AI Agents

Replay past agent runs to debug failures and understand why an agent made a specific decision.
On this page
  1. Idea In 30 Seconds
  2. Problem
  3. When To Use
  4. Implementation
  5. How It Works In One Investigation
  6. 1. Store trace with full context
  7. 2. Reproduce trace in same conditions
  8. 3. Analyze step timeline
  9. 4. Capture root cause in structured form
  10. 5. Add incident to regression set
  11. QA / Automation Notes
  12. Typical Mistakes
  13. Incomplete incident trace
  14. Replay in different runtime conditions
  15. Debugging only final text
  16. Root cause is not captured structurally
  17. Case is not added to regression
  18. Summary
  19. FAQ
  20. What Next

Idea In 30 Seconds

Replay for AI agents means taking a real problematic trace, reproducing it in controlled conditions, and finding the failure cause step by step.

Its main value is that the team does not have to guess at the incident cause. It sees the full chain of agent decisions and the exact point where behavior broke.

Problem

Without replay, teams often debug "from memory":

  • they look only at the final agent response;
  • they do not have the full input context;
  • they do not see step-level tool-call results.

In this mode, it is hard to separate symptom from cause. Fixes become imprecise, and incidents return.

Typical outcomes of this approach:

  • the error is fixed locally, but the real production scenario is never reproduced;
  • the same failure class appears again after release;
  • the team loses time on repeated manual investigations.

When To Use

Replay is used when:

  • a production incident happened and the root cause must be found;
  • an unexpected behavior diff appears after model or prompt changes;
  • a regression test shows a critical-case failure;
  • the team must confirm that a fix truly closes the incident scenario.

Replay is most useful in systems with multi-step agent behavior and external tools.

Implementation

In practice, replay follows one principle: same trace, same run conditions, step-by-step decision analysis. The examples below are schematic and not tied to a specific framework.

How It Works In One Investigation

Short replay investigation cycle
  • Trace - store the input, context, steps, and tool responses.
  • Replay - reproduce the same scenario in a controlled environment.
  • Step timeline - inspect where the agent made a wrong decision.
  • Root cause - capture the technical cause: prompt, model, tool, or runtime.
  • Fix and verify - apply the fix and confirm it with a rerun.
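The cycle above can be sketched as a single driver function. This is a minimal sketch, not a framework API: `run_replay` is a hypothetical callable that re-executes a trace in controlled conditions, and the step and report fields are assumptions.

```python
def investigate(trace, run_replay, expected_status="ok"):
    """Drive one replay investigation: replay the trace, locate the first
    deviating step, and produce a structured report skeleton.

    `run_replay` is a hypothetical hook that re-runs the trace in
    controlled conditions and returns step dicts with a "status" field.
    """
    replayed_steps = run_replay(trace)
    # Find the first step that leaves the expected scenario.
    first_bad = next(
        (
            (idx, step)
            for idx, step in enumerate(replayed_steps)
            if step["status"] != expected_status
        ),
        None,
    )
    return {
        "trace_id": trace["trace_id"],
        "first_bad_step": first_bad[0] if first_bad else None,
        "root_cause": None,  # filled in after manual analysis
    }
```

The structured return value feeds directly into the root-cause and regression steps described below.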

1. Store trace with full context

PYTHON
trace = {
    "trace_id": "incident-2026-03-11-42",
    "input": "Refund order #8472",
    "conversation_state": {"user_tier": "pro"},
    "steps": [
        {"tool": "payments_api", "args": {"order_id": "8472"}, "result": {"status": "timeout"}},
        {"tool": "fallback_policy", "args": {}, "result": {"action": "ask_for_retry"}}
    ],
    "final_output": "Please try again later.",
    "stop_reason": "fallback_used",
}

Without a full trace, replay almost never reproduces the real incident cause.
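One way to guard against incomplete traces is a small completeness check before replay. A sketch; the required field names are an assumption based on the trace structure above.

```python
# Fields a trace must carry to be worth replaying (assumed schema).
REQUIRED_TRACE_FIELDS = (
    "trace_id",
    "input",
    "conversation_state",
    "steps",
    "final_output",
    "stop_reason",
)

def missing_trace_fields(trace):
    """Return trace fields that are absent or empty, so a replay run
    can refuse traces that cannot reproduce the incident."""
    return [field for field in REQUIRED_TRACE_FIELDS if not trace.get(field)]
```

Rejecting incomplete traces up front is cheaper than debugging a replay that silently diverges from the incident.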

2. Reproduce trace in same conditions

PYTHON
def replay_trace(agent, trace, runtime_config):
    return agent.replay(
        trace=trace,
        model_version=runtime_config["model_version"],
        tool_mocks=runtime_config["tool_mocks"],
        timeout_sec=runtime_config["timeout_sec"],
    )

If the model, timeouts, or tool conditions differ, a replay can look falsely safe.
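To avoid that drift, the runtime config can be captured alongside the trace at incident time and checked verbatim before replay. A minimal sketch; the config field names are assumptions matching the `runtime_config` above.

```python
def snapshot_runtime(model_version, timeout_sec, tool_mocks=None):
    """Freeze the runtime conditions of the original run so a later
    replay uses exactly the same model, timeouts, and tool behavior."""
    return {
        "model_version": model_version,
        "timeout_sec": timeout_sec,
        "tool_mocks": tool_mocks or {},
    }

def assert_same_runtime(incident_cfg, replay_cfg):
    """Fail loudly if the replay would run under different conditions."""
    diff = {
        key: (incident_cfg[key], replay_cfg.get(key))
        for key in incident_cfg
        if incident_cfg[key] != replay_cfg.get(key)
    }
    if diff:
        raise ValueError(f"runtime mismatch: {diff}")
```

Calling `assert_same_runtime` at the start of every replay turns a silent false-safe result into an immediate, visible error.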

3. Analyze step timeline

PYTHON
def find_first_bad_step(replayed_steps):
    return next(
        ((idx, step) for idx, step in enumerate(replayed_steps) if step["status"] == "unexpected"),
        None,
    )

The core debugging goal is to find the first step where the system leaves the expected scenario.
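Once the first bad step is located, a useful follow-up is to diff the replayed step against the recorded incident step to see exactly what changed. A sketch assuming both steps are flat dicts; `diff_step` is a hypothetical helper, not part of the code above.

```python
def diff_step(recorded, replayed):
    """Return the keys whose values differ between the recorded incident
    step and the replayed step, e.g. a tool result that changed."""
    keys = set(recorded) | set(replayed)
    return {
        key: {"recorded": recorded.get(key), "replayed": replayed.get(key)}
        for key in keys
        if recorded.get(key) != replayed.get(key)
    }
```

An empty diff means the replay faithfully reproduced the step; a non-empty diff points at the exact value that diverged.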

4. Capture root cause in structured form

PYTHON
incident_report = {
    "trace_id": "incident-2026-03-11-42",
    "root_cause": "tool_timeout_not_handled_as_retryable",
    "affected_component": "retry_policy",
    "fix_plan": "treat payments timeout as retryable before fallback",
}

A structured root cause makes fix validation and team knowledge transfer easier.

5. Add incident to regression set

PYTHON
def promote_to_regression_case(trace, report):
    return {
        "id": trace["trace_id"],
        "input": trace["input"],
        "expected_behavior": {"stop_reason": "resolved"},
        "tags": ["incident", "replay", report["affected_component"]],
    }

After a replay investigation, the case should go into a regression or golden dataset; otherwise the incident can repeat.

QA / Automation Notes

QA teams typically use replay to reproduce production failures in a controlled environment and investigate incident causes step by step.

In practice, this works as an automated cycle: the incident trace is fed into a replay run, results are recorded in an investigation report, and the confirmed case is added to the regression set.
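That automated cycle can be glued together like this. Every callable here is a hypothetical hook into team-specific tooling; the sketch only fixes the order of the steps.

```python
def replay_pipeline(trace, run_replay, write_report, promote_case):
    """One automated pass: replay the incident, record the findings,
    and hand confirmed cases to the regression set.

    `run_replay`, `write_report`, and `promote_case` are placeholders
    for team-specific integrations.
    """
    steps = run_replay(trace)
    # The incident is confirmed if any replayed step deviates.
    reproduced = any(step.get("status") == "unexpected" for step in steps)
    report = {"trace_id": trace["trace_id"], "reproduced": reproduced}
    write_report(report)
    if reproduced:
        promote_case(trace)
    return report
```

Keeping the pipeline this small makes it easy to run the same pass from a CI job or an on-call runbook.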

Typical Mistakes

Incomplete incident trace

Logs contain the final response, but no agent steps and no tool results.

Typical cause: only a summary is stored, without step-level details.

Replay in different runtime conditions

The trace is replayed on a different model or with different timeout/retry settings.

Typical cause: the incident's runtime conditions were never pinned.

Debugging only final text

The team analyzes only the last response and misses a failure cause in the middle of the run.

Typical cause: no step-by-step timeline of agent decisions.

Root cause is not captured structurally

After an incident, the team has a verbal conclusion but no clear technical record.

Typical cause: a missing incident-report template.

Case is not added to regression

The incident was fixed but never added to the permanent test set.

Typical cause: the replay investigation is disconnected from the regression workflow.

Summary

Quick take
  • Replay and debugging provide reproducible analysis of production incidents.
  • High-quality replay requires the same trace and the same runtime conditions.
  • Debugging should follow agent steps, not only final text.
  • After fix, incident case should be promoted to regression or golden dataset.

FAQ

Q: How is replay different from regression testing?
A: Regression compares system versions on case sets, while replay reproduces one real incident to find its root cause.

Q: What is the minimum required for quality replay?
A: The input, context state, step-by-step tool calls with their results, the stop_reason, and the runtime config.

Q: Can replay be done without production API access?
A: Yes. Teams usually use stored responses or mocks to reproduce incident logic in a stable way.
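A simple mock built from the stored trace can stand in for the production API. A sketch assuming each recorded step carries the tool name and its result, as in the trace example earlier; `mocks_from_trace` is a hypothetical helper.

```python
def mocks_from_trace(trace):
    """Build per-tool mock callables that replay the recorded results,
    so no production API access is needed during replay."""
    recorded = {}
    for step in trace["steps"]:
        # Group recorded results by tool, preserving call order.
        recorded.setdefault(step["tool"], []).append(step["result"])

    def make_mock(results):
        it = iter(results)
        # Each call returns the next recorded result, ignoring args.
        return lambda **kwargs: next(it)

    return {tool: make_mock(results) for tool, results in recorded.items()}
```

The resulting dict can be passed as `tool_mocks` in the runtime config, so the replayed agent sees exactly the tool behavior the incident saw.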

Q: When is a replay case considered closed?
A: When the fix passes a rerun of the replay and the same scenario consistently passes in the regression set.

What Next

After replay investigations, add incident cases to Golden Datasets and validate them with Regression Testing. Use Eval Harness for standardized runs, and Unit Testing for local logic checks.

Updated March 13, 2026
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.