AI Agent Logging (What to Log, What to Redact, What to Alert On)

Agent logging that actually helps in incidents: trace IDs, tool-call events, stop reasons, redaction, and alerting. Includes Python + JS snippets.
On this page
  1. Problem (why you’re here)
  2. Why this fails in production
  3. Diagram: the minimum event pipeline
  4. Real code: instrument the tool gateway (Python + JS)
  5. Real failure (incident-style, with numbers)
  6. Trade-offs
  7. When NOT to do this
  8. Copy-paste checklist
  9. Safe default config snippet (YAML)
  10. Implement in OnceOnly (optional)
  11. FAQ

Problem (why you’re here)

In dev, your agent “works”.

In prod, it does something weird once every 200 runs:

  • a customer report says “it emailed the wrong thing”
  • costs spike for 15 minutes
  • the agent loops on a flaky API and times out

And you’ve got basically nothing:

  • one “final answer”
  • a few console logs
  • maybe a tool error string without context

So you do the worst kind of debugging: guesswork with a credit card attached.

This page is about logging that makes incidents boring again.

Why this fails in production

Agents fail like distributed systems because they are distributed systems:

  • the model is an unreliable planner
  • tools are side effects (HTTP/DB/ticketing/email)
  • retries and timeouts create emergent behavior

If you don’t log the loop, you can’t answer basic incident questions:

  • What tool calls happened? In what order?
  • What arguments were used (or at least which args-hash)?
  • What did the tool return (or what did we redact)?
  • Why did the run stop (stop_reason)?
  • Which user/request triggered it?

If you’re not logging stop_reason, you’re not “observing” anything. You’re collecting vibes.
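
A workable stop_reason taxonomy is just a small enum logged on the run’s final event. The names below match the ones used later on this page (max_steps, max_tool_calls, loop_detected); the enum itself is an illustrative sketch:

```python
from enum import Enum


class StopReason(str, Enum):
    """Why a run ended. Log exactly one of these per run."""

    COMPLETED = "completed"            # model produced a final answer
    MAX_STEPS = "max_steps"            # step budget exhausted
    MAX_TOOL_CALLS = "max_tool_calls"  # tool-call budget exhausted
    LOOP_DETECTED = "loop_detected"    # same (tool, args_hash) repeated
    TIMEOUT = "timeout"                # wall-clock budget exhausted
    ERROR = "error"                    # unrecoverable tool/model error
```

Subclassing `str` means the values serialize cleanly into JSON events without extra conversion.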

Diagram: the minimum event pipeline
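
The flow in text form; every arrow is a hop where events must carry the same run_id/trace_id:

```
agent loop ──► tool gateway ──► structured events (tool_call / tool_result / stop_reason)
                                      │
                                      ▼
                        event store (queryable by run_id / trace_id)
                                      │
                                      ▼
                     alerts (loops, timeouts, validation failures)
```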

Real code: instrument the tool gateway (Python + JS)

Start with the boundary. Tools are where the money and damage live.

We log:

  • run_id, trace_id, tool_name
  • args_hash (not raw args by default)
  • latency + status
  • error_class (normalized)

And we make it hard to “forget” to log by forcing everything through a gateway.

PYTHON
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


def stable_hash(obj: Any) -> str:
  raw = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
  return hashlib.sha256(raw).hexdigest()


@dataclass(frozen=True)
class RunCtx:
  run_id: str
  trace_id: str
  user_id: Optional[str] = None
  request_id: Optional[str] = None


class Logger:
  def event(self, name: str, fields: Dict[str, Any]) -> None: ...


class ToolGateway:
  def __init__(self, *, impls: dict[str, Callable[..., Any]], logger: Logger):
      self.impls = impls
      self.logger = logger

  def call(self, ctx: RunCtx, name: str, args: Dict[str, Any]) -> Any:
      fn = self.impls.get(name)
      if not fn:
          self.logger.event("tool_call", {
              "run_id": ctx.run_id,
              "trace_id": ctx.trace_id,
              "tool": name,
              "args_hash": stable_hash(args),
              "ok": False,
              "error_class": "unknown_tool",
          })
          raise RuntimeError(f"unknown tool: {name}")

      t0 = time.time()
      self.logger.event("tool_call", {
          "run_id": ctx.run_id,
          "trace_id": ctx.trace_id,
          "tool": name,
          "args_hash": stable_hash(args),
      })

      try:
          out = fn(**args)
          self.logger.event("tool_result", {
              "run_id": ctx.run_id,
              "trace_id": ctx.trace_id,
              "tool": name,
              "latency_ms": int((time.time() - t0) * 1000),
              "ok": True,
          })
          return out
      except TimeoutError:
          self.logger.event("tool_result", {
              "run_id": ctx.run_id,
              "trace_id": ctx.trace_id,
              "tool": name,
              "latency_ms": int((time.time() - t0) * 1000),
              "ok": False,
              "error_class": "timeout",
          })
          raise
      except Exception as e:
          self.logger.event("tool_result", {
              "run_id": ctx.run_id,
              "trace_id": ctx.trace_id,
              "tool": name,
              "latency_ms": int((time.time() - t0) * 1000),
              "ok": False,
              "error_class": type(e).__name__,
          })
          raise
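
The `Logger` above is only an interface. A minimal JSON-lines implementation is enough to make the gateway’s events greppable; the `JsonlLogger` name and `stream` parameter are illustrative, not from a real library:

```python
import json
import sys
import time
from typing import Any, Dict


class JsonlLogger:
    """Writes one JSON object per line; point `stream` at a file or shipper in prod."""

    def __init__(self, stream=sys.stdout):
        self.stream = stream

    def event(self, name: str, fields: Dict[str, Any]) -> None:
        record = {"event": name, "ts": time.time(), **fields}
        self.stream.write(json.dumps(record, sort_keys=True) + "\n")


# Usage with the gateway above (hypothetical tool impl):
# gateway = ToolGateway(impls={"http.get": my_http_get}, logger=JsonlLogger())
# gateway.call(ctx, "http.get", {"url": "https://example.com"})
```
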
JAVASCRIPT
import crypto from "node:crypto";

function sortKeys(value) {
  // Recursively sort object keys so the hash is order-independent,
  // matching the Python version's sort_keys=True.
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value).sort().map((k) => [k, sortKeys(value[k])])
    );
  }
  return value;
}

export function stableHash(obj) {
  const raw = JSON.stringify(sortKeys(obj));
  return crypto.createHash("sha256").update(raw).digest("hex");
}

export class ToolGateway {
constructor({ impls = {}, logger }) {
  this.impls = impls;
  this.logger = logger;
}

call(ctx, name, args) {
  const fn = this.impls[name];
  const argsHash = stableHash(args);

  if (!fn) {
    this.logger.event("tool_call", {
      run_id: ctx.run_id,
      trace_id: ctx.trace_id,
      tool: name,
      args_hash: argsHash,
      ok: false,
      error_class: "unknown_tool",
    });
    throw new Error("unknown tool: " + name);
  }

  const t0 = Date.now();
  this.logger.event("tool_call", {
    run_id: ctx.run_id,
    trace_id: ctx.trace_id,
    tool: name,
    args_hash: argsHash,
  });

  try {
    const out = fn(args);
    this.logger.event("tool_result", {
      run_id: ctx.run_id,
      trace_id: ctx.trace_id,
      tool: name,
      latency_ms: Date.now() - t0,
      ok: true,
    });
    return out;
  } catch (e) {
    this.logger.event("tool_result", {
      run_id: ctx.run_id,
      trace_id: ctx.trace_id,
      tool: name,
      latency_ms: Date.now() - t0,
      ok: false,
      error_class: e?.name || "Error",
    });
    throw e;
  }
}
}

If you’re not already doing it, pair this with:

  • budgets (/en/governance/budget-controls)
  • tool dedupe to reduce spam (/en/failures/tool-spam)
  • and unit tests that assert your stop reasons don’t drift (/en/testing-evaluation/unit-testing-agents)

Real failure (incident-style, with numbers)

We shipped a “read-only” research agent that called http.get.

Everything looked fine until an upstream partner API started returning 200s with error payloads (yep). Our tool wrapper treated “200 == ok” and logged only “success”.

Impact:

  • ~18% of runs returned confidently wrong summaries for ~2 hours
  • users filed ~30 tickets
  • on-call time: ~4 hours to confirm it wasn’t “the model hallucinating”

The fix was boring and effective:

  1. log normalized error_class and response validation failures
  2. store args_hash + latency so we could find hot spots
  3. add an alert: validation_fail_rate > 2% for 5 minutes

You don’t need perfect logs. You need logs that answer “what happened?” in under 10 minutes.
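
The alert from step 3 can be sketched as a rolling failure rate over a sliding time window. The class name and thresholds are illustrative (defaults mirror the incident fix: > 2% failures over 5 minutes); a real deployment would express this in your metrics backend:

```python
import time
from collections import deque
from typing import Optional


class RateAlert:
    """Fires when the failure rate over a sliding window exceeds a threshold."""

    def __init__(self, threshold: float = 0.02, window_s: float = 300.0):
        self.threshold = threshold
        self.window_s = window_s
        self.samples = deque()  # (timestamp, ok) pairs

    def observe(self, ok: bool, now: Optional[float] = None) -> bool:
        """Record one validation result; return True if the alert should fire."""
        now = time.time() if now is None else now
        self.samples.append((now, ok))
        # Drop samples that fell out of the window.
        cutoff = now - self.window_s
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
        fails = sum(1 for _, sample_ok in self.samples if not sample_ok)
        return fails / len(self.samples) > self.threshold
```
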

Trade-offs

  • Logging raw tool args is useful and also how you leak PII. Default to args_hash.
  • Storing full tool results makes debugging easy and compliance painful. Prefer sampling + redaction.
  • Too much logging is its own outage. Start with events you alert on.

When NOT to do this

  • If the agent runs only in a trusted local environment, you can be lazier (for a while).
  • If you’re still prototyping the loop shape daily, keep logs lightweight but consistent (IDs + stop reasons).
  • Don’t build a custom tracing system if you can’t keep it running. Use something boring.

Copy-paste checklist

  • [ ] run_id, trace_id, request_id, user_id on every event
  • [ ] tool_call + tool_result events (name, args_hash, latency, ok, error_class)
  • [ ] stop_reason + budgets at end of run
  • [ ] Redaction policy (PII, secrets) + default to storing hashes
  • [ ] Alerts: spikes in tool calls/run, timeouts, validation fails
  • [ ] One “incident query” per top failure (saved search / dashboard)
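
The loop alert in the checklist reduces to counting repeated (tool, args_hash) pairs within one run. A sketch, assuming events shaped like the gateway emits (the helper name and threshold are assumptions):

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def find_loops(
    events: Iterable[Dict], max_repeats: int = 3
) -> Dict[Tuple[str, str], int]:
    """Return (tool, args_hash) pairs seen >= max_repeats times in a run."""
    counts = Counter(
        (e["tool"], e["args_hash"])
        for e in events
        if "tool" in e and "args_hash" in e
    )
    return {pair: n for pair, n in counts.items() if n >= max_repeats}
```
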

Safe default config snippet (YAML)

YAML
logging:
  ids:
    run_id: required
    trace_id: required
    request_id: required
  tool_calls:
    enabled: true
    store_args: false
    store_args_hash: true
    store_results: "sampled"   # none|sampled|full
    result_sample_rate: 0.01
  pii:
    redact_fields: ["email", "phone", "token", "authorization", "cookie"]
  stop_reasons:
    enabled: true
alerts:
  tool_calls_per_run_p95: { warn: 10, critical: 20 }
  timeout_rate: { warn: 0.02, critical: 0.05 }
  validation_fail_rate: { warn: 0.02, critical: 0.05 }
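
The `redact_fields` list above can be applied with a small recursive helper before anything hits the log. A sketch; the placeholder string and case-insensitive key matching are assumptions:

```python
from typing import Any

# Mirrors pii.redact_fields from the config above.
REDACT_FIELDS = {"email", "phone", "token", "authorization", "cookie"}


def redact(value: Any, fields=REDACT_FIELDS) -> Any:
    """Recursively replace sensitive fields with a placeholder."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in fields else redact(v, fields)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v, fields) for v in value]
    return value
```
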

Implement in OnceOnly (optional)

Log tool calls with args hashes + stop reasons (safe by default):
# onceonly-python: governed audit logs + metrics
import os
from onceonly import OnceOnly

client = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"])
agent_id = "support-bot"

# Pull last 50 actions (includes args_hash + decisions)
for e in client.gov.agent_logs(agent_id, limit=50):
    print(e.ts, e.tool, e.decision, e.args_hash, e.spend_usd, e.reason)

# Rollups for dashboards/alerts
m = client.gov.agent_metrics(agent_id, period="day")
print("spend_usd=", m.total_spend_usd, "blocked=", m.blocked_actions)

FAQ

Should I log raw tool args?
Default to no. Log args_hash + validated, non-sensitive fields. Turn on raw args only for short incident windows with redaction.
What’s the single most useful field?
A stable run_id/trace_id on every event. Without it you can’t reconstruct anything.
How do I detect loops quickly?
Alert on tool_calls/run and repeated args_hash for the same tool. Pair with stop_reason taxonomy (max_steps, max_tool_calls, loop_detected).
Do I need distributed tracing?
If tools cross services, yes. Start simple: trace_id propagation + a few spans around tool calls.


⏱ 7 min read • Updated Mar 2026 • Difficulty: ★★★
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.