No Monitoring (Anti-Pattern) + What to Log + Code

  • Recognize the trap before it ships to prod.
  • See what breaks when the model is confidently wrong.
  • Copy safer defaults: permissions, budgets, idempotency.
  • Know when you shouldn’t use an agent at all.
Detection signals
  • Tool calls per run spikes (or the same args hash keeps repeating).
  • Spend or tokens per request climbs without better outputs.
  • Retries shift from rare to constant (429/5xx).
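The repeated-args-hash signal can be checked mechanically once tool calls are logged with an args hash. A minimal sketch (field names assume the unified event schema shown later on this page):

```python
from collections import Counter


def detect_tool_loops(events: list[dict], threshold: int = 3) -> list[tuple[str, str]]:
    """Flag (tool, args_sha) pairs repeated at least `threshold` times in one run.

    Identical arguments repeated over and over usually mean the agent is
    looping, not making progress.
    """
    counts = Counter(
        (ev["tool"], ev["args_sha"])
        for ev in events
        if ev.get("kind") == "tool_result"
    )
    return [pair for pair, n in counts.items() if n >= threshold]


events = [
    {"kind": "tool_result", "tool": "http.get", "args_sha": "abc"},
    {"kind": "tool_result", "tool": "http.get", "args_sha": "abc"},
    {"kind": "tool_result", "tool": "http.get", "args_sha": "abc"},
    {"kind": "tool_result", "tool": "kb.read", "args_sha": "def"},
    {"kind": "stop", "stop_reason": "budget_exceeded"},
]
print(detect_tool_loops(events))  # [('http.get', 'abc')]
```

Run this per completed run (or over a sliding window) and feed the result into the loop_detected stop reason.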
If you can’t answer 'what did the agent do?' you can’t run it in production. The minimum stack: traces, stop reasons, spend, and tool call logs.
On this page
  1. Problem-first intro
  2. The 03:00 moment
  3. Why this fails in production
  4. 1) Agents are distributed systems with extra steps
  5. 2) “Success rate” hides the interesting failures
  6. 3) You can’t fix what you can’t replay
  7. 4) Monitoring is part of governance
  8. Hard invariants (non-negotiables)
  9. Implementation example (real code)
  10. Example failure case (concrete)
  11. 🚨 Incident: “Everything is slow” (and we didn’t know why)
  12. Dashboards + alerts (examples you can steal)
  13. PromQL examples (Grafana)
  14. SQL example (Postgres/BigQuery-style)
  15. Alert rules (plain English)
  16. Trade-offs
  17. When NOT to use
  18. Copy-paste checklist
  19. Safe default config
  20. FAQ
  21. Related pages
  22. Production takeaway
  23. What breaks without this
  24. What works with this
  25. Minimum to ship
Quick take

Without observability, every agent failure becomes “the model was weird” — an untestable, unfixable diagnosis. You need tool call traces, stop reasons, cost tracking, and replay capability. This isn’t optional infrastructure.

You'll learn: Minimum monitoring requirements • One unified event schema • Stop reason taxonomy • Replay basics • A concrete incident you can recognize

Concrete metric

Without monitoring: users report issues first • debugging by vibes • no replay
With minimal monitoring: detect drift early • debug via traces + stop reasons • replay last runs
Impact: faster incident response + fewer repeated failures (because you can fix root cause)


Problem-first intro

An agent run goes wrong.

A user reports: “it sent the wrong email.”

You open logs and you have:

  • The final answer text (maybe)
  • A stack trace (maybe)
  • Vibes (definitely)

If you can’t answer these five questions, the system isn’t operable:

Incident questions
  1. Which tools were called (and in what order)?
  2. With what arguments (or at least args hashes)?
  3. What came back (or at least snapshot hashes)?
  4. Which version of model/prompt/tools was running?
  5. Why did it stop?
Truth

That’s not “missing dashboards”. That’s “this system is not operable”.

The 03:00 moment

This is what “no monitoring” feels like:

TEXT
03:12 — Support: "Agent emailed the wrong customer. Please stop it."

You grep logs and find… nothing you can join.

TEXT
2026-02-07T03:11:58Z INFO sent email to customer@example.com
2026-02-07T03:11:59Z INFO sent email to customer@example.com
2026-02-07T03:12:01Z WARN http.get 429
2026-02-07T03:12:03Z INFO Agent completed task

No run_id. No step trace. No stop reason. No tool args hash. No model/tool version.

So you do the worst kind of debugging: grep by an email address and pray it’s unique.


Why this fails in production

Failure analysis

1) Agents are distributed systems with extra steps

Once an agent calls tools, you’ve built:

  • multiple dependencies (HTTP, DB, APIs)
  • multiple failure modes (timeouts, 502s, rate limits)
  • multiple retries (and retry storms)

If you don’t log each step, you’re debugging by storytelling.

2) “Success rate” hides the interesting failures

A run can end “successfully” while everything underneath degrades. Drift shows up as:

  • higher tool calls per run (looping, not failing)
  • higher tokens per request (the model “explaining” errors)
  • longer latency (retries, slow tools)
  • different stop reasons (budgets, denials, timeouts)
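All four signals can be derived per run from the same event log. A minimal aggregation sketch (field names assume the unified event schema shown later on this page):

```python
def run_metrics(events: list[dict]) -> dict:
    """Collapse one run's events into the drift signals worth graphing."""
    tools = [e for e in events if e.get("kind") == "tool_result"]
    stops = [e for e in events if e.get("kind") == "stop"]
    return {
        "tool_calls": len(tools),
        "tool_errors": sum(1 for e in tools if e.get("status") == "error"),
        "tool_time_ms": sum(e.get("duration_ms", 0) for e in tools),
        # a run that never logged a stop event is itself a red flag
        "stop_reason": stops[-1]["stop_reason"] if stops else "missing",
    }


events = [
    {"kind": "tool_result", "status": "ok", "duration_ms": 120},
    {"kind": "tool_result", "status": "error", "duration_ms": 900},
    {"kind": "stop", "stop_reason": "tool_timeout"},
]
print(run_metrics(events))
# {'tool_calls': 2, 'tool_errors': 1, 'tool_time_ms': 1020, 'stop_reason': 'tool_timeout'}
```

Tokens and spend come out the same way, from the stop event's usage payload.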

3) You can’t fix what you can’t replay

If you can’t replay or even reconstruct a run from logs, you can’t trust a “fix”. You’ll just be guessing.

4) Monitoring is part of governance

Budgets, allowlists, and kill switches are useless if you can’t see when they triggered.


Hard invariants (non-negotiables)

  • Every run has a run_id.
  • Every step has a step_id.
  • Every tool call logs: tool name, args hash, duration, status, error class.
  • Every run ends with a stop event: stop_reason.
  • If you can’t replay (even partially), you can’t trust a fix.

Implementation example (real code)

The common failure here is having two different log formats:

  • tool events are structured
  • stop events are “special”

That kills joinability.

This sample uses one unified event schema for tool calls and stop events.

PYTHON
from __future__ import annotations

from dataclasses import dataclass, asdict
import hashlib
import json
import time
from typing import Any, Literal


EventKind = Literal["tool_result", "stop"]


def sha(obj: Any) -> str:
  raw = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
  return hashlib.sha256(raw).hexdigest()[:24]


@dataclass(frozen=True)
class Event:
  run_id: str
  kind: EventKind
  ts_ms: int

  # optional fields
  step_id: int | None = None
  tool: str | None = None
  args_sha: str | None = None
  duration_ms: int | None = None
  status: Literal["ok", "error"] | None = None
  error: str | None = None

  stop_reason: str | None = None
  usage: dict[str, Any] | None = None


def log_event(ev: Event) -> None:
  print(json.dumps(asdict(ev), ensure_ascii=False))


def call_tool(run_id: str, step_id: int, tool: str, args: dict[str, Any]) -> Any:
  started = time.time()
  try:
      out = tool_impl(tool, args=args)  # (pseudo)
      dur = int((time.time() - started) * 1000)
      log_event(
          Event(
              run_id=run_id,
              kind="tool_result",
              ts_ms=int(time.time() * 1000),
              step_id=step_id,
              tool=tool,
              args_sha=sha(args),
              duration_ms=dur,
              status="ok",
              error=None,
          )
      )
      return out
  except Exception as e:
      dur = int((time.time() - started) * 1000)
      log_event(
          Event(
              run_id=run_id,
              kind="tool_result",
              ts_ms=int(time.time() * 1000),
              step_id=step_id,
              tool=tool,
              args_sha=sha(args),
              duration_ms=dur,
              status="error",
              error=type(e).__name__,
          )
      )
      raise


def stop(run_id: str, *, reason: str, usage: dict[str, Any]) -> dict[str, Any]:
  log_event(
      Event(
          run_id=run_id,
          kind="stop",
          ts_ms=int(time.time() * 1000),
          stop_reason=reason,
          usage=usage,
      )
  )
  return {"status": "stopped", "stop_reason": reason, "usage": usage}
JAVASCRIPT
import crypto from "node:crypto";

// Canonicalize recursively so key order never changes the hash.
// (A JSON.stringify replacer array is not enough: it filters keys at
// every level, so nested keys get silently dropped from the hash.)
function canonical(value) {
  if (Array.isArray(value)) return value.map(canonical);
  if (value && typeof value === "object") {
    return Object.keys(value)
      .sort()
      .reduce((acc, key) => {
        acc[key] = canonical(value[key]);
        return acc;
      }, {});
  }
  return value;
}

export function sha(obj) {
  const raw = JSON.stringify(canonical(obj));
  return crypto.createHash("sha256").update(raw, "utf8").digest("hex").slice(0, 24);
}

export function logEvent(ev) {
  console.log(JSON.stringify(ev));
}

export async function callTool(runId, stepId, tool, args) {
  const started = Date.now();
  try {
    const out = await toolImpl(tool, { args }); // (pseudo)
    logEvent({
      run_id: runId,
      kind: "tool_result",
      ts_ms: Date.now(),
      step_id: stepId,
      tool,
      args_sha: sha(args),
      duration_ms: Date.now() - started,
      status: "ok",
      error: null,
    });
    return out;
  } catch (e) {
    logEvent({
      run_id: runId,
      kind: "tool_result",
      ts_ms: Date.now(),
      step_id: stepId,
      tool,
      args_sha: sha(args),
      duration_ms: Date.now() - started,
      status: "error",
      error: e?.name || "Error",
    });
    throw e;
  }
}

export function stop(runId, { reason, usage }) {
  logEvent({
    run_id: runId,
    kind: "stop",
    ts_ms: Date.now(),
    stop_reason: reason,
    usage,
  });
  return { status: "stopped", stop_reason: reason, usage };
}
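Because tool results and stop events share one schema, reconstructing a run is just a filter and sort on run_id. A standalone sketch (the log lines are illustrative):

```python
import json


def run_timeline(log_lines: list[str], run_id: str) -> list[str]:
    """Rebuild one run's step-by-step story from unified JSON event lines."""
    events = [json.loads(line) for line in log_lines]
    mine = sorted(
        (e for e in events if e.get("run_id") == run_id),
        key=lambda e: e["ts_ms"],
    )
    out = []
    for e in mine:
        if e["kind"] == "tool_result":
            out.append(f'step {e["step_id"]}: {e["tool"]} -> {e["status"]} ({e["duration_ms"]}ms)')
        else:
            out.append(f'stop: {e["stop_reason"]}')
    return out


logs = [
    '{"run_id": "r1", "kind": "tool_result", "ts_ms": 1, "step_id": 1, "tool": "http.get", "status": "ok", "duration_ms": 120}',
    '{"run_id": "r2", "kind": "stop", "ts_ms": 2, "stop_reason": "done"}',
    '{"run_id": "r1", "kind": "stop", "ts_ms": 3, "stop_reason": "budget_exceeded"}',
]
print(run_timeline(logs, "r1"))
# ['step 1: http.get -> ok (120ms)', 'stop: budget_exceeded']
```

This is the exact query you want at 03:00, and it only works if both event kinds land in the same stream.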

Example failure case (concrete)


🚨 Incident: “Everything is slow” (and we didn’t know why)

Date: 2024-10-08
Duration: 3 days unnoticed, ~2 hours to debug once we added visibility
System: Customer support agent


What actually happened

The http.get tool started returning intermittent 429s/503s.

Our tool layer retried up to 8× per call (previously 2×) without jitter. The agent interpreted those failures as “try a different query” and ended up doing more tool calls per run.

Over 3 days (illustrative numbers, but this pattern is common):

  • avg tool calls/run: 4.3 → 11.7
  • p95 latency: 2.1s → 8.4s
  • spend/run: ~2×

Nothing “crashed”. Success rate stayed ~91%, so the drift looked like “users are impatient” until support escalated.


Root cause (the boring version)

  • retries + no jitter → thundering herd
  • no stop reasons in logs → “success” masked drift
  • no tool-call trace → we couldn’t prove where time/spend went
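The retry part of the root cause has a standard fix: cap attempts low and add full jitter so synchronized clients back off out of step. A minimal sketch (not the incident's actual tool layer):

```python
import random
import time


def backoff_delays(max_retries: int = 2, base_s: float = 0.5, cap_s: float = 8.0) -> list[float]:
    """Full-jitter delays: each wait is uniform in [0, min(cap, base * 2**attempt)]."""
    return [random.uniform(0, min(cap_s, base_s * (2 ** attempt))) for attempt in range(max_retries)]


def call_with_retry(fn, *, max_retries: int = 2):
    """Attempt fn at most max_retries + 1 times, sleeping a jittered delay between tries."""
    for delay in backoff_delays(max_retries) + [None]:
        try:
            return fn()
        except Exception:
            if delay is None:
                raise  # out of retries: surface the error instead of masking drift
            time.sleep(delay)
```

Note the retry cap is 2, not 8: past a couple of attempts you want the failure logged and the stop reason surfaced, not more load on a struggling upstream.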

Fix

  1. Structured event logs (run_id, step_id, tool, args hash, duration, status)
  2. Stop reasons surfaced to the caller/UI
  3. Dashboards + alerts on drift signals (tool calls/run, latency P95, stop reasons)

Dashboards + alerts (examples you can steal)

You don’t need perfect observability. You need useful observability.

PromQL examples (Grafana)

PROMQL
# Tool calls per run (p95)
histogram_quantile(0.95, sum(rate(agent_tool_calls_bucket[5m])) by (le))

# Stop reasons over time
sum(rate(agent_stop_total[10m])) by (stop_reason)

# Latency p95
histogram_quantile(0.95, sum(rate(agent_run_latency_ms_bucket[5m])) by (le))

SQL example (Postgres/BigQuery-style)

SQL
-- Alert: tool_calls/run spike vs baseline
SELECT
  date_trunc('hour', created_at) AS hour,
  avg(tool_calls) AS avg_tool_calls
FROM agent_runs
WHERE created_at > now() - interval '7 days'
GROUP BY 1
HAVING avg(tool_calls) > 2 * (
  SELECT avg(tool_calls)
  FROM agent_runs
  WHERE created_at BETWEEN now() - interval '14 days' AND now() - interval '7 days'
);

Alert rules (plain English)

  • If tool_calls_per_run_p95 is 2× baseline for 10 minutes → investigate (and consider killing writes).
  • If stop_reason=loop_detected appears above baseline → investigate (tool spam / bad prompt / outage).
  • If stop_reason=tool_timeout spikes → you have upstream issues, not “model weirdness”.
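The same rules can run as code. A hypothetical evaluator that compares a current window against a baseline (the example numbers reuse the incident's figures):

```python
def drift_alerts(current: dict[str, float], baseline: dict[str, float], ratio: float = 2.0) -> list[str]:
    """Names of metrics whose current value is at least `ratio` times their baseline."""
    return [
        name
        for name, value in current.items()
        if baseline.get(name, 0) > 0 and value >= ratio * baseline[name]
    ]


# Figures from the incident above: tool calls and latency drifted, loop stops did not.
current = {"tool_calls_per_run_p95": 11.7, "latency_p95_s": 8.4, "loop_detected_rate": 0.01}
baseline = {"tool_calls_per_run_p95": 4.3, "latency_p95_s": 2.1, "loop_detected_rate": 0.02}
print(drift_alerts(current, baseline))
# ['tool_calls_per_run_p95', 'latency_p95_s']
```

Hold the condition for a window (e.g. 10 minutes) before paging, so a single slow run doesn't wake anyone up.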

Trade-offs

  • Logging costs money (storage, indexing). Still cheaper than blind incidents.
  • You must avoid logging raw PII/secrets. Hash args and redact aggressively.
  • Replay requires retention policy + access controls.
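Redact-then-hash is easiest to enforce as a single chokepoint in front of the logger. A minimal sketch; the sensitive key list is illustrative and should be tuned to your tools:

```python
import hashlib
import json

SENSITIVE_KEYS = {"email", "token", "password", "api_key", "ssn"}  # illustrative


def redact(obj):
    """Recursively replace values of sensitive keys before anything is logged."""
    if isinstance(obj, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj


def args_sha(args: dict) -> str:
    """Hash the redacted args: joinable across runs, useless to an attacker."""
    raw = json.dumps(redact(args), sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:24]


print(redact({"email": "a@b.c", "query": "refund status"}))
# {'email': '[REDACTED]', 'query': 'refund status'}
```

One trade-off to note: hashing after redaction means two calls that differ only in redacted fields share a hash; salt and hash the raw value separately if you need to tell them apart.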

When NOT to use

Don’t
  • Don’t build a heavy tracing platform before you have structured logs. Start small.
  • Don’t log raw tool args if they contain PII/secrets. Ever.
  • Don’t ship agents without stop reasons. You’re creating retry loops.

Copy-paste checklist

Production checklist
  • [ ] run_id / step_id for every run
  • [ ] Unified event schema (tool results + stop events)
  • [ ] Tool-call logs: tool, args_hash, duration, status, error class
  • [ ] Stop reason returned to user + logged
  • [ ] Tokens/tool calls/spend per run metrics
  • [ ] Dashboards: latency P95, tool_calls/run, stop_reason distribution
  • [ ] Replay data: snapshot hashes (with retention + access control)

Safe default config

YAML
logging:
  events:
    enabled: true
    schema: "unified"
    store_args: false
    store_args_hash: true
    include: ["run_id", "step_id", "tool", "duration_ms", "status", "error", "stop_reason"]
metrics:
  track: ["tokens_per_request", "tool_calls_per_run", "latency_p95", "spend_per_run", "stop_reason"]
retention:
  tool_snapshot_days: 14
  logs_days: 30

FAQ

What's the minimum monitoring we need?
Tool call logs + stop reasons + basic usage metrics. If you can’t answer “what did it do?”, you can’t run it.
Can we log raw tool args?
Usually no. Hash args, redact aggressively, and store raw only in tightly controlled systems if you must.
Do we need distributed tracing?
Eventually. Start with structured logs that include run_id, step_id, and durations. That gets you most of the value.
How do we monitor drift?
Watch tokens, tool calls, latency, and stop reasons. They move before correctness complaints.


Production takeaway


What breaks without this

  • ❌ You can’t explain incidents
  • ❌ Drift looks like “model weirdness”
  • ❌ Cost overruns show up after the fact

What works with this

  • ✅ You can join, replay, and debug runs
  • ✅ Drift becomes a graph, not a debate
  • ✅ Kill switches trigger based on real signals

Minimum to ship

  1. Unified structured logs
  2. Stop reasons
  3. Basic metrics + dashboards
  4. Alerts on drift

Not sure this is your use case? See: Design your agent ->

⏱️ 9 min read · Updated Mar 2026 · Difficulty: ★★★
Implement in OnceOnly

Safe defaults for tool permissions + write gating.

YAML
# onceonly guardrails (concept)
version: 1
tools:
  default_mode: read_only
  allowlist:
    - search.read
    - kb.read
    - http.get
writes:
  enabled: false
  require_approval: true
  idempotency: true
controls:
  kill_switch: { enabled: true, mode: disable_writes }
audit:
  enabled: true

Ship this pattern with governance:

  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability

OnceOnly is a control layer for production agent systems.
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.