Idea in 30 seconds
Audit logs are a centralized runtime journal of agent decisions: what happened, why it happened, and who initiated it.
When you need it:
when an agent works with tools, approval, limits, and any incident must be analyzed by facts, not assumptions.
Problem
Without audit logs, the team sees symptoms but not the decision chain. In demos this is barely noticeable. In production, every incident turns into manual guesswork.
Typical outcomes:
- unclear why there was a deny or stop
- impossible to reconstruct which exact step produced side effects (state changes)
- hard to explain to customers who changed policy or activated a control and when
Analogy: this is like investigating a crash without camera footage. You see the outcome, but there is no verifiable sequence of events.
And every minute without a quality audit trail extends the incident and increases recovery time.
Solution
The solution is to add a centralized audit layer in runtime that logs both policy decisions and action execution facts.
Each agent step logs a standardized event: decision, reason, action, scope, actor, timestamp.
Runtime needs one decision model:
- allow
- stop
- approval_required
It is also important to log not only blocking, but successful execution too. Otherwise incident analysis shows why something was blocked, but not what was actually executed.
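The standardized event and the single decision model can be sketched as a small schema. This is a minimal illustration, not a prescribed format: the class names `Decision` and `AuditEvent`, the helper `now_iso`, and the sample values for scope and actor are assumptions made for the example.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from enum import Enum


class Decision(str, Enum):
    """The one decision model shared by every policy check."""
    ALLOW = "allow"
    STOP = "stop"
    APPROVAL_REQUIRED = "approval_required"


@dataclass(frozen=True)
class AuditEvent:
    """One standardized audit record per agent step."""
    decision: Decision
    reason: str
    action: str
    scope: str      # assumed values: "user", "tenant", "global"
    actor: str      # who initiated the step
    timestamp: str  # ISO 8601, UTC

    def to_record(self) -> dict:
        """Flatten to a plain dict that is safe to serialize."""
        record = asdict(self)
        record["decision"] = self.decision.value
        return record


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


event = AuditEvent(
    decision=Decision.STOP,
    reason="rate_limited_tenant",
    action="crm.search",
    scope="tenant",
    actor="support-agent",  # hypothetical actor id
    timestamp=now_iso(),
)
record = event.to_record()
```

Using an enum for the decision means an unknown outcome fails at construction time instead of silently entering the audit trail.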
Audit logs ≠ debug logs
These solve different tasks:
- Audit logs are a structured and reproducible journal of decisions and actions.
- Debug logs are technical details for local diagnostics.
One without the other is insufficient:
- without audit logs, there is no legally and operationally reliable history of decisions
- without debug logs, local implementation debugging is hard
Example:
- audit: decision=stop, reason=rate_limited_tenant, tenant_id=t_42, action=crm.search
- debug: stack trace, internal retry attempts, latency of individual dependencies
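One way to keep the two streams apart is to write audit events as structured JSON lines to their own sink and leave free-form diagnostics to a regular logger. The sketch below assumes a hypothetical `write_audit` helper and uses an in-memory stream as a stand-in for the real sink; the debug message is invented for illustration.

```python
import io
import json
import logging

# Debug channel: free-form text for local diagnostics.
debug_log = logging.getLogger("agent.debug")


def write_audit(stream, **fields) -> str:
    """Write one audit event as a single JSON line with stable key order."""
    line = json.dumps(fields, sort_keys=True)
    stream.write(line + "\n")
    return line


audit_stream = io.StringIO()  # stand-in for a real audit sink

line = write_audit(
    audit_stream,
    decision="stop",
    reason="rate_limited_tenant",
    tenant_id="t_42",
    action="crm.search",
)

# Technical detail goes to the debug logger, never into the audit record.
debug_log.debug("crm.search retry %d failed after %.0f ms", 2, 180.0)
```

The audit line stays machine-searchable by field, while the debug line can change shape freely without breaking incident queries.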
Audit-control components
These components work together at every agent step.
| Component | What it controls | Key mechanics | Why |
|---|---|---|---|
| Event identity | Event uniqueness | run_id + step_id; event timestamp | Allows full sequence reconstruction without gaps |
| Decision context | Reason for the policy decision | decision / reason; policy layer name | Explains why an action was executed or stopped |
| Action context | What exactly the agent did | action + action_key; scope (user/tenant/global) | Creates linkage between policy and real action |
| Data safety | Sensitive-data leak risk | args hash; redaction policy | Preserves audit value without raw secrets and PII |
| Immutable storage | Audit integrity | append-only sink; retention + access control | Protects log from silent editing after incidents |
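The data-safety row can be illustrated with two small helpers: one masks sensitive fields, the other hashes the full argument set so that identical calls can still be matched. This is a sketch under assumptions: the function names `redact` and `hash_args` and the `[REDACTED]` marker are illustrative, and the field list mirrors the sample config in this article.

```python
import hashlib
import json

# Assumed list of sensitive keys, matching the example audit config.
REDACT_FIELDS = {"email", "phone", "card_number"}


def redact(args: dict, fields=REDACT_FIELDS) -> dict:
    """Return a copy of args with sensitive top-level values masked."""
    return {k: ("[REDACTED]" if k in fields else v) for k, v in args.items()}


def hash_args(args: dict) -> str:
    """SHA-256 over canonical JSON, so equal args always give equal hashes."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


args = {"email": "a@example.com", "order_id": "o_17", "amount": 25}
safe = redact(args)       # can be stored next to the event
digest = hash_args(args)  # lets you match identical calls without raw values
```

A plain hash of low-entropy values (a phone number, for example) can be guessed by brute force, so a keyed hash such as HMAC may be a better fit when that matters.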
Example alert:
Slack: Support-Agent decision=stop, reason=approval_required, tenant=t_42, run_id=run_981.
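An alert line like this can be derived directly from the audit event. The sketch below is illustrative only: `format_alert` and `ALERT_DECISIONS` are hypothetical names, and the rule that allow events do not raise alerts is an assumption, not something the source prescribes.

```python
# Assumption: only blocking outcomes are worth an alert.
ALERT_DECISIONS = {"stop", "approval_required"}


def format_alert(agent: str, event: dict):
    """Build a one-line alert from an audit event, or None if no alert is due."""
    if event["decision"] not in ALERT_DECISIONS:
        return None
    return (
        f"{agent} decision={event['decision']}, reason={event['reason']}, "
        f"tenant={event['tenant_id']}, run_id={event['run_id']}"
    )


event = {
    "decision": "stop",
    "reason": "approval_required",
    "tenant_id": "t_42",
    "run_id": "run_981",
}
message = format_alert("Support-Agent", event)
```

Delivery to Slack or any other channel is left out on purpose; the point is that the alert text carries the same fields the audit trail is searched by.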
How it looks in architecture
The audit layer sits in the runtime loop and records decisions before and after each agent action executes.
Each outcome (allow, stop, approval_required) is written to the centralized audit trail.
Here, the policy layer is a logical runtime layer, not a separate service.
Each step passes through this flow before execution: the runtime does not execute an action until the policy returns a decision and the event is captured in audit.
Flow summary:
- Runtime forms next agent action
- Policy returns allow, stop, or approval_required
- Runtime logs pre-event with decision and reason
- if action executed, runtime logs post-event with result
- both event types are searchable for alerting and investigation
Example
Support agent receives a refund.create request.
Policy returns approval_required.
Result:
- execution does not start without approval
- audit contains decision=approval_required, actor, scope, action_key
- after approval, audit contains separate decision=allow event and execution result
Audit logs shorten incident investigation: the team starts from individual runtime steps instead of first collecting artifacts by hand.
In code it looks like this
The simplified scheme above shows the main flow. Critical point: audit events must be structured and schema-consistent, otherwise incident search breaks.
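Schema consistency can be enforced at write time with a small validator. The sketch below checks the minimum field set listed in the FAQ and the three allowed decisions; the function name `validate_event` and the wording of the problem messages are assumptions.

```python
# Minimum field set from the FAQ section of this article.
REQUIRED_FIELDS = (
    "run_id", "step_id", "decision", "reason",
    "action", "action_key", "scope", "timestamp",
)
ALLOWED_DECISIONS = {"allow", "stop", "approval_required"}


def validate_event(event: dict) -> list:
    """Return a list of schema problems; an empty list means the event is valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    if "decision" in event and event["decision"] not in ALLOWED_DECISIONS:
        problems.append(f"unknown decision: {event['decision']}")
    return problems


good = {
    "run_id": "run_981", "step_id": 3, "decision": "stop",
    "reason": "rate_limited_tenant", "action": "crm.search",
    "action_key": "crm.search:abc", "scope": "tenant",
    "timestamp": "2024-01-01T00:00:00+00:00",
}
bad = {"run_id": "run_981", "decision": "deny"}

good_problems = validate_event(good)
bad_problems = validate_event(bad)
```

Rejecting or quarantining events that fail this check keeps later searches by run_id, reason, or decision reliable.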
Example audit config:
audit:
  sink: append_only
  retention_days: 180
  redact_fields: ["email", "phone", "card_number"]
  hash_args: true
  sign_events: true
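The config does not say how sign_events works. One plausible reading is a keyed signature over each event, sketched here with HMAC-SHA256 from the Python standard library. The function names and the way the key is supplied are assumptions; in practice the key would come from a secret store, not from source code.

```python
import hashlib
import hmac
import json


def sign_event(event: dict, key: bytes) -> str:
    """HMAC-SHA256 over the canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_event(event: dict, signature: str, key: bytes) -> bool:
    """Constant-time check that the event still matches its signature."""
    return hmac.compare_digest(sign_event(event, key), signature)


key = b"demo-key-for-illustration-only"  # hypothetical key, never hardcode in real code
event = {"run_id": "run_981", "step_id": 1, "decision": "stop", "reason": "rate_limited_tenant"}
signature = sign_event(event, key)

tampered = {**event, "decision": "allow"}
```

Verifying the original event succeeds, while the tampered copy fails, which is what makes silent edits detectable after the fact.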
Runtime step with pre- and post-execution audit events:
action = planner.next(state)
action_key = make_action_key(action.name, action.args)
decision = policy.evaluate(action, state.user_context)

# Fields shared by the pre- and post-execution events
base_event = {
    "run_id": run_id,
    "step_id": state.step,
    "tenant_id": state.tenant_id,
    "action": action.name,
    "action_key": action_key,
    "timestamp": clock.iso(),
}

# Pre-event: record the decision before anything executes
audit.log(
    **base_event,
    phase="pre_exec",
    decision=decision.outcome,
    reason=decision.reason,
    args_hash=hash_args(action.args),
)

if decision.outcome == "approval_required":
    # approval resume flow is logged as a separate runtime step:
    # approval_required -> approval_granted -> allow -> result
    return stop("approval_required")

if decision.outcome == "stop":
    return stop(decision.reason)

result = executor.execute(action)

# Post-event: record what was actually executed and its result
audit.log(
    **base_event,
    phase="post_exec",
    decision=decision.outcome,
    reason=decision.reason,
    result=result.status,
)
return result
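The snippet above depends on a planner, policy, executor, and audit sink that are not shown. For experimenting locally, a self-contained version with stand-ins might look like the sketch below. Everything in it is hypothetical: the `InMemoryAudit` sink, the `toy_policy` rule, and the reason names exist only for illustration. Unlike the snippet above, it stamps each event at write time, so pre- and post-events get separate timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Action:
    name: str
    args: dict


@dataclass
class PolicyDecision:
    outcome: str  # "allow", "stop", or "approval_required"
    reason: str


class InMemoryAudit:
    """Stand-in sink: keeps events in a list instead of real storage."""

    def __init__(self):
        self.events = []

    def log(self, **event):
        # Each event gets its own timestamp at write time.
        event["timestamp"] = datetime.now(timezone.utc).isoformat()
        self.events.append(event)


def toy_policy(action: Action) -> PolicyDecision:
    """Toy rule: refunds need approval, everything else is allowed."""
    if action.name == "refund.create":
        return PolicyDecision("approval_required", "risky_write")
    return PolicyDecision("allow", "policy_ok")


def run_step(action, audit, execute, run_id, step_id):
    decision = toy_policy(action)
    base = {"run_id": run_id, "step_id": step_id, "action": action.name}
    # Pre-event is written for every outcome, including allow.
    audit.log(**base, phase="pre_exec",
              decision=decision.outcome, reason=decision.reason)
    if decision.outcome != "allow":
        return None  # blocked: nothing executes
    result = execute(action)
    audit.log(**base, phase="post_exec",
              decision=decision.outcome, reason=decision.reason, result=result)
    return result


audit = InMemoryAudit()
executed = []


def execute(action):
    executed.append(action.name)
    return "ok"


run_step(Action("crm.search", {"q": "acme"}), audit, execute, "run_981", 1)
run_step(Action("refund.create", {"amount": 25}), audit, execute, "run_981", 2)
trail = [(e["action"], e["phase"], e["decision"]) for e in audit.events]
```

After both steps, the trail holds a pre- and post-event for the search and only a pre-event for the refund, which never reached the executor.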
How it looks during execution
Scenario 1: policy stop
- Runtime forms action crm.search.
- Policy returns stop (reason=rate_limited_tenant).
- Runtime writes pre-event to audit.
- Action is not executed.
- Team sees stop reason immediately in logs.
Scenario 2: approval_required
- Runtime forms refund.create.
- Policy returns approval_required.
- Runtime writes pre-event and stops execution.
- After human decision, a separate step starts.
- Audit shows full chain: approval_required -> allow -> result.
Scenario 3: allow + execution
- Runtime forms next action.
- Policy returns allow.
- Runtime executes action.
- Logs post-event with result.
- Journal contains both decision and execution result.
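During an investigation, the chain from these scenarios can be rebuilt from the stored events. The sketch below assumes events are plain dicts with run_id, step_id, phase, and decision fields, as in the runtime code earlier; the function name `decision_chain` and the sample events are illustrative.

```python
def decision_chain(events, run_id):
    """Return the ordered chain of decisions and results for one run."""
    phase_order = {"pre_exec": 0, "post_exec": 1}
    selected = sorted(
        (e for e in events if e["run_id"] == run_id),
        key=lambda e: (e["step_id"], phase_order[e["phase"]]),
    )
    chain = []
    for e in selected:
        if e["phase"] == "pre_exec":
            chain.append(e["decision"])
        else:
            chain.append("result=" + e["result"])
    return " -> ".join(chain)


# Events arrive out of order and mixed with another run on purpose.
events = [
    {"run_id": "run_981", "step_id": 2, "phase": "post_exec", "decision": "allow", "result": "ok"},
    {"run_id": "run_981", "step_id": 1, "phase": "pre_exec", "decision": "approval_required"},
    {"run_id": "run_981", "step_id": 2, "phase": "pre_exec", "decision": "allow"},
    {"run_id": "run_777", "step_id": 1, "phase": "pre_exec", "decision": "stop"},
]
chain = decision_chain(events, "run_981")
```

Sorting by step_id and phase is what makes the reconstruction independent of write order, which is why event identity fields matter.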
Common mistakes
- logging only stop but not logging allow
- storing raw args without redaction/hash
- no stable action_key for deduplication
- mixing audit and debug into one unstructured text stream
- not recording actor for policy changes and operator actions
- allowing audit events to be edited or deleted retroactively
Result: log exists, but during incident it does not provide a verifiable picture.
Self-check
Quick audit-logging check before production launch:
Before production, you need at least access control, limits, audit logs, and an emergency stop.
FAQ
Q: How are audit logs different from traces?
A: Trace shows technical execution path, audit log shows policy decisions and actions in terms of who/what/why. For incidents, both are usually needed.
Q: Can we log full args for convenience?
A: Better not. In production, it is safer to store hash or redacted version to avoid leaking secrets and PII.
Q: What is the minimum mandatory field set?
A: At minimum: run_id, step_id, decision, reason, action, action_key, scope, timestamp.
Q: When to write event: before or after execution?
A: Both phases are important: pre-event captures decision, post-event captures fact and result of execution.
Q: Where should audit logs be stored?
A: In centralized append-only storage with controlled access, retention, and fast search by run_id/tenant_id/reason.
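The storage requirements from the last answer can be sketched in memory: append and search only, with each entry chained to the previous one by hash so that a later edit is detectable. This is an illustration under assumptions, not a storage recommendation; the class name `AppendOnlyLog` and its methods are hypothetical, and real deployments would add access control and retention on top.

```python
import hashlib
import json

GENESIS = "0" * 64  # hash placeholder for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AppendOnlyLog:
    """In-memory sketch: events can be appended and searched, not edited."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _entry_hash(prev_hash, event)
        self._entries.append({"event": dict(event), "prev": prev_hash, "hash": entry_hash})
        return entry_hash

    def find(self, **filters):
        """Search by exact field match, for example run_id or tenant_id."""
        return [
            dict(e["event"]) for e in self._entries
            if all(e["event"].get(k) == v for k, v in filters.items())
        ]

    def verify(self) -> bool:
        """Recompute the hash chain; any edited entry breaks it."""
        prev_hash = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["event"]):
                return False
            prev_hash = entry["hash"]
        return True


log = AppendOnlyLog()
log.append({"run_id": "run_981", "tenant_id": "t_42", "decision": "stop", "reason": "rate_limited_tenant"})
log.append({"run_id": "run_982", "tenant_id": "t_7", "decision": "allow", "reason": "policy_ok"})
matches = log.find(tenant_id="t_42")
intact = log.verify()
```

If someone changes a stored event directly, `verify()` returns False from that entry onward, which gives investigators a concrete integrity check.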
Where Audit Logs fit in the system
Audit logs are the base transparency layer in Agent Governance. Together with RBAC, limits, budget controls, approval, and kill switch they provide controllable and explainable agent behavior in production.
Related pages
Next on this topic:
- Agent Governance Overview: overall model of agent control in production.
- Access Control (RBAC): how to enforce access controls before action execution.
- Human approval: how to govern risky write actions.
- Rate limiting for agents: how to contain retry storms and spikes.
- Rollback strategies: how to safely switch traffic back to stable version.