Idea in 30 seconds
Rollback strategies are a runtime mechanism for quickly returning traffic to a stable version when a new release degrades metrics.
When you need it: when an agent is released via canary/rollout and any production regression must be stopped without long downtime.
Problem
Without rollback, the team can see the problem but cannot quickly remove its impact. While analysis is ongoing, traffic keeps flowing to the degraded version and the incident grows.
Typical scenario:
- `error_rate` or `latency_p95` grows
- users keep landing on the problematic version
- team performs manual actions under time pressure
Analogy: this is like driving without an emergency brake. When the system is already skidding, slow reaction costs much more than the actual fix.
And every minute without rollback adds new errors, cost, and trust loss.
Solution
The solution is to make rollback a separate policy layer in the runtime release flow. The policy checks degradation signals and decides whether to continue the rollout or switch traffic back to the stable version.
The rollback policy layer returns a technical decision: allow, or stop with a reason such as `rollback_required`, `sla_breach`, or `error_spike`.
On stop, the system executes a controlled traffic switch to the active stable version and records the event in the audit log.
This is a dedicated emergency control, not manual improvisation during an incident.
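As a minimal sketch of such a policy check (hypothetical names, not tied to any specific framework), the decision can be a pure comparison of current metrics against configured thresholds:

```python
# Hypothetical sketch of a rollback policy check; metric and
# threshold names are illustrative.

def check_rollback(metrics, thresholds):
    """Return ("stop", reason) if any threshold is breached, else ("allow", None)."""
    for name, limit in thresholds.items():
        if metrics.get(name, 0.0) > limit:
            # The first breached metric becomes the rollback reason.
            return ("stop", f"rollback_required:{name}")
    return ("allow", None)

decision = check_rollback(
    metrics={"error_rate": 0.09, "latency_p95_ms": 1200},
    thresholds={"error_rate": 0.05, "latency_p95_ms": 1800},
)
print(decision)  # ('stop', 'rollback_required:error_rate')
```

Keeping the check pure (metrics in, decision out) makes it easy to test and to audit: the same inputs always yield the same decision.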
Rollback vs. kill switch
These are different tools:
- Rollback returns to the previous stable version.
- Kill switch stops actions or traffic without changing version.
One without the other is not enough:
- without rollback, restoring normal operation after release regression is hard
- without kill switch, it is hard to quickly suppress risky actions until rollback completes
Example:
- rollback: `2.4.0 -> 2.3.3` after `error_spike`
- kill switch: temporary `writes_disabled=true` while the stable version is restored
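The distinction can be sketched as two independent controls acting on shared release state (hypothetical names and state shape, for illustration only):

```python
# Hypothetical sketch: rollback and kill switch as separate controls.

def rollback(state):
    # Returns traffic to the stable version; does not touch action flags.
    state["active_version"] = state["stable_version"]
    return state

def kill_switch(state):
    # Suppresses risky actions; does not change the active version.
    state["writes_disabled"] = True
    return state

state = {"active_version": "2.4.0", "stable_version": "2.3.3", "writes_disabled": False}
kill_switch(state)   # first suppress risky writes
rollback(state)      # then restore the stable version
print(state["active_version"], state["writes_disabled"])  # 2.3.3 True
```

In an incident the two are typically applied in exactly this order: suppress damage first, then restore the stable version.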
Rollback-control components
These components work together during each rollout.
| Component | What it controls | Key mechanics | Why |
|---|---|---|---|
| Rollback triggers | When rollback is needed | `error_rate` / `latency_p95` SLO thresholds | Gives clear, non-manual trigger criteria |
| Traffic switch | Traffic switching | `from_version -> to_version` stable fallback | Quickly reduces degradation impact |
| Rollout gate | Further candidate rollout | gate lock, rollback window | Prevents sending traffic back to a broken version |
| Recovery verification | Whether the system recovered after rollback | post-rollback checks, stability window | Confirms rollback actually solved the problem |
| Rollback observability | Transparency of emergency actions | audit logs, alerts on rollback events | Does not execute rollback directly, but gives the full decision chain |
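The recovery-verification component can be sketched as a stability window: rollback counts as successful only after several consecutive healthy metric samples (hypothetical names; window size and limits are illustrative):

```python
# Hypothetical sketch of post-rollback recovery verification using
# a stability window of consecutive healthy samples.

def recovered(samples, error_limit, window):
    """True if the last `window` samples are all at or under the error limit."""
    if len(samples) < window:
        return False  # not enough data yet to declare recovery
    return all(s <= error_limit for s in samples[-window:])

post_rollback = [0.08, 0.04, 0.02, 0.01, 0.01]
print(recovered(post_rollback, error_limit=0.05, window=3))  # True
```

A window, rather than a single sample, avoids declaring recovery on one lucky reading right after the switch.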
Example alert:
Slack: `Rollback triggered: support-agent@2.4.0 -> 2.3.3, reason=error_spike, stage=canary`
How it looks in architecture
Rollback policy layer sits between release runtime and traffic, and blocks degraded rollout before it scales.
Every decision (allow or stop) is recorded in audit log.
Each rollout stage passes this flow before traffic expansion: runtime does not scale candidate directly, it first asks policy layer for a decision.
Flow summary:
- Monitoring emits a degradation signal
- Policy checks `error_rate`, `latency_p95`, `tool_failures`, `rollback_plan`
- `allow` -> rollout continues
- `stop` -> traffic switches to the active stable version
- Both decisions are written to the audit log
Example
After the release of `support-agent@2.4.0`, `tool_failure_rate` grows in canary.
The rollback policy returns `stop` (reason=`rollback_required`).
Result:
- traffic returns to `2.3.3`
- candidate is locked for further expansion
- team investigates root cause without active incident pressure
Rollback reduces incident damage during the incident, not after it scales.
In code it looks like this
The scheme below is simplified but shows the main flow. Critical point: rollback must be idempotent and fast, so repeated signals do not break the traffic switch.
Example rollback config:
```yaml
rollback:
  stable_version: support-agent@2.3.3
  candidate_version: support-agent@2.4.0
  triggers:
    error_rate_p95: 0.05
    latency_p95_ms: 1800
    tool_failure_rate: 0.03
  lock_candidate_after_rollback: true
```
```python
# Simplified scheme: monitor, rollback_policy, traffic, rollout, audit,
# and alerts are assumed runtime components, not a specific library.
def check_rollout(run_id):
    release_cfg = load_release_config("support-agent")
    signal = monitor.read(version_id=release_cfg.candidate_version)
    decision = rollback_policy.check(signal, release_cfg)

    if decision.outcome == "stop":
        # Switch traffic back to the stable version.
        switch_result = traffic.switch(
            from_version=release_cfg.candidate_version,
            to_version=release_cfg.stable_version,
        )
        # Lock the candidate so it cannot receive traffic again.
        if release_cfg.lock_candidate_after_rollback:
            rollout.lock(version_id=release_cfg.candidate_version)
        # Record the full decision chain in the audit log.
        audit.log(
            run_id,
            decision=decision.outcome,
            reason=decision.reason,
            from_version=release_cfg.candidate_version,
            to_version=release_cfg.stable_version,
            switch_status=switch_result.status,
        )
        alerts.notify_if_needed(release_cfg.candidate_version, decision.reason)
        return stop(
            decision.reason,
            from_version=release_cfg.candidate_version,
            to_version=release_cfg.stable_version,
        )

    allow_decision = Decision.allow(reason=None)  # standard allow outcome/reason model
    audit.log(
        run_id,
        decision=allow_decision.outcome,
        reason=allow_decision.reason,
        version_id=release_cfg.candidate_version,
        stage="canary",
    )
    return continue_rollout()
```
How it looks during execution
Scenario 1: rollback_required in canary
- Candidate version receives 5% traffic.
- Metrics breach thresholds (`error_rate` and `tool_failure_rate`).
- Policy returns `stop` (reason=`rollback_required`).
- Traffic returns to the stable version.
- Candidate is locked until incident review is done.
Scenario 2: false alarm, rollback not needed
- Monitoring emits a short spike, but thresholds are not breached.
- Policy returns `allow`.
- Rollout continues on the current stage.
- Event is written to audit log.
- System remains stable without unnecessary rollback.
Scenario 3: repeated signal after rollback
- After rollback, another alert arrives with the same signal.
- Idempotent logic does not execute repeated switch.
- Policy returns technical status without duplicate actions.
- Logs show rollback was already applied.
- Team works on root cause without additional noise.
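The idempotency in this scenario can be sketched as a traffic switch that checks the current active version before acting (hypothetical names, illustrative state shape):

```python
# Hypothetical sketch of an idempotent traffic switch: a repeated
# rollback signal is a no-op if the target version is already active.

def switch_traffic(state, to_version):
    if state["active_version"] == to_version:
        return "already_applied"  # repeated signal: no duplicate switch
    state["active_version"] = to_version
    return "switched"

state = {"active_version": "2.4.0"}
print(switch_traffic(state, "2.3.3"))  # switched
print(switch_traffic(state, "2.3.3"))  # already_applied
```

Returning a distinct status for the repeated call lets the audit log show that rollback was already applied, without triggering duplicate actions.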
Common mistakes
- rollback is only manual, without policy triggers
- no stable version exists for quick return
- candidate is not locked after rollback
- rollback switches traffic but does not verify recovery metrics
- no idempotency for repeated rollback signals
- audit log does not include from/to version and reason
Result: rollback appears to exist, but during incident it is slow and unpredictable.
Self-check
Quick rollback-strategy check before production launch:
Before production, you need at least access control, limits, audit logs, and an emergency stop.
FAQ
Q: When to trigger rollback automatically, and when manually?
A: For clear SLO thresholds, automatic is better. For ambiguous cases (for example, a risky business operation), use human approval on top of the policy.
Q: Do rollback and kill switch duplicate each other?
A: No. Rollback restores stable version, kill switch quickly limits actions. In production they should work together.
Q: What to do after rollback?
A: Capture incident snapshot (version, reason, metrics), lock candidate, and restart rollout only after fix and verification.
Q: Is rollback needed if canary exists?
A: Yes. Canary only reduces blast radius, rollback is needed to quickly restore stable state.
Q: Which fields must be logged?
A: reason, from_version, to_version, stage, switch_status, timestamp, actor (if rollback is manual).
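A sketch of an audit record with the fields listed above (field names follow the answer; the values are illustrative):

```python
# Hypothetical rollback audit record; values are illustrative.
REQUIRED_FIELDS = {"reason", "from_version", "to_version",
                   "stage", "switch_status", "timestamp"}

record = {
    "reason": "error_spike",
    "from_version": "support-agent@2.4.0",
    "to_version": "support-agent@2.3.3",
    "stage": "canary",
    "switch_status": "completed",
    "timestamp": "2024-01-01T00:00:00Z",
    "actor": "policy",  # or a user id if rollback was manual
}

missing = REQUIRED_FIELDS - record.keys()
print(sorted(missing))  # []
```

Validating records against a required-field set at write time keeps the audit log reconstructable during a post-incident review.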
Where Rollback fits in the system
Rollback strategies are one of the Agent Governance layers. Together with versioning, limits, budgets, approval, and audit, they form a system of safe production changes.
Related pages
Next on this topic:
- Agent Governance Overview: the overall model of agent control in production.
- Agent Versioning: how to control prompt/tools/policy changes before rollback.
- Kill switch: how to instantly limit actions during an incident.
- Rate limiting for agents: how to contain spikes during degradation.
- Audit logs for agents: how to reconstruct rollback decision chains.