Containerizing AI Agents (So They Don’t Die at Deploy Time)

How to containerize and ship AI agents safely: runtime config, secrets, timeouts, health checks, and failure-friendly rollouts. Python + JS examples.
On this page
  1. Problem (your agent worked… until you deployed it)
  2. Why this fails in production
  3. Diagram: what you’re actually deploying
  4. Real code: a container-friendly agent entrypoint (Python + JS)
  5. A sane Dockerfile (multi-stage, no secrets baked in)
  6. Real failure (incident-style, with numbers)
  7. Trade-offs
  8. When NOT to containerize
  9. Copy-paste checklist
  10. Safe default config snippet (YAML)
  11. Implement in OnceOnly (optional)
  12. FAQ
  13. Related pages

Problem (your agent worked… until you deployed it)

The notebook agent is fine.

The deployed agent is where the pain lives:

  • it can’t reach the network you assumed it had
  • it runs out of memory because someone enabled “full trace logging”
  • it retries itself into a rate-limit storm
  • it can’t read secrets because you baked them into the image (please don’t)

Containerizing isn’t “Dockerfile theatre”. It’s where you force the agent to behave like a real service.

Why this fails in production

Agents are awkward workloads:

  • they’re bursty (traffic spikes = token spikes)
  • they do I/O (tools) and hang on timeouts
  • they have long tails (p95 is fine, p99 is chaos)

If your container doesn’t enforce budgets and timeouts at runtime, production will. It’ll just enforce them via 504s, OOMKills, and angry invoices.

Diagram: what you’re actually deploying
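The unit you ship is more than "code in a box". In rough ASCII (a sketch of the shape the sections below assume):

```
          ┌─────────────────────────────────┐
request ──►  agent container                │──► model API
          │   • config/budgets from env     │
          │   • timeouts enforced in-loop   │──► tool gateway ──► tools
          │   • GET /health for probes      │
          └─────────────────────────────────┘
                       ▲
          secrets injected by the platform at runtime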

Real code: a container-friendly agent entrypoint (Python + JS)

We keep it boring:

  • read config from env
  • enforce budgets/timeouts
  • expose a health endpoint
PYTHON
import os
import time
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Budgets:
    max_steps: int
    max_tool_calls: int
    max_seconds: int


def load_budgets() -> Budgets:
    return Budgets(
        max_steps=int(os.getenv("AGENT_MAX_STEPS", "25")),
        max_tool_calls=int(os.getenv("AGENT_MAX_TOOL_CALLS", "12")),
        max_seconds=int(os.getenv("AGENT_MAX_SECONDS", "60")),
    )


def run_request(task: str, *, budgets: Budgets) -> Dict[str, Any]:
    t0 = time.time()
    steps = 0
    tool_calls = 0

    while True:
        steps += 1
        if steps > budgets.max_steps:
            return {"output": "", "stop_reason": "max_steps"}
        if tool_calls > budgets.max_tool_calls:
            return {"output": "", "stop_reason": "max_tool_calls"}
        if time.time() - t0 > budgets.max_seconds:
            return {"output": "", "stop_reason": "max_seconds"}

        # ... agent loop ...
        return {"output": "ok", "stop_reason": "finish"}


def health() -> Dict[str, str]:
    return {"ok": "true"}
JAVASCRIPT
export function loadBudgets() {
  return {
    maxSteps: Number(process.env.AGENT_MAX_STEPS ?? 25),
    maxToolCalls: Number(process.env.AGENT_MAX_TOOL_CALLS ?? 12),
    maxSeconds: Number(process.env.AGENT_MAX_SECONDS ?? 60),
  };
}

export function runRequest(task, { budgets }) {
  const t0 = Date.now();
  let steps = 0;
  let toolCalls = 0;

  while (true) {
    steps += 1;
    if (steps > budgets.maxSteps) return { output: "", stop_reason: "max_steps" };
    if (toolCalls > budgets.maxToolCalls) return { output: "", stop_reason: "max_tool_calls" };
    if ((Date.now() - t0) / 1000 > budgets.maxSeconds) return { output: "", stop_reason: "max_seconds" };

    // ... agent loop ...
    return { output: "ok", stop_reason: "finish" };
  }
}

export function health() {
  return { ok: true };
}
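To make the health function actually reachable for probes, here's a minimal sketch in Python using only the standard library (the /health path, handler name, and port binding are assumptions, not part of the entrypoint above):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Keep the health check dependency-free: if this process can serve
        # this route, the container is alive enough for a rollout gate.
        if self.path == "/health":
            body = json.dumps({"ok": True}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # health probes shouldn't spam stdout


# Bind to port 0 so the OS picks a free port; a real container uses a fixed one.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
```

Your platform's readiness probe can then poll GET /health before routing traffic to the container.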

A sane Dockerfile (multi-stage, no secrets baked in)

DOCKERFILE
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
COPY --from=deps /app/node_modules ./node_modules
COPY . .
EXPOSE 3000
CMD ["npm","run","start"]

Key points:

  • configs come from env (budgets, tool allowlists, model selection)
  • secrets come from your platform (Vercel/K8s/Secrets Manager), not your image
  • health check exists, so rollouts can be safe

Real failure (incident-style, with numbers)

We deployed an agent service with “debug logging” turned on by default. It logged full tool results for every call.

Impact in one afternoon:

  • memory usage climbed until the container OOMKilled
  • retries amplified load (clients retried, agent retried tools)
  • ~12% request failure rate
  • on-call: ~3 hours (because the logs were huge and still not useful)

Fix:

  1. default to sampled logging + redaction (/observability-monitoring/agent-logging)
  2. cap budgets at runtime (max seconds + max tool calls)
  3. add a kill switch config to disable expensive tools during incidents
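The kill switch in step 3 can be as small as one env read at the tool-call boundary; a sketch (the AGENT_DISABLED_TOOLS variable name and return shape are assumptions):

```python
import os


def disabled_tools() -> set:
    # Comma-separated list, e.g. AGENT_DISABLED_TOOLS="http.get,db.write".
    # Flipping this env var disables tools without shipping new code.
    raw = os.getenv("AGENT_DISABLED_TOOLS", "")
    return {t.strip() for t in raw.split(",") if t.strip()}


def call_tool(name: str, args: dict) -> dict:
    if name in disabled_tools():
        # Fail closed during incidents instead of burning budget.
        return {"ok": False, "error": f"tool {name!r} disabled by kill switch"}
    # ... real tool dispatch would go here ...
    return {"ok": True, "result": "stub"}
```

Because the check happens per call, an operator can disable one expensive tool mid-incident while the rest of the agent keeps serving.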

Trade-offs

  • Tight timeouts reduce tail latency and can reduce answer quality.
  • More logging helps debugging and hurts cost/privacy. Default to less.
  • “One container per agent” is simple and expensive. Shared services are cheaper and harder.

When NOT to containerize

If you’re not operating this as a service (no traffic, no SLOs), don’t overbuild. But once a real user can trigger it, you are operating a service. Congrats.

Copy-paste checklist

  • [ ] Budgets loaded from env and enforced at runtime
  • [ ] Tool gateway enforces timeouts/retries/allowlists
  • [ ] Health endpoint + readiness checks
  • [ ] Secrets injected by platform (not baked)
  • [ ] Kill switch config (disable tools / disable writes)
  • [ ] Logs are structured and sampled; PII redaction on by default
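For the second checklist item, a minimal sketch of a gateway that centralizes allowlist, retry, and backoff decisions (names and limits are illustrative; here the tool callable is assumed to raise TimeoutError when it exceeds its own deadline):

```python
import time

ALLOWLIST = {"search.read", "http.get"}  # from config, e.g. tools.allowlist
MAX_RETRIES = 2                          # tools.retries.max
BACKOFF_S = [0.2, 0.8]                   # tools.retries.backoff_ms, in seconds


def call_via_gateway(tool_fn, tool_name: str, args: dict) -> dict:
    """Single choke point: allowlist, retries, and backoff live here, nowhere else."""
    if tool_name not in ALLOWLIST:
        return {"ok": False, "stop_reason": "tool_not_allowed"}
    for attempt in range(MAX_RETRIES + 1):
        try:
            return {"ok": True, "result": tool_fn(args)}
        except TimeoutError:
            if attempt == MAX_RETRIES:
                return {"ok": False, "stop_reason": "tool_timeout"}
            time.sleep(BACKOFF_S[attempt])  # back off before the next attempt
```

Keeping retries in exactly one place is what prevents the client-retries-times-agent-retries storm from the incident above.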

Safe default config snippet (YAML)

YAML
runtime:
  env:
    AGENT_MAX_STEPS: 25
    AGENT_MAX_TOOL_CALLS: 12
    AGENT_MAX_SECONDS: 60
tools:
  allowlist: ["search.read", "http.get"]
  timeouts_ms: { default: 8000 }
  retries: { max: 2, backoff_ms: [200, 800] }
observability:
  sampled_tool_results: true
  result_sample_rate: 0.01
rollout:
  canary_percent: 10
  rollback_on_error_rate: 0.05
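The rollout block implies a decision rule for the canary gate; sketched in Python (the function shape and the baseline comparison are assumptions, only the 0.05 threshold comes from the config above):

```python
def should_rollback(canary_error_rate: float,
                    baseline_error_rate: float,
                    rollback_on_error_rate: float = 0.05) -> bool:
    # Hard gate: absolute threshold from rollout.rollback_on_error_rate.
    if canary_error_rate > rollback_on_error_rate:
        return True
    # Soft gate: canary is clearly worse than the stable baseline,
    # even if it hasn't crossed the absolute threshold yet.
    return canary_error_rate > 2 * max(baseline_error_rate, 0.001)
```

Run this against the same error-rate metric your alerting uses, so the rollback trigger and the pager agree.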

Implement in OnceOnly (optional)

Budgets + tool gateway defaults that survive deployment.
PYTHON
# onceonly-python: tool allowlist + governed tool call
import os
from onceonly import OnceOnly

client = OnceOnly(
    api_key=os.environ["ONCEONLY_API_KEY"],
    timeout=5.0,
    max_retries_429=2,
)

agent_id = "billing-agent"

client.gov.upsert_policy({
    "agent_id": agent_id,
    "allowed_tools": ["search.read", "http.get"],
    "max_actions_per_hour": 200,
    "max_spend_usd_per_day": 10.0,
})

res = client.ai.run_tool(
    agent_id=agent_id,
    tool="http.get",
    args={"url": "https://example.com/health"},
    spend_usd=0.001,
)
if not res.allowed:
    raise RuntimeError(res.policy_reason)

FAQ

Should I bake prompts/models into the image?
Bake code, not secrets. Prompts can be in the repo (versioned). Models are config. Secrets are runtime-only.
What’s the most common deploy failure?
Timeouts + retries interacting badly. You get 504s, then storms. Put retries in one place and cap budgets.
Do I need Kubernetes for this?
Not necessarily. You need budgets, observability, and rollback. You can do that on simpler platforms too.
How do I roll back safely?
Have a kill switch and a previous image/prompt version ready. Roll back on error-rate + spend spikes.

⏱️ 5 min read · Updated Mar 2026 · Difficulty: ★★★
Integrated mention: OnceOnly is a control layer for production agent systems.
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.