The problem
Support is where “helpful automation” goes to die.
You want an agent to draft replies because the queue is on fire. If it sends the wrong thing to the wrong customer, you’ll learn what “brand damage” actually means.
So we do this the boring way:
- read context
- draft a reply
- do not send
- ask a human to approve
Why this happens in real systems
Support tickets are messy:
- incomplete info
- angry users
- account-specific context
- internal policies the model will “summarize” into nonsense if you let it
Also: “send email” is a side effect. Side effects need policy.
What breaks if you ignore it
- accidental sends (“we refunded you” when you didn’t)
- leaking internal notes into a customer email
- making commitments your team can’t keep
Code (safe-by-default)
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Budget:
    max_steps: int = 20
    max_seconds: int = 45

class SafeTools:
    def __init__(self, allow: set[str]):
        self.allow = allow

    def call(self, name: str, *, args: dict[str, Any]) -> Any:
        if name not in self.allow:
            raise RuntimeError(f"tool not allowed: {name}")
        return tool_impl(name, args=args)  # (pseudo)

def draft_support_reply(ticket_id: str, *, tools: SafeTools, budget: Budget) -> dict[str, Any]:
    ticket = tools.call("tickets.get", args={"id": ticket_id})
    customer = tools.call("customers.get", args={"id": ticket["customer_id"]})
    kb = tools.call("kb.search", args={"q": ticket["subject"], "k": 5})
    draft = llm_write_reply(ticket=ticket, customer=customer, kb=kb)  # (pseudo)
    return {
        "ticket_id": ticket_id,
        "draft": draft,
        "requires_human_approval": True,
    }

export class SafeTools {
  constructor({ allow }) {
    this.allow = allow;
  }

  async call(name, { args }) {
    if (!this.allow.has(name)) throw new Error("tool not allowed: " + name);
    return toolImpl(name, { args }); // (pseudo)
  }
}
export async function draftSupportReply(ticketId, { tools, budget }) {
  void budget;
  const ticket = await tools.call("tickets.get", { args: { id: ticketId } });
  const customer = await tools.call("customers.get", { args: { id: ticket.customer_id } });
  const kb = await tools.call("kb.search", { args: { q: ticket.subject, k: 5 } });
  const draft = await llmWriteReply({ ticket, customer, kb }); // (pseudo)
  return { ticket_id: ticketId, draft, requires_human_approval: true };
}

The real workflow (what we run, not what we demo)
Support automation fails when you skip the boring pipeline.
This is the pipeline we like:
- Fetch context (read-only)
- ticket content + metadata (plan, tier, region, language)
- customer profile (plan, history, last incident)
- policy snippets (refund rules, SLA rules)
- Draft
- draft a reply
- draft internal notes (what to do next, what to check)
- Human approval
- show the draft + citations to internal policy snippets
- require explicit approval
- Send (separate tool)
- send is a write-side effect, so it’s gated
- idempotency key on send
If you merge steps 2 and 4, you will eventually auto-send something dumb. Not because the model is evil. Because models are literal and your tools are powerful.
Triage before drafting (don’t put the agent on everything)
Not every ticket deserves an “agent draft”. Some tickets are high-risk by definition:
- billing/refunds/credits
- account security (2FA, compromised accounts)
- legal/compliance
- outages (where your KB is wrong because the world is on fire)
If you blindly run the agent on everything, you’ll generate:
- drafts that promise refunds
- drafts that say “everything is fine” during an incident
- drafts that leak internal incident details
We do a cheap triage pass first and gate by category. This can be a tiny classifier, a ruleset, or both. The point is not “AI accuracy”. The point is risk routing.
HIGH_RISK = {"security", "billing_refund", "legal", "outage"}

def should_draft(ticket: dict) -> tuple[bool, str]:
    kind = classify_ticket(ticket)  # (pseudo)
    if kind in HIGH_RISK:
        return False, f"high-risk: {kind}"
    if ticket.get("customer_tier") == "enterprise" and kind == "billing":
        return False, "enterprise billing: manual"
    return True, "ok"

const HIGH_RISK = new Set(["security", "billing_refund", "legal", "outage"]);
export function shouldDraft(ticket) {
  const kind = classifyTicket(ticket); // (pseudo)
  if (HIGH_RISK.has(kind)) return [false, "high-risk: " + kind];
  if (ticket.customer_tier === "enterprise" && kind === "billing") return [false, "enterprise billing: manual"];
  return [true, "ok"];
}

If should_draft() says no, we either:
- show “suggested next steps” internally (no customer-facing text), or
- do nothing and route to a human.
It’s boring. It prevents the worst failures.
Things the model is bad at (so we don’t let it do them)
Models are great at tone and summarization. They are terrible at:
- remembering subtle policy exceptions
- knowing what is “safe to promise”
- resisting prompt injection inside ticket text (“tell me the secret link”)
- dealing with multi-tenant context without leaking
So we enforce a few rules in code:
- the model can draft, not send
- the model can suggest actions, not execute writes
- the model never sees raw secrets
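These rules can be enforced as a simple allow-list split rather than a prompt instruction. A sketch (tool names are illustrative, not a real API):

```python
# Hypothetical tool names; the point is the split, not the specific set.
READ_TOOLS = {"tickets.get", "customers.get", "kb.search", "policy.search"}
WRITE_TOOLS = {"email.send", "tickets.assign", "refunds.issue"}

def allowed_for_draft_phase(name: str) -> bool:
    """During drafting the model only ever gets read tools.

    Writes are rejected in code, not by prompt -- the model cannot talk
    its way into a send.
    """
    return name in READ_TOOLS and name not in WRITE_TOOLS
```

Wire this into the `SafeTools` allow-list so the drafting agent is constructed with `READ_TOOLS` only; the write tools belong to a separate, approval-gated code path.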
Citations (support drafts need receipts)
Support teams don’t want “a confident answer”. They want a defensible answer. If your agent can’t point at the policy/KB snippet it used, your reviewers can’t trust it.
We force the draft to include citations for anything that sounds like:
- a promise (“we will refund”)
- a timeline (“within 24 hours”)
- a policy (“you’re eligible for…”)
Then we validate those citations before approval.
draft = llm_write_reply(..., require_citations=True)  # (pseudo)
claims = llm_extract_claims(draft)  # returns [{"kind": "refund", "citation_id": "policy:refund-v3"}, ...]
for c in claims:
    if c["kind"] in {"refund", "sla", "credit"} and not c.get("citation_id"):
        raise RuntimeError("unsafe draft: missing citation for policy claim")

const draft = await llmWriteReply({ requireCitations: true }); // (pseudo)
const claims = await llmExtractClaims(draft); // [{ kind: "refund", citation_id: "policy:refund-v3" }, ...]
for (const c of claims) {
  if ((c.kind === "refund" || c.kind === "sla" || c.kind === "credit") && !c.citation_id) {
    throw new Error("unsafe draft: missing citation for policy claim");
  }
}

This doesn’t eliminate hallucinations. It makes them easier to catch. And it gives reviewers something better than “trust me bro”.
Guardrails that actually reduce incidents
“No commitments” mode
Support drafts should avoid making promises:
- don’t promise refunds
- don’t promise timelines
- don’t promise credits
Instead:
- “we’ll investigate”
- “we can do X if Y”
- “I’ve escalated this internally”
If you let the model promise things, it will promise things. It’s trying to be helpful.
PII / secrets redaction
If your customer profile includes secrets, redact before the model sees it. If you don’t, you’ll eventually paste a token into an email draft. Then you’ll enjoy rotating credentials at 03:00.
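A minimal sketch of the `redact` step used later in the pipeline. The field names and token pattern are assumptions; extend them for your own secret formats:

```python
import re

# Hypothetical token shapes and field names; adjust for your own systems.
TOKEN_RE = re.compile(r"\b(?:sk|tok|key)_[A-Za-z0-9]{8,}\b")
SECRET_FIELDS = {"api_key", "session_token", "password"}

def redact(customer: dict) -> dict:
    """Drop secret fields and mask token-like strings before the model sees them."""
    safe = {}
    for k, v in customer.items():
        if k in SECRET_FIELDS:
            continue  # never forward secret fields at all
        if isinstance(v, str):
            v = TOKEN_RE.sub("[REDACTED]", v)
        safe[k] = v
    return safe
```

Denylist-by-field plus pattern-masking is crude, but it is the cheap layer that catches the "token pasted into a note" case before the model ever sees it.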
Rate limits and budgets
Support traffic spikes are real. When the queue is on fire, costs go up fast. Budgets protect you from “helpful” retries during outages.
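The `Budget` dataclass above is just data; something has to enforce it. A hedged sketch of an enforcing guard (the class name and `tick()` interface are ours, not from any library):

```python
import time
from dataclasses import dataclass, field

@dataclass
class BudgetGuard:
    """Counts steps and wall-clock time; raises instead of retrying forever."""
    max_steps: int = 20
    max_seconds: float = 45.0
    steps: int = 0
    started: float = field(default_factory=time.monotonic)

    def tick(self) -> None:
        """Call once per tool call / model call; raises when the budget is spent."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("budget exceeded: too many steps")
        if time.monotonic() - self.started > self.max_seconds:
            raise RuntimeError("budget exceeded: out of time")
```

Calling `guard.tick()` inside the tool layer means a "helpful" retry loop during an outage dies after 20 steps instead of burning the bill.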
Production-style code (with artifacts + audit)
This is still simplified, but it shows the shape:
from dataclasses import dataclass
import time
import uuid
from typing import Any

@dataclass(frozen=True)
class Budget:
    max_steps: int = 20
    max_seconds: int = 45

def draft_support_reply(ticket_id: str, *, tools, budget: Budget) -> dict[str, Any]:
    request_id = uuid.uuid4().hex
    started = time.time()
    ticket = tools.call("tickets.get", args={"id": ticket_id}, request_id=request_id)
    customer = tools.call("customers.get", args={"id": ticket["customer_id"]}, request_id=request_id)
    policy = tools.call("policy.search", args={"q": "refund policy", "k": 5}, request_id=request_id)
    # redact before model
    safe_customer = redact(customer)  # (pseudo)
    draft = llm_write_reply(ticket=ticket, customer=safe_customer, policy=policy)  # (pseudo)
    artifact_id = tools.call(
        "artifacts.put",
        args={"type": "support_draft", "ticket_id": ticket_id, "draft": draft},
        request_id=request_id,
    )
    tools.call(
        "audit.emit",
        args={"type": "support.draft.created", "ticket_id": ticket_id, "artifact_id": artifact_id},
        request_id=request_id,
    )
    return {
        "ticket_id": ticket_id,
        "draft": draft,
        "artifact_id": artifact_id,
        "requires_human_approval": True,
        "request_id": request_id,
    }

import crypto from "node:crypto";
export async function draftSupportReply(ticketId, { tools, budget }) {
  void budget;
  const requestId = crypto.randomUUID().replace(/-/g, "");
  const started = Date.now();
  const ticket = await tools.call("tickets.get", { args: { id: ticketId }, requestId });
  const customer = await tools.call("customers.get", { args: { id: ticket.customer_id }, requestId });
  const policy = await tools.call("policy.search", { args: { q: "refund policy", k: 5 }, requestId });
  const safeCustomer = redact(customer); // (pseudo)
  const draft = await llmWriteReply({ ticket, customer: safeCustomer, policy }); // (pseudo)
  const artifactId = await tools.call(
    "artifacts.put",
    { args: { type: "support_draft", ticket_id: ticketId, draft }, requestId },
  );
  await tools.call(
    "audit.emit",
    { args: { type: "support.draft.created", ticket_id: ticketId, artifact_id: artifactId }, requestId },
  );
  void started;
  return { ticket_id: ticketId, draft, artifact_id: artifactId, requires_human_approval: true, request_id: requestId };
}

Yes, it’s more plumbing. But plumbing is cheaper than angry customers.
Real failure
We saw a team add “email.send” because “it’s just a draft anyway”. The model interpreted “send draft to customer” literally.
Impact:
- ~20 wrong emails sent in a day (not catastrophic, but embarrassing)
- hours of cleanup
- trust hit with support team (“don’t touch the bot”)
Fix:
- separate tool for “create draft” vs “send”
- require human approval for any write/send tool
- store drafts as artifacts with a clear audit trail
Why people do this wrong
- They optimize for automation rate instead of error rate.
- They mix “internal notes” and “customer response” in the same channel.
- They let the model decide when to send.
Trade-offs
- Human approval adds latency.
- You get fewer fully-automated resolutions.
- You also get fewer incidents. Worth it.
What we measure (so it doesn’t quietly get worse)
Support agents degrade over time because:
- product policies change
- templates change
- the ticket distribution changes (new issues)
We track a few boring metrics:
- % drafts approved without edits
- % drafts needing “major rewrite”
- # of “unsafe” suggestions caught in review (refund promises, policy violations)
- p95 runtime (if it spikes, the tool layer is probably failing)
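These metrics are cheap to compute from review records. A sketch, assuming each record carries an `outcome`, an `unsafe` flag, and a `runtime_s` (our field names, not a standard):

```python
def p95(values: list[float]) -> float:
    """Nearest-rank p95; good enough for a dashboard."""
    ordered = sorted(values)
    idx = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[idx]

def draft_metrics(reviews: list[dict]) -> dict:
    """Summarize a non-empty batch of review records.

    Each record is assumed to look like:
    {"outcome": "approved" | "edited" | "major_rewrite",
     "unsafe": bool, "runtime_s": float}
    """
    n = len(reviews)
    return {
        "approved_no_edit_pct": 100 * sum(r["outcome"] == "approved" for r in reviews) / n,
        "major_rewrite_pct": 100 * sum(r["outcome"] == "major_rewrite" for r in reviews) / n,
        "unsafe_caught": sum(r["unsafe"] for r in reviews),
        "p95_runtime_s": p95([r["runtime_s"] for r in reviews]),
    }
```

Run it nightly over the review log and alert on trend, not on single values.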
If “major rewrite” rate climbs, don’t tune prompts first. Look at:
- tool context quality (are you fetching the right KB items?)
- policy snippets (are you feeding outdated rules?)
- redaction (did you remove the useful context by accident?)
A template strategy (so replies don’t sound like a bot)
The model is good at prose. It’s bad at consistency.
We give it structure:
- greeting
- short acknowledgement
- 1–3 bullets of actions taken / next steps
- questions (only if required)
- closing
Then we keep the model inside that structure. Not because we love templates. Because support teams hate surprises.
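"Keep the model inside the structure" can be a code check, not a prompt plea. A sketch, assuming the model emits the draft as a dict with one key per section (our schema, purely illustrative):

```python
# Hypothetical section schema for a structured draft.
REQUIRED_SECTIONS = ["greeting", "acknowledgement", "next_steps", "closing"]

def validate_structure(draft: dict) -> list[str]:
    """Return a list of structural problems; empty means the draft fits the template."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if not draft.get(section):
            problems.append(f"missing section: {section}")
    bullets = draft.get("next_steps") or []
    if not (1 <= len(bullets) <= 3):
        problems.append("next_steps must be 1-3 bullets")
    return problems
```

If the list is non-empty, re-prompt or route to a human; never paper over a structural failure by sending freeform prose.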
When we consider auto-send (rarely)
Auto-send is a maturity milestone, not a starting point.
We only consider it when:
- the tool layer can prove the draft is safe (policy checks)
- the action is reversible or low-risk
- we’ve seen enough volume to trust the failure rate
And even then, we start with:
- internal tickets
- or low-tier customers
- or informational replies with no commitments
Approval UX (how to make humans actually approve)
If approvals feel annoying, people will either:
- rubber-stamp them
- or bypass the agent entirely
So we keep approvals lightweight:
- show the draft
- show the “claims” the draft makes (refund? SLA? escalation?)
- show which internal policy snippets were used
- show what tools were called (read-only trace)
Good approval UI answers:
- what will be sent?
- what are we promising?
- what’s the customer impact if this is wrong?
If your approval UI is “here’s 40 lines of JSON args”, it won’t work.
Escalation & handoff (so humans don’t start cold)
The goal isn’t “replace support”. The goal is “make the next human step faster”.
When the agent can’t safely draft (high-risk category, missing context, policy conflict), it should still produce a useful handoff artifact:
- 5–10 line summary of what the user reported
- suspected category (billing/bug/how-to)
- what data was pulled (account status, plan, last incidents)
- what it tried (KB hits, similar tickets)
- what it refused to do (writes, refunds) and why
- links to artifacts + trace (so a senior can audit quickly)
We’ve seen this cut handle time by ~20–40% on repetitive tickets, even when the agent never sends a single email.
handoff = {
    "ticket_id": ticket_id,
    "summary": summarize(ticket),
    "suspected_kind": classify_ticket(ticket),
    "kb_hits": [x["id"] for x in kb],
    "stop_reason": "high-risk: billing_refund",
}
tools.call("tickets.add_internal_note", args=handoff, request_id=request_id)
tools.call("tickets.assign", args={"id": ticket_id, "team": "billing"}, request_id=request_id)

const handoff = {
  ticket_id: ticketId,
  summary: summarize(ticket),
  suspected_kind: classifyTicket(ticket),
  kb_hits: kb.map((x) => x.id),
  stop_reason: "high-risk: billing_refund",
};
await tools.call("tickets.add_internal_note", { args: handoff, requestId });
await tools.call("tickets.assign", { args: { id: ticketId, team: "billing" }, requestId });

If your agent stops with “I can’t help”, it’s not an agent. It’s a fancy error message. We also attach the last ~20 tool calls to the handoff, because “trust” starts with “show me what it touched”. It helps in postmortems too.
Common edge cases (the ones that bite you)
Angry users
The model will try to de-escalate. Sometimes it does. Sometimes it says something that makes it worse.
We add a simple rule:
- be concise
- don’t argue
- don’t promise
Also: don’t let the model decide refunds or credits. That’s policy, not tone.
Multi-language tickets
If you operate globally, you’ll see:
- tickets in multiple languages
- internal notes in English
Make sure your pipeline is explicit:
- detect language
- draft in the customer’s language
- keep internal notes in your team language
Otherwise you’ll leak internal notes into customer-facing text.
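The routing step can be made explicit as a tiny function. A sketch; `detect_language` is injected (e.g. a small classifier or library) and is an assumption, not a real API:

```python
def route_languages(ticket: dict, *, detect_language) -> dict:
    """Explicit language routing: reply in the customer's language, notes in English.

    `detect_language` is a hypothetical injected callable that maps text to a
    language code (e.g. "es", "de").
    """
    customer_lang = detect_language(ticket["body"])
    return {
        "draft_language": customer_lang,  # customer-facing reply
        "notes_language": "en",           # internal notes stay in the team language
    }
```

The point of returning both explicitly: downstream code can assert that internal-note text never ends up in the customer draft's language channel.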
Attachments and screenshots
Support tickets often include screenshots. Those can contain secrets.
If you OCR images and feed the text into the model:
- redact aggressively
- log access (audit)
- keep budgets tight (OCR can be expensive)
If you can avoid it, avoid it.
Testing (yes, even for “just drafting”)
We run tests against:
- policy violations (refund promises, SLA guarantees)
- “unsafe” phrases (committing to actions we can’t do)
- PII leakage (tokens, internal links, secrets)
This isn’t perfect. But it catches the big failures.
If you ship without tests, you’ll learn about the failures from customers.
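A minimal sketch of what these checks look like as code. The patterns are examples, not a complete ruleset; extend them with your own policy language and token formats:

```python
import re

# Illustrative forbidden patterns: (regex, label).
FORBIDDEN = [
    (re.compile(r"\bwe(?: will|'ll) refund\b", re.I), "refund promise"),
    (re.compile(r"\bwithin \d+ (?:hours?|days?)\b", re.I), "timeline commitment"),
    (re.compile(r"\bsk_[A-Za-z0-9]{8,}\b"), "leaked token"),
]

def unsafe_phrases(draft: str) -> list[str]:
    """Return the labels of every forbidden pattern that appears in a draft."""
    return [label for pattern, label in FORBIDDEN if pattern.search(draft)]
```

Run it over every draft in CI against a corpus of past tickets, and as a gate before the approval UI; a non-empty result blocks the draft.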
When NOT to use this
Don’t use a support agent when:
- you can’t safely separate read vs write tooling
- you can’t review outputs
- you don’t have a place to store drafts + audit logs
Link it up
- Foundations: Tool calling
- Control layer: Tool permissions
- Failure mode: Infinite loop