case study 04 · composes the /ship stack
Customer-support bot
RAG + a tools layer that hits a (mock) CRM + escalation logic. The product that composes the entire /ship stack.
The product
A support widget. User types: “hi, my Pro plan upgrade keeps failing — error ‘PAYMENT_DECLINED’ but my card is fine.”
The bot routes through three behaviors:
- Account-aware retrieval. Looks up the user’s account, sees they’re on a Pro-Trial that expired yesterday, sees three declined payment attempts, sees the card has a requires_3ds flag set.
- Docs-grounded answer. Cites the docs section on 3DS authentication, points the user to the secure checkout flow, mentions the trial-to-paid conversion deadline.
- Escalation when warranted. If the user replies “I tried that and it still doesn’t work,” the bot files a support ticket with the conversation transcript pre-attached, transitions the chat to a human, and tells the user “I’ve handed this to a human; expect a reply within 4 hours.”
You: hi, my Pro plan upgrade keeps failing — error "PAYMENT_DECLINED"
but my card is fine.
Bot: I see your account (Pro-Trial, expired 2026-04-29) has had 3
declined payments today. The declines are flagged as needing
3D Secure authentication, which your bank requires for the
upgrade amount.
Try the checkout from this link:
https://app.example.com/billing/secure-checkout
Your bank's app should prompt for confirmation. The docs cover
this here: [Billing > 3DS Authentication]
[https://docs.example.com/billing/3ds-auth].
You: I tried that and it still doesn't work.
Bot: I'm escalating this to our billing team — they have tools I don't.
I've created ticket #SUP-4412 with our conversation attached.
Expected response time: under 4 hours. You'll get an email when
a human picks it up.
Three behaviors, one product. This is the most-built LLM product after docs Q&A and code review — every SaaS company is shipping a version of this in 2026.
Architecture
User message
│
▼
┌──────────────────┐
│ Auth + load │ (gets user_id, account_status, etc.)
│ account context │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ SupportAgent │ (Agent from /ship/10 with custom system prompt)
└────────┬─────────┘
│
┌─────────────┼──────────────┬──────────────────┐
▼ ▼ ▼ ▼
┌────────┐ ┌──────────┐ ┌─────────────┐ ┌─────────────┐
│search_ │ │ get_acct │ │ get_billing │ │ create_ticket │
│ docs │ │ state │ │ history │ │ & escalate │
│(RAG) │ │(read-only)│ │(read-only) │ │ (write) │
│/ship/ │ │ │ │ │ │ │
│ 06-08 │ └──────────┘ └─────────────┘ └─────────────┘
└────────┘
│
▼
┌──────────────────┐
│ Citation parse + │ (case study 01: cite-validate, refuse path)
│ escalation gate │
└────────┬─────────┘
▼
response
This is the most composed case study. RAG from CS-01, tool-using agent from CS-02, plus the new ingredient: escalation. Reuses every part of /ship — no new infra, just careful composition.
The “Auth + load account context” step happens once at the start of every conversation. The agent gets the user’s basic context (user_id, plan, account_age_days) injected into the system prompt so it doesn’t have to call a tool just to know who it’s talking to.
The hard parts
Three problems that emerge only when the patterns compose:
1. Knowing when to retrieve, when to call a tool
The model has both search_docs (the docs RAG) and get_account_state (account tool). For a user asking “how do I delete my account?” — does the bot answer from docs (it’s a general policy) or read account state (it’s a personal action)?
The naïve answer is “let the agent decide.” The agent’s right ~85% of the time and embarrassingly wrong ~15% — calling tools for general policy questions (over-personalizing), retrieving docs for account-specific questions (under-personalizing).
Two layers fix this:
(a) The system prompt explicitly maps question types to tools.
SUPPORT_SYSTEM_PROMPT = """\
You are a customer-support assistant for ExampleApp. You're talking to:
user_id: {user_id}
plan: {plan}
account_age_days: {account_age_days}
{sensitivity_hint}
You have these tools:
- search_docs: For general "how does ExampleApp work" questions. Use
this for policies, features, integrations, pricing tiers.
- get_account_state: For "what's my current state" — plan, expiration,
feature flags, recent activity. Read-only.
- get_billing_history: For payment-specific questions — failed
charges, invoices, upgrade attempts. Read-only.
- create_ticket: When you can't resolve something yourself, escalate
to a human. See ESCALATION below.
ROUTING RULES:
1. "How does X work?" → search_docs.
2. "What's my <state>?" or "Why is my account doing X?" → tools first,
then search_docs if you need to explain *why*.
3. Personal action requests ("cancel my account", "refund this charge")
→ tools to look up state, then search_docs to find the right policy
step, then escalate if the policy says "contact us."
ALWAYS combine tool results with citations from search_docs when giving
the user actionable advice. The user wants to know both what's true for
THEIR account AND why.
"""
The explicit routing rules drop wrong-tool-calls from ~15% to ~4%. Worth the prompt-tokens. Same insight as case study 01: the model needs explicit permission for behaviors that don’t come naturally, including “look at account state before answering.”
(b) Eval the routing decisions.
Add a routing metric to the eval set: 50 questions labeled by which tool should be called first. Grade by binary “did the bot call the right tool first.” Aim for 90%+; below that, tune the routing rules.
from dataclasses import dataclass

@dataclass
class RoutingCase:
question: str
expected_first_tool: str # "search_docs" | "get_account_state" | ...
rationale: str
ROUTING_CASES = [
RoutingCase("How does the Pro plan billing work?",
"search_docs", "general policy question"),
RoutingCase("Why was my Pro upgrade declined this morning?",
"get_billing_history", "personal/state question"),
RoutingCase("Can I export my data?",
"search_docs", "general feature question, not state-dependent"),
# ...
]
def grade_routing(case: RoutingCase, agent_steps: list) -> bool:
"""Did the agent's first tool call match the expected one?"""
first_tool = next(
(s for s in agent_steps if s.role == "tool"), None,
)
return first_tool is not None and first_tool.tool_name == case.expected_first_tool
This is one of the highest-leverage evals — a single number per release that tells you whether the bot’s “instincts” are calibrated.
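Aggregating that into the per-release number is a few lines on top of the fixture above. A minimal sketch, assuming a hypothetical run_agent helper that stands in for however your eval harness invokes the SupportAgent and returns the step list grade_routing expects:

```python
# Sketch: per-release routing accuracy over the labeled fixture above.
# `run_agent` is a hypothetical harness hook, not part of the /ship Agent API;
# it takes a question string and returns the agent's step list.
def routing_accuracy(cases: list[RoutingCase], run_agent) -> float:
    correct = sum(
        1 for case in cases
        if grade_routing(case, run_agent(case.question))
    )
    return correct / len(cases)

# Run it per release and gate on the 90% bar, e.g.:
#   assert routing_accuracy(ROUTING_CASES, run_agent) >= 0.90
```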
2. Escalation logic as a first-class skill
The bot needs to escalate to a human when:
- The user explicitly asks for a human.
- The bot has tried to resolve and the user is still stuck after 2–3 turns.
- The question touches a sensitive topic (refunds, account closure, security incidents, possible billing fraud).
- The bot’s confidence in its own answer drops below a threshold.
The naïve approach is “let the agent decide.” It works for the explicit case (“can I talk to a human?”) and fails on the others — the agent over-tries to be helpful and stays in-the-loop too long, which is the worst UX.
Three signals to make escalation deterministic where it should be:
Signal 1 — Sensitive-topic detector. A small LLM call up-front that classifies the message as routine | sensitive | urgent. If sensitive or urgent, the agent’s system prompt gets an extra instruction: “this is a sensitive topic; if you’re not 100% sure, escalate.” Cheap, ~1500 prompt tokens; lifts on-time-escalation rate by ~25%.
SENSITIVITY_PROMPT = """\
Classify this customer support message as one of:
- routine: standard "how does X work" or non-urgent state questions
- sensitive: account closure, refunds, billing disputes, possibly fraud,
privacy/data export, security concerns
- urgent: clear distress, threats of legal action, mention of regulators,
child-related concerns
Output ONLY one word: routine, sensitive, or urgent.
"""
def classify_sensitivity(message: str, llm) -> str:
response = llm.chat(messages=[
{"role": "system", "content": SENSITIVITY_PROMPT},
{"role": "user", "content": message},
], temperature=0.0, max_tokens=10)
return response["choices"][0]["message"]["content"].strip().lower()
Signal 2 — Turn-budget escalation. If the conversation reaches turn 4 without the agent calling create_ticket, the wrapper forces an escalation. Hard cap; not negotiable; the agent doesn’t get a vote.
Signal 3 — Self-confidence flag. Each time the agent emits a final answer, append: [CONFIDENCE: high | medium | low]. The wrapper parses; low triggers escalation. The agent has the option to flag itself uncertain, and it uses it ~12% of the time on hard questions — which is the right rate.
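The wrapper parses that tag with the helper the handler below imports as parse_confidence. The study never shows it, so here is a minimal sketch, assuming the [CONFIDENCE: ...] format above and a "medium" default when the tag is missing:

```python
import re

# Sketch of apps/support/confidence.py; the tag format and the "medium"
# fallback are assumptions, not the canonical implementation.
CONFIDENCE_RE = re.compile(r"\[CONFIDENCE:\s*(high|medium|low)\s*\]", re.IGNORECASE)

def parse_confidence(final_answer: str, default: str = "medium") -> str:
    """Extract the agent's self-reported confidence; fall back when the tag is absent."""
    match = CONFIDENCE_RE.search(final_answer)
    return match.group(1).lower() if match else default

def strip_confidence(final_answer: str) -> str:
    """Remove the tag before the answer is shown to the user."""
    return CONFIDENCE_RE.sub("", final_answer).strip()
```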
Compose all three:
def should_escalate(
message: str, conversation: list[dict], confidence: str, llm,
) -> tuple[bool, str]:
"""Returns (should_escalate, reason)."""
sens = classify_sensitivity(message, llm)
if sens == "urgent":
return True, "urgent topic auto-routes to a human"
if confidence == "low":
return True, "agent flagged its own answer as low confidence"
if len(conversation) >= 8: # 4 user + 4 bot turns
return True, "conversation reached turn budget without resolution"
if sens == "sensitive" and len(conversation) >= 4:
return True, "sensitive topic + 2+ turns; routing to human"
return False, ""
Three orthogonal signals. The agent is great at routine; the wrapper handles the rest. The product principle: don’t let the agent decide when to give up. Build it as a wrapper rule.
3. The seam between RAG and tools
Case study 01 was pure RAG. Case study 02 was pure tools. Case study 04 has both, and the seam is where new bugs live:
Bug class: tool result + docs claim disagree. The bot calls get_account_state and learns the user is on Pro-Trial. It then calls search_docs and retrieves a chunk that says “All Pro-Trial users have unlimited API requests.” It tells the user that. It’s wrong — the chunk is from a 2025 doc and the policy changed. The tool is right (live state), the docs chunk is wrong (stale).
Mitigation: when tool results and doc citations could conflict, the system prompt must explicitly prefer tool results.
WHEN TOOL RESULTS AND DOC CONTENT DISAGREE:
- Tool results reflect the user's CURRENT state. They override doc claims.
- Cite the tool result inline (e.g., "your plan is Pro-Trial [account]") and
treat the doc as a secondary explanation, not the source of truth on
account-specific facts.
- If the doc is clearly stale (e.g., references a deprecated feature),
don't cite it at all.
Worth eval-ing: 20 cases where the tool output and the most-relevant doc chunk disagree (you’ll have to construct these in a test fixture). Grade: did the bot use the tool result and ignore the doc chunk’s incorrect claim? Aim for 95%+.
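Constructing those fixtures can mirror the RoutingCase pattern. A sketch, with illustrative field names and a crude substring grader (a rubric or LLM judge is the more robust option):

```python
from dataclasses import dataclass

@dataclass
class ConflictCase:
    question: str
    mock_tool_output: dict   # injected as the live account/billing state
    stale_doc_chunk: str     # the deliberately outdated claim in the corpus
    must_contain: str        # fact from the tool the answer has to state
    must_not_claim: str      # the stale claim the answer must not repeat

CONFLICT_CASES = [
    ConflictCase(
        question="Do I get unlimited API requests on my trial?",
        mock_tool_output={"plan": "pro_trial", "api_request_limit": 1000},
        stale_doc_chunk="All Pro-Trial users have unlimited API requests.",
        must_contain="1000",
        must_not_claim="unlimited",
    ),
    # ... 19 more, one per known tool-vs-docs disagreement
]

def grade_conflict(case: ConflictCase, answer: str) -> bool:
    """Pass only if the answer states the live value and avoids the stale claim."""
    lower = answer.lower()
    return case.must_contain.lower() in lower and case.must_not_claim.lower() not in lower
```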
Bug class: tool calls eat context. The agent calls 3 tools, each returning ~500 chars; plus 5 doc chunks at ~400 chars each; that’s ~3500 tokens before the agent has even started reasoning. With a 4K-token model context, you’ve blown the budget on input alone.
Mitigation: enforce a smaller top_k on search_docs when tool results are already in scope. The system prompt: “If you’ve already called account-state or billing tools, only retrieve 3 docs chunks max.” Drops doc context by ~60%; quality stays flat (the tool results were the high-information ones).
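If you would rather enforce that cap in code than trust the prompt, a thin wrapper around the search_docs tool works. A sketch; the tools_called set and the constants are assumptions, not part of the /ship tool registry:

```python
DEFAULT_TOP_K = 5
REDUCED_TOP_K = 3
STATE_TOOLS = {"get_account_state", "get_billing_history"}

def make_search_docs(raw_search_docs, tools_called: set[str]):
    """Wrap the RAG tool so the doc budget shrinks once state tools have run."""
    def search_docs(query: str, top_k: int = DEFAULT_TOP_K) -> list[dict]:
        if tools_called & STATE_TOOLS:
            # Tool results are the high-information context; cap the doc chunks.
            top_k = min(top_k, REDUCED_TOP_K)
        return raw_search_docs(query, top_k=top_k)
    return search_docs
```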
The full handler
# apps/support/server.py
from fastapi import FastAPI, Header
from pydantic import BaseModel
from stack.llm import LLM
from stack.tools import ToolRegistry, tool_from_callable
from stack.agent import Agent, AgentConfig
from apps.support.tools import (
search_docs, get_account_state, get_billing_history, create_ticket,
)
from apps.support.escalation import classify_sensitivity, should_escalate
from apps.support.confidence import parse_confidence
from apps.support.prompts import SUPPORT_SYSTEM_PROMPT  # assumed module; the template is defined earlier in this study
app = FastAPI()
llm = LLM()
class SupportRequest(BaseModel):
user_id: str
message: str
conversation: list[dict] = [] # prior turns
@app.post("/support/message")
async def handle(req: SupportRequest, x_session_id: str = Header(None)):
# 1. Pre-flight: classify and load context
sensitivity = classify_sensitivity(req.message, llm)
user_ctx = await get_account_state(req.user_id)
system = SUPPORT_SYSTEM_PROMPT.format(
user_id=req.user_id,
plan=user_ctx["plan"],
account_age_days=user_ctx["age_days"],
sensitivity_hint=("This message is flagged sensitive — escalate if uncertain." if sensitivity == "sensitive" else ""),
)
# 2. Build the agent
registry = ToolRegistry()
for fn in (search_docs, get_account_state, get_billing_history, create_ticket):
registry.register(tool_from_callable(fn))
agent = Agent(llm, registry, system, AgentConfig(
max_iters=10, max_seconds=30, max_tokens=8000,
temperature=0.1, history_limit=15,
))
# 3. Run
history = req.conversation + [{"role": "user", "content": req.message}]
result = agent.run_continuation(history) # variant of Agent.run for ongoing chats
confidence = parse_confidence(result.final)
# 4. Wrapper escalation check
full_conv = history + [{"role": "assistant", "content": result.final}]
should, reason = should_escalate(req.message, full_conv, confidence, llm)
if should:
ticket = await create_ticket(
user_id=req.user_id,
session_id=x_session_id,
transcript=full_conv,
escalation_reason=reason,
)
return {
"answer": (
"I'm handing this to a human. "
f"Ticket #{ticket['id']} — expect a reply within 4 hours."
),
"escalated": True,
"ticket_id": ticket["id"],
}
return {
"answer": result.final,
"escalated": False,
"confidence": confidence,
}
~80 lines including pre-flight, agent run, wrapper escalation. The agent itself is ~10 lines of Agent instantiation; the rest is the product-specific glue around it. This is the consistent shape across every case study.
The eval results
Two months running on a fictional-customer dataset (real customer-support transcripts are sensitive; the eval panel is 500 hand-crafted conversations plus prod traffic). Numbers from a senior-engineer panel grading 200 conversation transcripts:
| Metric | Score |
|---|---|
| First-response correctness (1–5) | 4.05 |
| Escalation precision (escalated correctly when escalated) | 89% |
| Escalation recall (caught when should escalate) | 82% |
| Routing accuracy (right tool first) | 92% |
| Conversational coherence over 4+ turns (1–5) | 3.85 |
| Citation correctness (when citing docs) | 91% |
| Resolution rate (no human needed, user satisfied) | 64% |
The metrics, formalized:
# routing accuracy — did the agent call the right tool first?
routing_accuracy = |first_tool_was_correct| / |total_conversations|
# graded against a labeled ground-truth tool per question type
# escalation precision/recall — for the human-handoff decision
escalation_precision = TP_esc / (TP_esc + FP_esc)
# of escalated conversations, how many should have been
escalation_recall = TP_esc / (TP_esc + FN_esc)
# of conversations that should have escalated, how many did
# resolution rate — the headline
resolution_rate = |bot_handled_AND_user_satisfied| / |total_conversations|
# "user satisfied" = explicit thumbs-up OR no follow-up
# within 24 hours
# the 24-hour follow-up rate is the OTHER half of the headline
followup_rate = |same_user_returned_with_same_topic_24h_later| / |bot_handled|
# real resolution_rate = resolution_rate · (1 − followup_rate)
Why both halves matter. A bot that “resolves” 64% of conversations but has a 30% follow-up rate is really at 45%. Production support bots that ship without tracking follow-ups consistently overestimate their value by 20–40%.
The headline: 64% resolution rate without a human. That’s the metric that matters for ROI. Out of 1000 conversations, ~640 resolve at the bot, ~360 escalate. Industry benchmarks for human-only support on similar product surfaces resolve ~40% of tickets at the first-line agent; the bot does better than that on the easy 60% and frees humans for the hard 40%.
Where the bot fails:
- Long conversations (3.85 coherence at 4+ turns vs. 4.30 at 1–2 turns). History pruning helps but coherence still drops. v2 would summarize-and-replace older turns.
- Tool-heavy paths lose the most quality. When the bot calls 3+ tools, its reasoning gets fragmented. Mitigation: tighter system-prompt structure for state-heavy questions.
- Cross-channel context. v1 doesn’t see prior tickets, prior chats, the user’s email history. v2 would inject “user has had 3 prior support contacts in the last 30 days” so the bot starts grounded.
The ROI calculation
Real teams have to defend this in a budget meeting. Here’s the math, transparent:
Costs (per 1000 conversations):
- LLM costs: $42 (Llama-3.1-8B 4-bit on a Modal deploy from /ship/15, with /ship/14 cost levers — caching, prefix caching, speculative decoding)
- Hosting: $20/month amortized
- Eval pipeline + observability tooling: $15/month amortized
- Engineer maintenance: ~10 hours/month at $100/hr = $1000/month
Savings (per 1000 conversations):
- Resolved without human: 640 conversations
- Average human-resolution time: 15 minutes
- Average loaded human cost: $0.50/min (loaded = salary + benefits + overhead)
- Savings: 640 × 15 × $0.50 = $4800
Net per 1000 conversations: $4800 − $42 = +$4758 saved, before fixed costs.
At 5000 conversations/month: ~$24K/month gross savings, minus $1100/month fixed = **$23K/month net**. Engineer maintenance pays for itself in ~2 days of conversation volume.
The catch: this math assumes the resolution-rate metric is honest. If the bot resolves 64% of conversations but 30% of those “resolved” users come back with the same question 24 hours later, the actual resolution rate is closer to 45% and the savings are halved. Track 24-hour follow-up rate as part of the eval pipeline; don’t accept “resolved” as the final word.
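The whole calculation, follow-up correction included, fits in a few lines if you want to re-run it with your own numbers. Defaults are the figures quoted above; everything else is arithmetic:

```python
def support_bot_roi(
    convs_per_month: int = 5000,
    resolution_rate: float = 0.64,
    followup_rate: float = 0.30,            # "resolved" users who return within 24h
    human_minutes_per_ticket: float = 15,
    loaded_cost_per_minute: float = 0.50,
    llm_cost_per_1k_convs: float = 42.0,
    fixed_monthly: float = 20 + 15 + 1000,  # hosting + eval/observability + maintenance
) -> dict:
    real_resolution = resolution_rate * (1 - followup_rate)
    resolved = convs_per_month * real_resolution
    gross_savings = resolved * human_minutes_per_ticket * loaded_cost_per_minute
    variable_cost = convs_per_month / 1000 * llm_cost_per_1k_convs
    return {
        "real_resolution_rate": real_resolution,   # 0.448 with the defaults
        "gross_savings": gross_savings,            # $16,800/month with the defaults
        "net_savings": gross_savings - variable_cost - fixed_monthly,
    }
```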
What we’d change in v2
After 2 months live, four changes for v2:
- Persistent conversation memory. v1 treats each conversation as fresh. Adding “user had ticket #4321 last week about the same plan question” lifts coherence and resolution rate noticeably. Cost: a per-user vector store of past conversations + retrieval at session start.
- Multi-channel. v1 is one widget. The user emails the same question 4 hours later and the bot doesn’t recognize it. Wiring email + chat + Slack into one shared memory layer is a ~2-week project that pays back in cohesion.
- Live-agent assist mode. Instead of escalating to a human, escalate to a human-with-bot-suggestion. The human gets the conversation transcript plus three suggested replies the bot drafted. Lifts human throughput ~2× without removing the human’s judgment. Best-of-both-worlds shape.
- Tool quotas per session. Some users send messages that cause the bot to call tools 8 times in one turn (e.g. “show me everything about my account”). Cap tool calls per turn and force the bot to summarize at the cap (a sketch follows this list).
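A sketch of what that per-turn quota could look like as wrapper code, in the same spirit as the escalation rules; the counter plumbing and the summarize message are assumptions about where your agent loop dispatches tool calls:

```python
MAX_TOOL_CALLS_PER_TURN = 5

def with_turn_quota(fn, counter: dict):
    """Wrap a tool callable so the model is told to wrap up once the quota is spent."""
    def wrapped(**kwargs):
        counter["calls"] = counter.get("calls", 0) + 1
        if counter["calls"] > MAX_TOOL_CALLS_PER_TURN:
            # Returned as a normal tool result, so the model sees it and summarizes.
            return {"error": "Tool budget for this turn is spent; "
                             "summarize what you already have for the user."}
        return fn(**kwargs)
    return wrapped
```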
The thing we’d not change: the wrapper-controlled escalation. Three independent signals (sensitivity, turn-budget, confidence) routing through code, not the agent. It’s been the most reliable safety net against bot misbehavior.
Try this — predict the eval delta
Mental experiments to play forward on this stack:
- Disable the abstain wrapper (always answer). Predict: resolution rate creeps up (~64% → ~68%) — bot answers more — but escalation precision drops sharply (89% → ~70%). The bot now escalates fewer cases but the wrong ones; humans see noise. Net experience: worse, even though the headline metric went up. This is why product KPIs matter more than headline metrics.
- Drop sensitivity classification (no per-message risk tagging). Predict: resolution rate barely moves; but escalation recall on truly sensitive cases (refunds, fraud, legal) drops from ~92% to ~75%. The lost 17% are the ones a real team will hear about — angry customer threads, regulatory complaints, etc. The Hallucination demo shows the same pattern: low support score → abstain, exactly the threshold-based logic this case study uses.
- Use a single retrieve-then-answer pass (no tool routing). Predict: routing accuracy is N/A (no routing); resolution rate drops from 64% to ~48%. Account-specific questions (“when does my plan expire?”) need state, not docs — see the Tool Use demo for what that round-trip looks like.
- Set the working memory limit to 8 turns (vs. unlimited). Predict: coherence at 4+ turns climbs slightly (less drift); coherence at 8+ turns crashes (the bot forgets the original question). The Memory Systems demo shows this trade live: drag the working-window slider and watch the retrieval probe fall through to episodic.
- Add the v2 persistent conversation memory. Predict: resolution rate climbs from 64% to ~72% on returning users; coherence at 4+ turns lifts from 3.85 to ~4.3. The cost: a per-user vector store + retrieval at session start. The RAG Visualizer demo shows the same retrieval primitive applied to user history instead of docs.
Cross-references
Demos that exercise the underlying pieces:
- Hallucination demo — abstain-threshold slider showing the support-score-based escalation logic in isolation
- Memory Systems demo — working-window slider + per-turn retrieval probe, the same pattern the v2 conversation memory would use
- RAG Visualizer demo — closed-corpus retrieval over docs, used by the bot’s search_docs tool
- Tool Use demo — schema → call → result protocol, the shape of every tool the bot uses
- Agent Trace demo — full agent loop with failure-injection toggle (transient/permanent) showing what tool-error recovery looks like
- Observability demo — the waterfall view this bot produces in prod, with perturbation toggle (slow rerank, slow LLM, tool timeout) for what-if debugging
Code-side companions in /ship:
- /ship/05 — Wrap as FastAPI — the server pattern this study builds on
- /ship/06–08 — RAG pipeline — the docs retrieval layer
- /ship/09 — Tools and function calling — the four-tool registry
- /ship/10 — Agent loop — the agent the support bot wraps
- /ship/12 — Observability — required for debugging multi-tool agents in prod
- /ship/13 — Evaluation in production — the eval patterns this study extends
- /ship/15 — Deploy — the deploy target for a real support bot
What this case study taught vs /ship
What /ship taught (and you reused):
- The whole thing. RAG, tools, agent loop, observability, evals, deploy.
What this case study added on top:
- Pattern composition — RAG (CS-01) + tools (CS-02) interacting, with the seam-bug class
- Escalation logic as wrapper code, not agent code — three independent signals
- Routing accuracy as a first-class metric — the bot’s “instincts,” measurable
- The ROI math that real teams have to defend in a budget meeting
That ratio is finally not 70/30; it’s closer to 80/20, because by the fourth case study most of the new work is composition rather than new patterns. The arc of these case studies is “the studies get smaller as the foundations stack up.” That’s the structure working.
Wrapping the case studies arc
You’ve now seen four products built on the /ship stack:
- CS-01 (Docs assistant): retrieval-heavy. RAG quality + citation discipline.
- CS-02 (Code-review agent): agent-heavy. Tools that propose, deterministic finalization, action-rate as the metric.
- CS-03 (Research assistant): orchestrator-heavy. Fan-out + synthesis + the cost/latency math.
- CS-04 (Customer-support bot): all-of-the-above. Composition, escalation as wrapper code, the ROI defense.
Three meta-lessons from the arc:
- Real products are 70–80% /ship reuse + 20–30% product glue. The glue is where your judgment shows; the foundations don’t have to be reinvented.
- The hard parts are not in the code. They’re in the metrics. Action rate, refusal precision, routing accuracy, escalation precision/recall, cite-correctness — picking the right metric is the engineering.
- Trust > coverage. Across all four studies, the version that explicitly says “I don’t know” or “let me hand this off” or “I’m not commenting” outperforms the helpful-at-all-costs version. The user notices. Optimize for trust.
If you’ve worked through /build (foundations), /ship (production), and these four case studies (composition), you have everything you need to ship a real AI product. The next step is to actually ship one.
Now go build.