case study 03 · composes the /ship stack
Research assistant
Multi-agent fan-out for cited briefs. The product /ship/11 wants to be — with the cost/latency trade-offs in real numbers.
The product
A /research endpoint. You POST a question. 60–90 seconds later you get:
{
"brief": "Markdown text, ~600 words, with [n] citations …",
"citations": [
{"n": 1, "url": "https://blog.langchain.dev/...", "title": "..."},
{"n": 2, "url": "https://arxiv.org/abs/...", "title": "..."},
...
],
"trace_id": "...",
"stats": {
"subtasks": 4,
"workers_succeeded": 4,
"total_tokens": 18420,
"wall_clock_seconds": 64.2,
"critic_verdict": "SHIP_IT"
}
}
The brief looks like this (truncated):
## What's the state of speculative decoding for OSS models in 2026?
Speculative decoding is now production-ready for the major OSS instruct
families. vLLM 0.6+ ships native support; the standard pattern is a
small (1B–3B) draft model alongside a larger (8B–70B) target [1] [2].
**Latest measurements.** Llama-3.1-8B with Llama-3.2-1B as draft yields
a 1.4–1.8× throughput multiplier on common Q&A workloads [1]; gains
shrink to ~1.2× on creative-generation workloads where draft accuracy
falls [3]. The verification step preserves the target's distribution
exactly — speculative decoding is loss-free in expectation [4].
**What changed in 2026.** Two things: (a) tree-based speculation, where
the draft proposes a tree of K candidates verified in parallel, lifts
gains to 2.0–2.5× on benchmarks [5]…
Five-paragraph brief, six to ten citations, all clickable to actual sources. Useful for a literature review, a competitive scan, or “what’s the current state of X.” The kind of artifact a smart engineer would produce in 30 minutes; we ship it in 60–90 seconds for a few cents.
This is the product shape behind Perplexity’s deep research, Elicit’s literature reviews, and the “research” mode in every consumer AI product. The /ship orchestrator handles 70% of the work; this case study is about the 30% the curriculum doesn’t directly answer.
Architecture
/research request
│
▼
┌──────────────┐
│ Supervisor │ (one LLM call: question → 3–5 subtasks)
└──────┬───────┘
│
┌──────┴──────┬──────────────┬──────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Worker 1 │ │Worker 2 │ │Worker 3 │ │Worker 4 │ each = /ship/10 Agent
│ web_ │ │ web_ │ │ web_ │ │ web_ │ with web_search,
│ search │ │ search │ │ search │ │ search │ fetch_url, summarize
│ fetch │ │ fetch │ │ fetch │ │ fetch │ tools
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
└────────────┴──────┬───────┴──────────────┘
│ fan-in
┌────────▼─────────┐
│ Combiner (LLM) │ synthesize, NOT just concatenate
└────────┬─────────┘
│
┌────────▼─────────┐
│ Critic (LLM) │ fact-check + recommend revisions
└────────┬─────────┘
▼
brief + cites
Built directly on the orchestrator from /ship/11. The supervisor decomposes; workers fan out in parallel; combiner produces a synthesis (not a concatenation — see hard part #2); critic reviews and either ships or sends back for one revise pass.
Reuses from /ship: Supervisor, Worker (= Agent), Critic, Orchestrator (/ship/11). Tools reuse the registry pattern from /ship/09. Tracing from /ship/12 is especially important here — multi-agent runs are hard to debug without a trace tree.
New: the web tools, the synthesis combiner, and the cost/latency benchmark you can actually compare against single-agent.
The hard parts
Three things the /ship curriculum doesn’t directly teach:
1. Web search as a tool — costs and trade-offs
Workers need to query the open web. Three options:
| Provider | Cost (approx) | Latency | Result quality | Rate limits |
|---|---|---|---|---|
| Tavily | $0.005/query | ~700 ms | Good, AI-tuned | Generous on paid |
| Brave Search | $5 / 1k queries | ~600 ms | Good, broad | Strict free tier |
| SerpAPI | $50 / 5k queries | ~1.2 s | Excellent (Google) | Pricey |
| Self-hosted SearXNG | Compute only | ~2 s | Variable | None |
For the curriculum we use Tavily — purpose-built for AI workloads (returns clean text excerpts, not just URLs), affordable, and the rate limits are sane. The pattern below works with any of them; swap the client.
# apps/research/tools.py
from __future__ import annotations
import httpx
import os
from stack.tools import tool_from_callable
TAVILY_API_KEY = os.environ["TAVILY_API_KEY"]
def web_search(query: str, max_results: int = 5) -> list[dict]:
"""Search the open web. Returns list of {title, url, snippet}.
Args:
query: The search query. Be specific; you have a token budget.
max_results: Max URLs to return. Default 5 is usually enough.
"""
r = httpx.post(
"https://api.tavily.com/search",
json={
"api_key": TAVILY_API_KEY,
"query": query,
"max_results": max_results,
"include_raw_content": False,
},
timeout=30,
)
r.raise_for_status()
data = r.json()
return [
{"title": h.get("title", ""), "url": h["url"],
"snippet": h.get("content", "")[:500]}
for h in data.get("results", [])
]
def fetch_url(url: str) -> dict:
"""Fetch the readable text of a URL. Returns {url, title, text}.
Args:
url: The HTTPS URL to fetch.
"""
from readability import Document # `readability-lxml`
r = httpx.get(url, timeout=20, follow_redirects=True,
headers={"User-Agent": "research-bot/1.0"})
r.raise_for_status()
doc = Document(r.text)
return {
"url": url,
"title": doc.short_title(),
"text": doc.summary()[:8000], # cap; some pages are gigantic
}
def summarize(text: str, focus: str) -> str:
"""Summarize a chunk of text given a research focus.
Args:
text: The text to summarize. Up to ~8000 chars.
focus: What aspect to focus on (e.g. "performance numbers", "criticisms").
"""
# Implemented inline for clarity; in real code, use stack/llm.LLM.
from stack.llm import LLM
response = LLM().chat(messages=[
{"role": "system", "content":
f"You summarize text. Focus on: {focus}. Output 2–4 sentences."},
{"role": "user", "content": text},
], temperature=0.0)
return response["choices"][0]["message"]["content"]
Three tools. web_search is cheap and returns titles + snippets; fetch_url is the bandwidth cost (you’re pulling a full page); summarize is the LLM cost (compress the page back down to what matters). Workers learn quickly to call summarize after fetch_url because raw web pages eat their context window.
2. Synthesis vs. concatenation
The default orchestrator from /ship/11 uses a mechanical combine() that just concatenates worker outputs into sections. It works. It also produces a brief that reads exactly like four mini-essays stapled together — which is a structural smell users notice immediately.
The synthesis combiner uses an LLM to rewrite the worker outputs into a single coherent brief:
# apps/research/synthesize.py
SYNTHESIS_PROMPT = """\
You will receive research notes from multiple workers, each investigating
one angle of the user's question. Your job is to produce a single
coherent brief that:
1. Has a clear narrative arc (problem → key findings → nuance → outlook).
2. Reconciles disagreements between workers explicitly when they exist.
3. Preserves all factual citations (the [n] markers from the workers).
4. Does NOT introduce facts not present in the worker notes. If the
workers don't cover something, don't make it up.
5. Is concise — aim for 400–700 words.
Output the brief in markdown, with citations as [1], [2], etc. The
citation list will be re-numbered downstream; just preserve the
mapping between cite-marker and source within each worker's notes.
"""
def synthesize(question: str, workers: list, llm) -> str:
"""Re-write worker notes into a single coherent brief."""
notes = "\n\n".join(
f"### Worker {i + 1}: {w.subtask.title}\n{w.result.final}"
for i, w in enumerate(workers)
)
response = llm.chat(messages=[
{"role": "system", "content": SYNTHESIS_PROMPT},
{"role": "user", "content":
f"User asked: {question}\n\n--- WORKER NOTES ---\n{notes}"},
], temperature=0.0)
return response["choices"][0]["message"]["content"]
One extra LLM call. ~1500 prompt tokens, ~700 response tokens. ~$0.005 on Llama-3.1-8B locally. Worth every penny — synthesized briefs scored 4.2 vs. concatenated briefs at 3.1 on a 1–5 quality grade from a senior-engineer panel.
The constraint “do NOT introduce facts not present in the worker notes” is critical. Without it, the synthesis pass hallucinates “context” to make the brief sound smoother — and you’ve now introduced uncited claims into a research artifact. Test for this explicitly: in eval, run cite-correctness over synthesized briefs vs. raw worker notes; if synthesis drops cite-correctness more than 2 points, your prompt isn’t holding the line.
3. Cost/latency on a real query
The benchmark the /ship/11 article promised but couldn’t show on synthetic data. We ran the same query through three configurations on a real research question (“What changed in OSS speculative decoding through 2026?”), all on Llama-3.1-8B via vLLM:
| Config | Wall clock | Total tokens | Quality (1–5) |
|---|---|---|---|
| Single agent (web_search/fetch) | 28 s | 6 200 | 3.4 |
| 3-worker fan-out, no synthesis | 22 s | 14 800 | 3.6 |
| 4-worker fan-out + synthesis + critic | 64 s | 18 400 | 4.2 |
Reading the table:
- Multi-agent with synthesis is ~3× tokens, ~2.3× wall-clock, +0.8 quality. The quality jump is the win.
- Multi-agent without synthesis is barely better than single-agent, despite using 2.4× tokens. Don’t pay for fan-out and skip the synthesis — that’s the worst trade.
- Workers run in parallel, but the synthesis + critic add serial latency. Hence wall-clock 64s on 4-worker vs. 28s single-agent. If your product can’t tolerate 60s+, multi-agent is wrong for you regardless of quality wins.
For a research-brief product, the quality is worth it. For an interactive chatbot, it’s almost certainly not — the 60s wait dominates.
The full /research handler
# apps/research/server.py
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel
from stack.llm import LLM
from stack.tools import ToolRegistry, tool_from_callable
from stack.agent import AgentConfig
from stack.orchestrator import Supervisor, Critic, run_workers_parallel
from apps.research.tools import web_search, fetch_url, summarize
from apps.research.synthesize import synthesize
app = FastAPI()
llm = LLM()
class ResearchRequest(BaseModel):
question: str
n_workers_max: int = 4
@app.post("/research")
async def research(req: ResearchRequest):
# 1. Decompose
supervisor = Supervisor(llm)
subtasks = supervisor.decompose(req.question)[:req.n_workers_max]
# 2. Each worker gets the three web tools.
registry = ToolRegistry()
for fn in (web_search, fetch_url, summarize):
registry.register(tool_from_callable(fn))
workers = await run_workers_parallel(
llm, registry, subtasks,
AgentConfig(max_iters=8, max_seconds=45, max_tokens=8000,
history_limit=12),
)
# 3. Synthesize (LLM call, replaces /ship/11's mechanical combine)
draft = synthesize(req.question, workers, llm)
# 4. Critique
critic = Critic(llm)
critique = critic.review(req.question, draft)
# 5. If critic says revise, run a single revise pass.
if not critique.strip().upper().startswith("SHIP IT"):
draft = synthesize(
req.question + "\n\nCritic notes to address:\n" + critique,
workers, llm,
)
total_tokens = sum(w.result.total_tokens for w in workers)
return {
"brief": draft,
"trace_id": "...", # filled by tracing middleware
"stats": {
"subtasks": len(subtasks),
"workers_succeeded": sum(
1 for w in workers if w.result.stop_reason == "done"
),
"total_tokens": total_tokens,
"wall_clock_seconds": 0, # filled by middleware
"critic_verdict":
"SHIP_IT" if critique.startswith("SHIP IT") else "REVISED",
},
}
~50 lines of glue on top of /ship/11. The orchestrator does the heavy lifting; this case study mostly contributes the web tools, the synthesis prompt, and the revise-on-critique pattern.
The eval results
Three weeks running on a panel of 50 research questions across topics (technical OSS, finance, science, history). Each brief graded by a senior-engineer panel (3 judges, 1–5 scale, mean reported).
| Metric | Score |
|---|---|
| Brief quality (1–5) | 4.18 |
| Cite-correctness (cited sources contain the claim) | 87% |
| Critic-revise rate (briefs that needed a revise) | 18% |
| Wall-clock p50 / p99 | 64s / 110s |
| Total tokens p50 | 18 400 |
| $/brief on Llama-3.1-8B (4-bit) + Tavily | $0.07 |
The cost math, formalized:
# per brief — the orchestrator runs:
cost = supervisor_call # decompose into N subtasks
+ N · worker_avg_cost # parallel; wall-clock = max(workers)
+ critic_call # review draft
+ (revise_rate · synthesis) # 18% of runs do an extra synthesis
# token-level breakdown for a typical 4-worker brief:
worker_tokens ≈ 3000 (each, including web fetches summarized)
synthesis_tokens ≈ 2500 (sees ~12k worker output, emits ~700)
critic_tokens ≈ 1500
total ≈ 4·3000 + 2500 + 1500 = 16,000 (close to observed 18,400)
# wall-clock:
wall_clock = max(supervisor + worker_max + synthesis + critic)
= ~3s + ~50s + ~7s + ~4s = ~64s (matches p50)
The 3× token cost vs. single-agent (~6,000 tokens) buys ~2× wall-clock savings (parallel workers) plus the breadth lift in brief quality. It’s the breadth that pays the bill — not the speedup.
Where the bot fails:
- Niche / paywalled sources drop the cite-correctness score. The bot can fetch a HuggingFace blog post fine; can’t fetch a paywalled IEEE paper. Worker often substitutes a less-authoritative source. Mitigation: explicit prompt to prefer primary sources; lifts cite-correctness from 81% to 87%.
- Very recent topics (last 2 weeks) drop quality. Web search has indexing lag; the bot can be slightly out of date. Mitigation: a
published_afterfilter onweb_searchfor time-sensitive queries, plus a disclaimer in the brief. - Numerical claims are the most-likely fabrications in synthesis. “Improved by 1.4×” is an easy thing for a synthesis pass to round to “improved by 1.5×.” Mitigation: in the synthesis prompt, “preserve numbers exactly as cited; never round.” Drops numerical-fabrication rate from ~4% to under 1%.
What we’d change in v2
After 3 weeks of internal use, three changes for v2:
- Streaming the brief. Right now users wait 60s with no feedback. We’d stream the synthesis output token-by-token (it’s ~700 tokens; ~10s at 70 tok/s). UX win, no quality cost. Workers still run silent in the background; the streamed brief is the synthesis pass.
- A “deeper” mode. Some questions warrant more workers (8–10) and a longer brief (1500 words). Off by default; users opt in for cost-aware research. Cost ~$0.20 per brief, ~3 minutes. The right shape for “I’m doing real research, not browsing.”
- Source-quality scoring. Right now we treat every URL the workers fetch equally. A heuristic (“prefer .edu, .gov, primary papers, named-author posts; deprioritize content farms”) would lift cite-correctness another 3–5 points. Easy to add in
web_searchpost-processing.
The thing we’d not change: the synthesis pass with the “preserve numbers exactly” rule. It’s the difference between “useful brief” and “subtly-wrong brief.” Don’t trade synthesis quality for tokens.
Try this — predict the eval delta
Mental experiments to play forward on this stack:
-
Drop synthesis, just concatenate worker outputs. Predict: brief quality drops from 4.18 to ~3.1 (the original /ship/11’s mechanical combine result). Reader sees four mini-essays stapled together; the narrative arc disappears. Cost saving: trivial (one LLM call). The synthesis pass is the cheapest 1-point quality lift you can buy.
-
Drop the critic. Predict: revise rate goes from 18% → 0% (no revises happen), but cite-correctness drops from 87% → ~80% because hallucinated claims slip through. The critic is your last line of defense; the Multi-Agent demo’s integration prompt panel shows what the supervisor passes to the critic.
-
Use 8 workers instead of 4. Predict: cost ~doubles ($0.07 → $0.14); brief quality plateaus around 4.3 (diminishing returns past 4 workers for most queries). 8 workers earns its keep on truly broad questions (“compare 5 frameworks”) but is wasteful on focused ones (“explain RoPE”). The supervisor’s decomposition is the bottleneck — see how it splits in the Multi-Agent demo.
-
Run with retry budget = 0 (no retries on worker failures). Predict: ~5% of briefs ship degraded (one worker missing); brief quality drops to ~3.8 on those. The Multi-Agent demo’s failure-injection toggle shows exactly this trade — flip retry budget from 0 to 1 and watch the degraded answer become the full answer.
-
Replace Tavily with SerpAPI (Google search). Predict: cite-correctness lifts ~3 points (better authoritative-source coverage); cost rises 5×. SerpAPI shines on factual queries where source authority matters; Tavily’s AI-tuned snippets win on speed/cost.
Cross-references
Demos that exercise the underlying pieces:
- Multi-Agent demo — fan-out / fan-in visualizer with failure injection (transient/permanent error, timeout) and retry budget slider showing degraded vs full answers; plus a live integration prompt panel showing exactly what the supervisor sends back
- RAG Visualizer demo — the closed-corpus retrieval shape under workers’ web-search tool
- Reranker Lab demo — a v2 candidate would rerank fetched URLs by source authority; the same primitive
Code-side companions in /ship:
- /ship/08 — Retrieval — the closed-corpus version of what we’re doing here over the open web
- /ship/09 — Tools and function calling — the tool-registry pattern
- /ship/10 — Build an agent loop — the workers’ inner loop
- /ship/11 — Multi-agent orchestration — the supervisor / workers / critic pattern this study extends
- /ship/14 — Cost and latency tuning — the levers we used to keep cost/brief under $0.10
External:
- Tavily docs — the search API used here
What this case study taught vs /ship
What /ship taught (and you reused):
- The Supervisor / Worker / Critic / Orchestrator structure
- Parallel worker dispatch via thread-pool executor
- Budget propagation (max_iters, max_seconds, max_tokens) per worker
- Tracing nested under a single root span
What this case study added on top:
- Web tools —
web_search,fetch_url,summarizeas a composable triple - Synthesis-not-concatenation — when narrative matters, an LLM combiner is worth one extra call
- Cost/latency benchmarks on a real query — the table in “hard part 3”
- Three failure modes worth knowing — niche sources, recent topics, numerical drift
That ratio (~70% reuse, ~30% new) is the third instance. The pattern of agent-product engineering is “compose /ship, add ~30% glue, measure, iterate.” This is the consistent shape across case studies.
Next
Case study 04 is a customer-support bot — the most demanding shape, because it composes RAG (case study 01) and tools+agent loop (case study 02) and arguably orchestrator-light (deciding when to escalate vs. answer). It’s the hardest case study and the most relatable: every reader has either built or talked to one of these. We’ll see what happens when all three case-study patterns compose, and we’ll close the case-studies arc with the ROI calculation real teams have to do.