case-studies 03 / 05 · builds on /ship/08, 09, 10, 11, 12, 14 · 28 min read · 1h hands-on

case study 03 · composes the /ship stack

Research assistant

Multi-agent fan-out for cited briefs. The product /ship/11 wants to be — with the cost/latency trade-offs in real numbers.

The product

A /research endpoint. You POST a question. 60–90 seconds later you get:

{
  "brief": "Markdown text, ~600 words, with [n] citations …",
  "citations": [
    {"n": 1, "url": "https://blog.langchain.dev/...", "title": "..."},
    {"n": 2, "url": "https://arxiv.org/abs/...", "title": "..."},
    ...
  ],
  "trace_id": "...",
  "stats": {
    "subtasks": 4,
    "workers_succeeded": 4,
    "total_tokens": 18420,
    "wall_clock_seconds": 64.2,
    "critic_verdict": "SHIP_IT"
  }
}

The brief looks like this (truncated):

## What's the state of speculative decoding for OSS models in 2026?

Speculative decoding is now production-ready for the major OSS instruct
families. vLLM 0.6+ ships native support; the standard pattern is a
small (1B–3B) draft model alongside a larger (8B–70B) target [1] [2].

**Latest measurements.** Llama-3.1-8B with Llama-3.2-1B as draft yields
a 1.4–1.8× throughput multiplier on common Q&A workloads [1]; gains
shrink to ~1.2× on creative-generation workloads where draft accuracy
falls [3]. The verification step preserves the target's distribution
exactly — speculative decoding is loss-free in expectation [4].

**What changed in 2026.** Two things: (a) tree-based speculation, where
the draft proposes a tree of K candidates verified in parallel, lifts
gains to 2.0–2.5× on benchmarks [5]…

Five-paragraph brief, six to ten citations, all clickable to actual sources. Useful for a literature review, a competitive scan, or “what’s the current state of X.” The kind of artifact a smart engineer would produce in 30 minutes; we ship it in 60–90 seconds for a few cents.

This is the product shape behind Perplexity’s deep research, Elicit’s literature reviews, and the “research” mode in every consumer AI product. The /ship orchestrator handles 70% of the work; this case study is about the 30% the curriculum doesn’t directly answer.

Architecture

       /research request


       ┌──────────────┐
       │  Supervisor  │  (one LLM call: question → 3–5 subtasks)
       └──────┬───────┘

       ┌──────┴──────┬──────────────┬──────────────┐
       ▼             ▼              ▼              ▼
  ┌─────────┐  ┌─────────┐    ┌─────────┐    ┌─────────┐
  │Worker 1 │  │Worker 2 │    │Worker 3 │    │Worker 4 │   each = /ship/10 Agent
  │ web_     │  │ web_    │    │ web_    │    │ web_    │   with web_search,
  │ search   │  │ search  │    │ search  │    │ search  │   fetch_url, summarize
  │ fetch    │  │ fetch   │    │ fetch   │    │ fetch   │   tools
  └────┬────┘  └────┬────┘    └────┬────┘    └────┬────┘
       │            │              │              │
       └────────────┴──────┬───────┴──────────────┘
                           │   fan-in
                  ┌────────▼─────────┐
                  │  Combiner (LLM)  │  synthesize, NOT just concatenate
                  └────────┬─────────┘

                  ┌────────▼─────────┐
                  │  Critic (LLM)    │  fact-check + recommend revisions
                  └────────┬─────────┘

                       brief + cites

Built directly on the orchestrator from /ship/11. The supervisor decomposes; workers fan out in parallel; combiner produces a synthesis (not a concatenation — see hard part #2); critic reviews and either ships or sends back for one revise pass.

Reuses from /ship: Supervisor, Worker (= Agent), Critic, Orchestrator (/ship/11). Tools reuse the registry pattern from /ship/09. Tracing from /ship/12 is especially important here — multi-agent runs are hard to debug without a trace tree.
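
The /ship curriculum describes worker dispatch as a thread-pool fan-out. As a mental model — a hypothetical sketch, not the real `run_workers_parallel`, which also threads budgets and tracing through — the shape is roughly:

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(subtasks, run_one):
    """Run one worker per subtask on a thread pool and collect results.

    Failures are captured per-worker so one bad subtask degrades the
    brief instead of killing the whole request.
    """
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        futures = [pool.submit(run_one, st) for st in subtasks]
        results = []
        for fut in futures:
            try:
                results.append(fut.result())
            except Exception as exc:   # worker failed; record and move on
                results.append(exc)
        return results
```

The exception-per-worker behavior is what the degraded-answer path in the Multi-Agent demo relies on: a missing worker shrinks the brief, it doesn't 500 the request.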

New: the web tools, the synthesis combiner, and the cost/latency benchmark you can actually compare against single-agent.

The hard parts

Three things the /ship curriculum doesn’t directly teach:

1. Web search as a tool — costs and trade-offs

Workers need to query the open web. Three options:

| Provider | Cost (approx) | Latency | Result quality | Rate limits |
|---|---|---|---|---|
| Tavily | $0.005/query | ~700 ms | Good, AI-tuned | Generous on paid |
| Brave Search | $5 / 1k queries | ~600 ms | Good, broad | Strict free tier |
| SerpAPI | $50 / 5k queries | ~1.2 s | Excellent (Google) | Pricey |
| Self-hosted SearXNG | Compute only | ~2 s | Variable | None |

For the curriculum we use Tavily — purpose-built for AI workloads (returns clean text excerpts, not just URLs), affordable, and the rate limits are sane. The pattern below works with any of them; swap the client.

# apps/research/tools.py
from __future__ import annotations
import httpx
import os
from stack.tools import tool_from_callable


TAVILY_API_KEY = os.environ["TAVILY_API_KEY"]


def web_search(query: str, max_results: int = 5) -> list[dict]:
    """Search the open web. Returns list of {title, url, snippet}.

    Args:
        query: The search query. Be specific; you have a token budget.
        max_results: Max URLs to return. Default 5 is usually enough.
    """
    r = httpx.post(
        "https://api.tavily.com/search",
        json={
            "api_key": TAVILY_API_KEY,
            "query": query,
            "max_results": max_results,
            "include_raw_content": False,
        },
        timeout=30,
    )
    r.raise_for_status()
    data = r.json()
    return [
        {"title": h.get("title", ""), "url": h["url"],
         "snippet": h.get("content", "")[:500]}
        for h in data.get("results", [])
    ]


def fetch_url(url: str) -> dict:
    """Fetch the readable text of a URL. Returns {url, title, text}.

    Args:
        url: The HTTPS URL to fetch.
    """
    from readability import Document    # `readability-lxml`
    from lxml import html as lxml_html  # dependency of readability-lxml
    r = httpx.get(url, timeout=20, follow_redirects=True,
                  headers={"User-Agent": "research-bot/1.0"})
    r.raise_for_status()
    doc = Document(r.text)
    # Document.summary() returns cleaned HTML; strip tags to plain text
    # so the docstring's promise of "readable text" actually holds.
    text = lxml_html.fromstring(doc.summary()).text_content()
    return {
        "url": url,
        "title": doc.short_title(),
        "text": text[:8000],   # cap; some pages are gigantic
    }


def summarize(text: str, focus: str) -> str:
    """Summarize a chunk of text given a research focus.

    Args:
        text: The text to summarize. Up to ~8000 chars.
        focus: What aspect to focus on (e.g. "performance numbers", "criticisms").
    """
    # A plain chat call via the stack's shared LLM client (stack/llm.LLM).
    from stack.llm import LLM
    response = LLM().chat(messages=[
        {"role": "system", "content":
            f"You summarize text. Focus on: {focus}. Output 2–4 sentences."},
        {"role": "user", "content": text},
    ], temperature=0.0)
    return response["choices"][0]["message"]["content"]

Three tools. web_search is cheap and returns titles + snippets; fetch_url is the bandwidth cost (you’re pulling a full page); summarize is the LLM cost (compress the page back down to what matters). Workers learn quickly to call summarize after fetch_url because raw web pages eat their context window.
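
The 8,000-char cap in fetch_url is a blunt instrument. A slightly gentler variant — a hypothetical helper, not part of the stack — clips at a paragraph boundary so summarize sees whole paragraphs rather than a mid-sentence cut:

```python
def clip_for_context(text: str, max_chars: int = 8000) -> str:
    """Clip page text at the last paragraph break before max_chars,
    falling back to a hard cut when no break exists."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind("\n\n", 0, max_chars)
    return text[:cut] if cut > 0 else text[:max_chars]
```

Cheap insurance: summaries of half-sentences are where tool chains quietly lose facts.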

2. Synthesis vs. concatenation

The default orchestrator from /ship/11 uses a mechanical combine() that just concatenates worker outputs into sections. It works. It also produces a brief that reads exactly like four mini-essays stapled together — which is a structural smell users notice immediately.

The synthesis combiner uses an LLM to rewrite the worker outputs into a single coherent brief:

# apps/research/synthesize.py
SYNTHESIS_PROMPT = """\
You will receive research notes from multiple workers, each investigating
one angle of the user's question. Your job is to produce a single
coherent brief that:

1. Has a clear narrative arc (problem → key findings → nuance → outlook).
2. Reconciles disagreements between workers explicitly when they exist.
3. Preserves all factual citations (the [n] markers from the workers).
4. Does NOT introduce facts not present in the worker notes. If the
   workers don't cover something, don't make it up.
5. Is concise — aim for 400–700 words.

Output the brief in markdown, with citations as [1], [2], etc. The
citation list will be re-numbered downstream; just preserve the
mapping between cite-marker and source within each worker's notes.
"""


def synthesize(question: str, workers: list, llm) -> str:
    """Re-write worker notes into a single coherent brief."""
    notes = "\n\n".join(
        f"### Worker {i + 1}: {w.subtask.title}\n{w.result.final}"
        for i, w in enumerate(workers)
    )
    response = llm.chat(messages=[
        {"role": "system", "content": SYNTHESIS_PROMPT},
        {"role": "user", "content":
            f"User asked: {question}\n\n--- WORKER NOTES ---\n{notes}"},
    ], temperature=0.0)
    return response["choices"][0]["message"]["content"]

One extra LLM call. ~1500 prompt tokens, ~700 response tokens. ~$0.005 on Llama-3.1-8B locally. Worth every penny — synthesized briefs scored 4.2 vs. concatenated briefs at 3.1 on a 1–5 quality grade from a senior-engineer panel.

The constraint “do NOT introduce facts not present in the worker notes” is critical. Without it, the synthesis pass hallucinates “context” to make the brief sound smoother — and you’ve now introduced uncited claims into a research artifact. Test for this explicitly: in eval, run cite-correctness over synthesized briefs vs. raw worker notes; if synthesis drops cite-correctness more than 2 points, your prompt isn’t holding the line.
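
A cheap mechanical guard for the "no new facts" constraint — a sketch that assumes the [n] cite-marker format shown above — is to diff the markers before and after synthesis. Markers appearing only in the brief are a red flag for invented claims; markers appearing only in the notes are findings the synthesis silently dropped:

```python
import re

CITE = re.compile(r"\[(\d+)\]")


def cite_marker_diff(worker_notes: str, brief: str) -> dict:
    """Compare [n] markers in the synthesized brief against the notes."""
    before = set(CITE.findall(worker_notes))
    after = set(CITE.findall(brief))
    return {
        "invented": sorted(after - before, key=int),   # cite no worker produced
        "dropped": sorted(before - after, key=int),    # finding synthesis lost
    }
```

Run it in CI alongside the eval; a non-empty "invented" list should fail the run before a human ever grades the brief.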

3. Cost/latency on a real query

The benchmark the /ship/11 article promised but couldn’t show on synthetic data. We ran the same query through three configurations on a real research question (“What changed in OSS speculative decoding through 2026?”), all on Llama-3.1-8B via vLLM:

| Config | Wall clock | Total tokens | Quality (1–5) |
|---|---|---|---|
| Single agent (web_search/fetch) | 28 s | 6,200 | 3.4 |
| 3-worker fan-out, no synthesis | 22 s | 14,800 | 3.6 |
| 4-worker fan-out + synthesis + critic | 64 s | 18,400 | 4.2 |

Reading the table:

  • Multi-agent with synthesis is ~3× tokens, ~2.3× wall-clock, +0.8 quality. The quality jump is the win.
  • Multi-agent without synthesis is barely better than single-agent, despite using 2.4× tokens. Don’t pay for fan-out and skip the synthesis — that’s the worst trade.
  • Workers run in parallel, but the synthesis + critic add serial latency. Hence wall-clock 64s on 4-worker vs. 28s single-agent. If your product can’t tolerate 60s+, multi-agent is wrong for you regardless of quality wins.

For a research-brief product, the quality is worth it. For an interactive chatbot, it’s almost certainly not — the 60s wait dominates.

The full /research handler

# apps/research/server.py
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

from stack.llm import LLM
from stack.tools import ToolRegistry, tool_from_callable
from stack.agent import AgentConfig
from stack.orchestrator import Supervisor, Critic, run_workers_parallel
from apps.research.tools import web_search, fetch_url, summarize
from apps.research.synthesize import synthesize


app = FastAPI()
llm = LLM()


class ResearchRequest(BaseModel):
    question: str
    n_workers_max: int = 4


@app.post("/research")
async def research(req: ResearchRequest):
    # 1. Decompose
    supervisor = Supervisor(llm)
    subtasks = supervisor.decompose(req.question)[:req.n_workers_max]

    # 2. Each worker gets the three web tools.
    registry = ToolRegistry()
    for fn in (web_search, fetch_url, summarize):
        registry.register(tool_from_callable(fn))

    workers = await run_workers_parallel(
        llm, registry, subtasks,
        AgentConfig(max_iters=8, max_seconds=45, max_tokens=8000,
                    history_limit=12),
    )

    # 3. Synthesize (LLM call, replaces /ship/11's mechanical combine)
    draft = synthesize(req.question, workers, llm)

    # 4. Critique
    critic = Critic(llm)
    critique = critic.review(req.question, draft)

    # 5. If critic says revise, run a single revise pass.
    if not critique.strip().upper().startswith("SHIP IT"):
        draft = synthesize(
            req.question + "\n\nCritic notes to address:\n" + critique,
            workers, llm,
        )

    total_tokens = sum(w.result.total_tokens for w in workers)

    return {
        "brief": draft,
        "trace_id": "...",   # filled by tracing middleware
        "stats": {
            "subtasks": len(subtasks),
            "workers_succeeded": sum(
                1 for w in workers if w.result.stop_reason == "done"
            ),
            "total_tokens": total_tokens,
            "wall_clock_seconds": 0,   # filled by middleware
            "critic_verdict":
                "SHIP_IT" if critique.strip().upper().startswith("SHIP IT")
                else "REVISED",
        },
    }

~50 lines of glue on top of /ship/11. The orchestrator does the heavy lifting; this case study mostly contributes the web tools, the synthesis prompt, and the revise-on-critique pattern.

The eval results

Three weeks running on a panel of 50 research questions across topics (technical OSS, finance, science, history). Each brief graded by a senior-engineer panel (3 judges, 1–5 scale, mean reported).

| Metric | Score |
|---|---|
| Brief quality (1–5) | 4.18 |
| Cite-correctness (cited sources contain the claim) | 87% |
| Critic-revise rate (briefs that needed a revise) | 18% |
| Wall-clock p50 / p99 | 64 s / 110 s |
| Total tokens p50 | 18,400 |
| $/brief on Llama-3.1-8B (4-bit) + Tavily | $0.07 |

The cost math, formalized:

# per brief — the orchestrator runs:
cost = supervisor_call           # decompose into N subtasks
     + N · worker_avg_cost       # parallel; wall-clock = max(workers)
     + critic_call               # review draft
     + (revise_rate · synthesis) # 18% of runs do an extra synthesis

# token-level breakdown for a typical 4-worker brief:
worker_tokens   ≈ 3000  (each, including web fetches summarized)
synthesis_tokens ≈ 2500  (sees ~12k worker output, emits ~700)
critic_tokens   ≈ 1500
total           ≈ 4·3000 + 2500 + 1500 = 16,000 (close to observed 18,400)

# wall-clock: serial stages, plus the slowest parallel worker
wall_clock = supervisor + max(worker_times) + synthesis + critic
           = ~3s + ~50s + ~7s + ~4s = ~64s (matches p50)

The 3× token cost vs. single-agent (~6,200 tokens) buys wall-clock savings only relative to running the same four workers serially — against the single agent, latency more than doubles. It's the breadth lift in brief quality that pays the bill, not the speedup.
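
The same arithmetic as a runnable model (the numbers are the estimates above; the revise-rate term amortizes the 18% of runs that pay for a second synthesis):

```python
def brief_token_estimate(n_workers=4, worker_tokens=3000,
                         synthesis_tokens=2500, critic_tokens=1500,
                         revise_rate=0.18):
    """Expected tokens per brief, matching the breakdown above."""
    base = n_workers * worker_tokens + synthesis_tokens + critic_tokens
    return base + revise_rate * synthesis_tokens  # amortized revise pass


def brief_wall_clock(supervisor=3, worker_max=50, synthesis=7, critic=4):
    """Serial stages plus the slowest parallel worker, in seconds."""
    return supervisor + worker_max + synthesis + critic
```

Useful as a planning tool: plug in your own worker count and token prices before committing to the multi-agent shape.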

Where the bot fails:

  • Niche / paywalled sources drop the cite-correctness score. The bot can fetch a Hugging Face blog post fine; it can't fetch a paywalled IEEE paper, so the worker often substitutes a less-authoritative source. Mitigation: an explicit prompt to prefer primary sources, which lifts cite-correctness from 81% to 87%.
  • Very recent topics (last 2 weeks) drop quality. Web search has indexing lag; the bot can be slightly out of date. Mitigation: a published_after filter on web_search for time-sensitive queries, plus a disclaimer in the brief.
  • Numerical claims are the most-likely fabrications in synthesis. “Improved by 1.4×” is an easy thing for a synthesis pass to round to “improved by 1.5×.” Mitigation: in the synthesis prompt, “preserve numbers exactly as cited; never round.” Drops numerical-fabrication rate from ~4% to under 1%.
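
The "preserve numbers exactly" rule can also be checked mechanically — a sketch, with a regex that only covers the multiplier/percentage formats seen in these briefs: extract numeric tokens from the brief and flag any that appear in no worker note.

```python
import re

# Matches plain numbers plus the ×/x/% suffixes common in benchmark claims.
NUMBER = re.compile(r"\d+(?:\.\d+)?\s*(?:×|x|%)?")


def fabricated_numbers(worker_notes: str, brief: str) -> list[str]:
    """Numeric claims in the brief that appear in no worker note."""
    source = {n.strip() for n in NUMBER.findall(worker_notes)}
    return [n.strip() for n in NUMBER.findall(brief)
            if n.strip() not in source]
```

It catches exactly the 1.4× → 1.5× rounding failure described above, at zero LLM cost.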

What we’d change in v2

After 3 weeks of internal use, three changes for v2:

  1. Streaming the brief. Right now users wait 60s with no feedback. We’d stream the synthesis output token-by-token (it’s ~700 tokens; ~10s at 70 tok/s). UX win, no quality cost. Workers still run silent in the background; the streamed brief is the synthesis pass.
  2. A “deeper” mode. Some questions warrant more workers (8–10) and a longer brief (1500 words). Off by default; users opt in for cost-aware research. Cost ~$0.20 per brief, ~3 minutes. The right shape for “I’m doing real research, not browsing.”
  3. Source-quality scoring. Right now we treat every URL the workers fetch equally. A heuristic (“prefer .edu, .gov, primary papers, named-author posts; deprioritize content farms”) would lift cite-correctness another 3–5 points. Easy to add in web_search post-processing.
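
The v2 source-quality scoring can start as a few lines of post-processing on web_search results. The tier list below is a hypothetical starting point to be tuned against eval, not a vetted authority ranking:

```python
from urllib.parse import urlparse


def source_score(url: str) -> int:
    """Crude authority score: higher = prefer. Tune against eval."""
    host = urlparse(url).netloc.lower()
    if host.endswith((".edu", ".gov")) or host.endswith("arxiv.org"):
        return 3                      # primary / institutional sources
    if host.endswith((".org", ".io")) or host.endswith("github.com"):
        return 2                      # named projects, foundations
    return 1                          # everything else


def rank_results(results: list[dict]) -> list[dict]:
    """Reorder web_search hits by source authority, stable within tiers."""
    return sorted(results, key=lambda h: -source_score(h["url"]))
```

Dropping this into web_search's return path changes which URLs workers fetch first, which is where the cite-correctness lift would come from.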

The thing we wouldn't change: the synthesis pass with the "preserve numbers exactly" rule. It's the difference between "useful brief" and "subtly wrong brief." Don't trade synthesis quality for tokens.

Try this — predict the eval delta

Mental experiments to play forward on this stack:

  1. Drop synthesis, just concatenate worker outputs. Predict: brief quality drops from 4.18 to ~3.1 (the original /ship/11’s mechanical combine result). Reader sees four mini-essays stapled together; the narrative arc disappears. Cost saving: trivial (one LLM call). The synthesis pass is the cheapest 1-point quality lift you can buy.

  2. Drop the critic. Predict: revise rate goes from 18% → 0% (no revises happen), but cite-correctness drops from 87% → ~80% because hallucinated claims slip through. The critic is your last line of defense; the Multi-Agent demo’s integration prompt panel shows what the supervisor passes to the critic.

  3. Use 8 workers instead of 4. Predict: cost ~doubles ($0.07 → $0.14); brief quality plateaus around 4.3 (diminishing returns past 4 workers for most queries). 8 workers earns its keep on truly broad questions (“compare 5 frameworks”) but is wasteful on focused ones (“explain RoPE”). The supervisor’s decomposition is the bottleneck — see how it splits in the Multi-Agent demo.

  4. Run with retry budget = 0 (no retries on worker failures). Predict: ~5% of briefs ship degraded (one worker missing); brief quality drops to ~3.8 on those. The Multi-Agent demo’s failure-injection toggle shows exactly this trade — flip retry budget from 0 to 1 and watch the degraded answer become the full answer.

  5. Replace Tavily with SerpAPI (Google search). Predict: cite-correctness lifts ~3 points (better authoritative-source coverage); cost rises 5×. SerpAPI shines on factual queries where source authority matters; Tavily’s AI-tuned snippets win on speed/cost.

Cross-references

Demos that exercise the underlying pieces:

  • Multi-Agent demo — fan-out / fan-in visualizer with failure injection (transient/permanent error, timeout) and retry budget slider showing degraded vs full answers; plus a live integration prompt panel showing exactly what the supervisor sends back
  • RAG Visualizer demo — the closed-corpus retrieval shape under workers’ web-search tool
  • Reranker Lab demo — a v2 candidate would rerank fetched URLs by source authority; the same primitive


What this case study taught vs /ship

What /ship taught (and you reused):

  • The Supervisor / Worker / Critic / Orchestrator structure
  • Parallel worker dispatch via thread-pool executor
  • Budget propagation (max_iters, max_seconds, max_tokens) per worker
  • Tracing nested under a single root span

What this case study added on top:

  • Web tools — web_search, fetch_url, summarize as a composable triple
  • Synthesis-not-concatenation — when narrative matters, an LLM combiner is worth one extra call
  • Cost/latency benchmarks on a real query — the table in “hard part 3”
  • Three failure modes worth knowing — niche sources, recent topics, numerical drift

That ratio (~70% reuse, ~30% new) now holds for the third case study in a row. The pattern of agent-product engineering is "compose /ship, add ~30% glue, measure, iterate" — the consistent shape across these case studies.

Next

Case study 04 is a customer-support bot — the most demanding shape, because it composes RAG (case study 01) and tools+agent loop (case study 02) and arguably orchestrator-light (deciding when to escalate vs. answer). It’s the hardest case study and the most relatable: every reader has either built or talked to one of these. We’ll see what happens when all three case-study patterns compose, and we’ll close the case-studies arc with the ROI calculation real teams have to do.