case study 01 · composes the /ship stack
Docs assistant with citations
RAG over a real corpus, strict citation requirements, zero hallucination tolerance. The product /ship/06–08 wants to be.
The product
A /ask endpoint. You POST a question. You get back JSON:
{
"answer": "To use environment variables in Astro, define them in `.env` and access them via `import.meta.env.VAR_NAME` in your components [docs:env-vars/2]. Variables prefixed with `PUBLIC_` are exposed to the client; everything else is server-only [docs:env-vars/3].",
"citations": [
{"id": "docs:env-vars/2", "url": "/guides/environment-variables/#using-environment-variables", "score": 0.86},
{"id": "docs:env-vars/3", "url": "/guides/environment-variables/#default-environment-variables", "score": 0.79}
],
"refused": false,
"trace_id": "5a1c..."
}
Or, when the docs don’t cover it:
{
"answer": "I can't answer this from the Astro docs. The docs cover the framework's official APIs and configuration; this question is about a third-party hosting setup that's outside scope.",
"citations": [],
"refused": true,
"trace_id": "5a1c..."
}
The non-negotiable: every factual claim is tied to a chunk ID, and every chunk ID is tied to a real source URL the user can click. No fabrication, no “based on common practices,” no plausible-sounding hallucination. If the corpus doesn’t cover it, we say so.
This is the product shape behind every doc-Q&A bot you’ve used (Vercel’s, Stripe’s, Tailwind’s, the GitHub Copilot Workspace docs assistant). It’s the most-built product on top of LLMs, period. The /ship retrieval stack handles 80% of the work; this case study is about the 20% the curriculum doesn’t directly answer.
Architecture
┌─────────────────┐
POST /ask ───→ │ FastAPI handler │ (stack/server.py from /ship/05)
└────────┬────────┘
│
┌────────▼────────┐
│ HybridRetriever │ (from /ship/08: BM25 + dense + rerank)
└────────┬────────┘
│ top-8 chunks with IDs
┌────────▼────────┐
│ cite-first LLM │ (LLM from /ship/05 + structured prompt)
│ prompt │ enforces: every claim → chunk ID
└────────┬────────┘
│ raw response
┌────────▼────────┐
│ citation parser │ validates IDs reference real chunks
│ & validator │ → refusal if claims lack citations
└────────┬────────┘
▼
response
Five components. Three are exact reuses from /ship — the FastAPI handler (/ship/05), the retriever (/ship/06–08), the LLM client (/ship/05). Two are new, and they’re the case-study work: the cite-first prompt and the citation parser/validator.
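The handler itself is thin glue over those pieces. A minimal sketch of the endpoint, assuming the /ship module layout and constructor signatures used elsewhere in this study (the module path `apps.docs_assistant.server` and the `HybridRetriever("data/docs.db")` call are illustrative):

```python
# apps/docs_assistant/server.py — illustrative sketch; the stack/ import paths
# and constructor arguments are assumptions about the /ship APIs.
from uuid import uuid4

from fastapi import FastAPI
from pydantic import BaseModel

from stack.llm import LLM                    # /ship/05 client (assumed API)
from stack.retrieve import HybridRetriever   # /ship/08 retriever (assumed API)
from apps.docs_assistant.answer import answer_question  # defined below

app = FastAPI()
retriever = HybridRetriever("data/docs.db")
llm = LLM()


class AskRequest(BaseModel):
    question: str


@app.post("/ask")
def ask(req: AskRequest) -> dict:
    result = answer_question(req.question, retriever=retriever, llm=llm)
    result["trace_id"] = uuid4().hex  # /ship/05 wires this through logging middleware in practice
    return result
```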
The corpus is the Astro repo’s docs/src/content/docs/ directory — about 380 MDX files, ~190K tokens after preprocessing. Real, current, MDX-heavy. Available with git clone https://github.com/withastro/docs.
The hard parts
Three problems the /ship curriculum doesn’t directly teach:
1. MDX-aware chunking
/ship/06 shipped three chunkers (RecursiveCharChunker, SentenceChunker, MarkdownChunker). All three work on raw text — they treat <Aside>...</Aside> as flowing prose. For Astro docs, that’s wrong:
<Aside type="tip">
You can pass a custom logger to `astro:assets` via the `logger` prop —
this is useful for debugging local image transformations.
</Aside>
If the chunker splits this in the middle, your retrieved chunk has half a JSX tag and the LLM gets confused. If it includes the <Aside type="tip"> opening tag verbatim in the embedding, you waste tokens on syntax that’s structural, not semantic.
The MDX-aware chunker:
# apps/docs_assistant/chunk_mdx.py
from __future__ import annotations
import re
from dataclasses import dataclass
from stack.chunk import Chunker, Chunk # protocol from /ship/06
# Strip frontmatter (--- ... ---) entirely; the title goes to chunk metadata.
_FRONTMATTER = re.compile(r"^---\n(.*?)\n---\n", re.DOTALL)
# Match <Component> ... </Component> blocks at top level. Conservative;
# won't unwrap nested components correctly, but Astro docs are flat enough.
_COMPONENT_BLOCK = re.compile(
r"<(\w+)([^>]*)>(.*?)</\1>", re.DOTALL,
)
# Match self-closing <Component /> tags.
_SELF_CLOSING = re.compile(r"<(\w+)([^/>]*)/>")
@dataclass
class MdxChunker:
"""Markdown-structural chunker that flattens MDX components.
Steps:
1. Strip frontmatter; extract `title` to metadata.
2. Replace <Component>...</Component> with a normalized form:
"[<Aside type=tip>] inner text [</Aside>]"
The brackets keep the structural cue without polluting the text.
3. Hand off to the regular MarkdownChunker on the cleaned text.
"""
target_tokens: int = 350
overlap_tokens: int = 60
def chunk(self, source: str, source_id: str) -> list[Chunk]:
# 1. Frontmatter
title = source_id
m = _FRONTMATTER.match(source)
if m:
title_match = re.search(r"^title:\s*(.+)$", m.group(1), re.MULTILINE)
if title_match:
title = title_match.group(1).strip().strip('"\'')
source = source[m.end():]
# 2. Flatten components
def replace_block(m: re.Match) -> str:
tag, attrs, inner = m.group(1), m.group(2).strip(), m.group(3)
# Keep type= attribute but drop everything else for cleanliness.
type_match = re.search(r'type=["\']?(\w+)["\']?', attrs)
tag_marker = f"[{tag}{' ' + type_match.group(1) if type_match else ''}]"
return f"\n\n{tag_marker}\n{inner.strip()}\n[/{tag}]\n\n"
cleaned = _COMPONENT_BLOCK.sub(replace_block, source)
cleaned = _SELF_CLOSING.sub("", cleaned) # drop self-closing tags
# 3. Standard markdown-structural chunking on cleaned text.
from stack.chunk import MarkdownChunker
base = MarkdownChunker(
target_tokens=self.target_tokens,
overlap_tokens=self.overlap_tokens,
)
chunks = base.chunk(cleaned, source_id)
for c in chunks:
c.metadata["doc_title"] = title
return chunks
Cheap implementation; effective enough for Astro docs. For Notion-export MDX or anything with deeply-nested components you’d need a real MDX AST parser (@mdx-js/mdx → AST → walker), but for most doc sites this regex layer earns its place.
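To see what the flattening buys you, a quick check on a toy document (the `Chunk.text` and `Chunk.metadata` fields used here are assumptions about the /ship/06 Chunk dataclass, matching how the chunker above uses them):

```python
# Sanity check for MdxChunker on a minimal MDX snippet.
sample = """---
title: Environment variables
---
## Using environment variables

<Aside type="tip">
Prefix a variable with `PUBLIC_` to expose it to client-side code.
</Aside>
"""

chunker = MdxChunker(target_tokens=350, overlap_tokens=60)
for c in chunker.chunk(sample, source_id="docs:env-vars"):
    print(c.metadata["doc_title"])  # "Environment variables" (pulled from frontmatter)
    print(c.text)                   # heading plus "[Aside tip] ... [/Aside]": tag flattened, prose kept
```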
2. Citation-first prompting
The default behavior of an instruction-tuned LLM, given retrieved chunks and a question, is to synthesize an answer that draws from the chunks but rephrases freely. That’s great for general Q&A. It’s a disaster for a docs assistant where every claim has to be traceable.
Two prompt techniques compose to fix this:
(a) Number the chunks and require IDs in brackets.
RAG_PROMPT = """\
You are a helpful Astro docs assistant. Answer questions using ONLY the
chunks provided below. Do NOT use prior knowledge. If the chunks don't
answer the question, say so explicitly.
CITATION RULES (mandatory):
- Every factual statement must be followed by a chunk-ID in brackets,
e.g. [docs:env-vars/2].
- A statement without a citation is forbidden — write nothing rather
than guess.
- If multiple chunks support a statement, cite all of them: [a/1] [a/3].
- The chunk IDs are the literal IDs shown in <chunk id="..."> tags below.
Don't invent IDs; don't paraphrase them.
REFUSAL RULES:
- If the chunks don't cover the question, output ONLY:
NO_ANSWER: <one-sentence reason>
- Refuse rather than partially-answer. A correct refusal is more
useful than a confident half-answer.
Chunks:
{chunks_block}
Question: {question}
"""
The citation contract is cheap because training data already paid for it: modern instruct models have seen millions of citation-style examples, so they follow this format almost perfectly. The hard part is engineering the refusal, not the citation.
(b) Validate the output before returning it.
# apps/docs_assistant/answer.py
import re
from stack.llm import LLM
from stack.retrieve import HybridRetriever
# RAG_PROMPT (the cite-first prompt above) is assumed to be importable here
# or defined in this module.
CITE_RE = re.compile(r"\[([\w\-/:]+)\]")
NO_ANSWER_RE = re.compile(r"^\s*NO_ANSWER:\s*(.+)$", re.MULTILINE)
def render_chunks(chunks: list[dict]) -> str:
"""Format retrieved chunks as <chunk id="..."> blocks for the prompt."""
parts = []
for c in chunks:
cid = c["id"]
text = c["text"]
anchor = c.get("metadata", {}).get("anchor", "")
title = c.get("metadata", {}).get("doc_title", "")
parts.append(
f'<chunk id="{cid}" doc="{title}" anchor="{anchor}">\n{text}\n</chunk>'
)
return "\n\n".join(parts)
def answer_question(
question: str,
retriever: HybridRetriever,
llm: LLM,
) -> dict:
chunks = retriever.retrieve(question)
chunk_ids = {c["id"] for c in chunks}
prompt = RAG_PROMPT.format(
chunks_block=render_chunks(chunks),
question=question,
)
response = llm.chat(
messages=[{"role": "user", "content": prompt}],
temperature=0.0, # determinism for citations
)
text = response["choices"][0]["message"]["content"] or ""
# 1. Refusal path
    refusal = NO_ANSWER_RE.search(text)  # search, not match: tolerate a stray preamble line
if refusal:
return {
"answer": (
"I can't answer this from the Astro docs. "
+ refusal.group(1).strip()
),
"citations": [],
"refused": True,
}
# 2. Validate citations
cited_ids = set(CITE_RE.findall(text))
invalid = cited_ids - chunk_ids
if invalid:
# The model invented chunk IDs. Treat as a refusal — better
# to refuse than serve a polluted answer.
return {
"answer": (
"I tried to answer this but produced an inconsistent "
"response. Try rephrasing the question."
),
"citations": [],
"refused": True,
"_debug_invalid_ids": list(invalid),
}
# 3. Build the citations list, preserving retrieval order
citations = [
{
"id": c["id"],
"url": chunk_to_url(c),
"score": float(c.get("score", 0.0)),
}
for c in chunks
if c["id"] in cited_ids
]
return {
"answer": text,
"citations": citations,
"refused": False,
}
def chunk_to_url(chunk: dict) -> str:
"""Build the user-clickable URL from chunk metadata."""
base = "https://docs.astro.build"
doc_path = chunk["id"].split(":", 1)[1].rsplit("/", 1)[0] # e.g. "env-vars"
anchor = chunk.get("metadata", {}).get("anchor", "")
return f"{base}/{doc_path}/" + (f"#{anchor}" if anchor else "")
The validator is the difference between a “production” docs assistant and a “demo.” When the model invents an ID, we refuse rather than ship a corrupted citation. Costs us ~3% of queries that would otherwise return a partially-correct answer; saves us 100% of the cases where the user clicks a hallucinated link.
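The refusal-on-invalid-ID path is cheap to pin down with a test. A sketch using stub objects (the `StubRetriever` / `StubLLM` doubles are invented for the test, not part of the /ship stack):

```python
# apps/docs_assistant/test_answer.py — illustrative tests with hypothetical stubs.
from apps.docs_assistant.answer import answer_question


class StubRetriever:
    def retrieve(self, question: str) -> list[dict]:
        return [{"id": "docs:env-vars/2", "text": "Define variables in `.env`.", "metadata": {}}]


class StubLLM:
    def __init__(self, reply: str):
        self._reply = reply

    def chat(self, messages, temperature=0.0):
        return {"choices": [{"message": {"content": self._reply}}]}


def test_invented_id_triggers_refusal():
    out = answer_question(
        "How do env vars work?",
        StubRetriever(),
        StubLLM("Set variables in `.env` [docs:made-up/9]."),
    )
    assert out["refused"] is True and out["citations"] == []


def test_valid_citation_passes_through():
    out = answer_question(
        "How do env vars work?",
        StubRetriever(),
        StubLLM("Define variables in `.env` [docs:env-vars/2]."),
    )
    assert out["refused"] is False
    assert out["citations"][0]["id"] == "docs:env-vars/2"
```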
3. The “I don’t know” eval
Refusal is the hardest skill to test. A regression suite of “questions with known answers” doesn’t catch a bot that’s gone trigger-happy on refusals. A drift detector based on user thumbs-down doesn’t catch a bot that’s gotten over-confident — users aren’t always able to tell that an answer is wrong.
The eval that works:
# apps/docs_assistant/eval_refusal.py
from dataclasses import dataclass
from stack.eval import TaskCase
@dataclass
class RefusalCase:
"""A test case for refusal behavior."""
id: str
question: str
should_refuse: bool # ground truth
rationale: str # why
# Three categories; bucket sizes noted below.
REFUSAL_CASES = [
# 1. Should refuse: out-of-scope (~20 cases)
RefusalCase(
id="ref-001",
question="How do I deploy Astro to my Kubernetes cluster?",
should_refuse=True,
rationale="Astro docs don't cover K8s; user wants infra-specific guidance.",
),
RefusalCase(
id="ref-002",
question="What's the best way to learn React?",
should_refuse=True,
rationale="Off-topic; not Astro-specific.",
),
# ...
# 2. Should answer: covered, factual (~40 cases)
RefusalCase(
id="ref-021",
question="How do I configure Tailwind in Astro?",
should_refuse=False,
rationale="Astro has a tailwind integration with docs.",
),
# ...
# 3. Boundary: partially-covered (~10 cases)
RefusalCase(
id="ref-061",
question="How do I add Cloudflare Workers to my Astro project?",
should_refuse=False, # covered: there's an adapter
rationale="@astrojs/cloudflare exists and is documented.",
),
# ...
]
def grade_refusal(case: RefusalCase, response: dict) -> tuple[bool, str]:
"""Returns (correct, reason)."""
if case.should_refuse and response["refused"]:
return True, "correct refusal"
if not case.should_refuse and not response["refused"]:
return True, "correct answer"
if case.should_refuse and not response["refused"]:
return False, "should have refused but answered"
return False, "should have answered but refused"
Three buckets, roughly: 20 cases that should refuse (out-of-scope), 40 cases that should answer (in-scope, factual), 10 boundary cases. Run this every deploy alongside the regular regression suite from /ship/13. A change that drops refusal precision below 85% blocks the deploy, even if cite-coverage on the 40 answerable cases is fine.
The boundary cases are where the manual labor lives. You write 10 of them at the start; you add a new one every time the bot is wrong on a real user question. Within a month you have 30; within three you have 60. The boundary set grows with the failure modes you actually see — same pattern as the feedback-to-eval pipeline from /ship/13.
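Wiring the buckets into the deploy gate is a dozen lines. A sketch, assuming the `REFUSAL_CASES` / `grade_refusal` definitions above and an `answer_question` bound to the real retriever and LLM (the `server` import at the bottom is illustrative wiring):

```python
# apps/docs_assistant/run_refusal_eval.py — illustrative runner for the deploy gate.
import sys

from apps.docs_assistant.answer import answer_question
from apps.docs_assistant.eval_refusal import REFUSAL_CASES, grade_refusal


def run(retriever, llm) -> float:
    tp = fp = fn = 0
    for case in REFUSAL_CASES:
        response = answer_question(case.question, retriever, llm)
        correct, _ = grade_refusal(case, response)
        if response["refused"]:
            tp += int(correct)        # refused when it should
            fp += int(not correct)    # refused when it should have answered
        elif case.should_refuse:
            fn += 1                   # answered when it should have refused
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    print(f"refusal precision = {precision:.0%} (TP={tp} FP={fp} FN={fn})")
    return precision


if __name__ == "__main__":
    from apps.docs_assistant.server import llm, retriever  # assumed wiring
    if run(retriever, llm) < 0.85:  # the deploy gate from this section
        sys.exit(1)
```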
The ingestion pipeline
The full ingestion script, end to end:
# apps/docs_assistant/ingest.py
from __future__ import annotations
import os
from pathlib import Path
from stack.chunk import Chunk
from stack.embed import VectorStore
from apps.docs_assistant.chunk_mdx import MdxChunker
DOCS_ROOT = Path(os.environ.get("ASTRO_DOCS_PATH", "../docs/src/content/docs"))
DB_PATH = Path("data/docs.db")
def slugify(s: str) -> str:
return s.lower().replace(" ", "-").strip("-")
def ingest():
chunker = MdxChunker(target_tokens=350, overlap_tokens=60)
store = VectorStore(DB_PATH)
seen = 0
chunks_total = 0
for path in DOCS_ROOT.rglob("*.mdx"):
        # Skip translations: the repo nests docs under per-language folders
        # (docs/<lang>/...); only index the English tree.
        rel = path.relative_to(DOCS_ROOT)
        if rel.parts and rel.parts[0] != "en":
            # Adjust to your repo's structure.
            continue
text = path.read_text(encoding="utf-8")
# Source ID format: "docs:<slug-of-path>/<chunk-num>"
source_id = "docs:" + slugify(path.stem)
chunks = chunker.chunk(text, source_id=source_id)
for c in chunks:
c.metadata["source_path"] = str(rel)
store.add_chunks(chunks)
chunks_total += len(chunks)
seen += 1
print(f"Ingested {seen} files → {chunks_total} chunks → {DB_PATH}")
store.close()
if __name__ == "__main__":
ingest()
Run it:
git clone https://github.com/withastro/docs ../astro-docs
ASTRO_DOCS_PATH=../astro-docs/src/content/docs uv run python -m apps.docs_assistant.ingest
# → Ingested 380 files → 2143 chunks → data/docs.db
The whole ingestion takes ~5 minutes on a CPU laptop. The embedding step (MiniLM on CPU) is the bottleneck; on GPU it’s under 30 seconds.
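Before wiring the endpoint, a quick retrieval smoke test against the fresh index (the `HybridRetriever("data/docs.db")` constructor is an assumption about the /ship/08 API; the exact IDs printed depend on your slugs):

```python
# Smoke-test retrieval against the freshly built index.
from stack.retrieve import HybridRetriever

retriever = HybridRetriever("data/docs.db")  # assumed constructor signature
for c in retriever.retrieve("How do I use environment variables?")[:3]:
    title = c.get("metadata", {}).get("doc_title", "")
    print(c["id"], round(c.get("score", 0.0), 3), title)
# Expect chunk IDs from the environment-variables guide near the top.
```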
The eval story
We assembled three eval sets:
- 50 hand-curated factual Q&A (`evals/golden.jsonl`). Real questions from Astro Discord with verified answers. Graded by `grade_judge` from /ship/04 (1–5 score).
- 70 refusal cases (`evals/refusal.jsonl`). The three-bucket structure above. Graded by `grade_refusal` (binary correct/incorrect).
- A daily drift sample of 100 production traces, judged on the 1–5 quality scale.
After two weeks of iteration, the numbers:
| Metric | Score |
|---|---|
| Cite-coverage (claims with citations) | 96% |
| Cite-correctness (cited chunks contain the claim) | 89% |
| Factual answer mean (1–5) | 4.18 |
| Refusal precision (refused correctly when refused) | 91% |
| Refusal recall (caught when should refuse) | 87% |
The metrics, formalized:
# Cite-coverage: fraction of factual claims that have ≥1 citation
cite_coverage = |claims_with_cite| / |total_claims|
# Cite-correctness: fraction of cited claims actually supported
# by the cited chunk (judged by an LLM running a (claim, chunk) entailment check)
cite_correctness = |claims_supported_by_cite| / |claims_with_cite|
# Refusal precision/recall: standard binary classification on the 70 refusal cases
refusal_precision = TP / (TP + FP) # of refusals, how many were correct refusals
refusal_recall = TP / (TP + FN) # of cases that should have refused, how many did
# A correct refusal counts as "TP for refusal"; a confident wrong answer
# on an unanswerable question counts as "FN for refusal" (= a hallucination
# the system failed to catch).
Cite-correctness (89%) lags cite-coverage (96%) because the model is good at attaching citations but worse at picking the right one. The fix is a verifier pass: run a small LLM on (claim, cited_chunk) pairs to score support, and flag low-support pairs to the user.
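A minimal sketch of that verifier pass, assuming the same `LLM` client as above; the prompt wording, the 0/1 output convention, and the claim-splitting are illustrative, not the shipped system:

```python
# Per-claim entailment check — illustrative sketch of the "verifier pass".
import re

from stack.llm import LLM

VERIFY_PROMPT = """\
Chunk:
{chunk_text}

Claim:
{claim}

Does the chunk support the claim? Answer with a single digit:
1 = supported, 0 = not supported.
"""


def claim_supported(claim: str, chunk_text: str, llm: LLM) -> bool:
    response = llm.chat(
        messages=[{"role": "user", "content": VERIFY_PROMPT.format(chunk_text=chunk_text, claim=claim)}],
        temperature=0.0,
    )
    text = response["choices"][0]["message"]["content"] or ""
    digit = re.search(r"[01]", text)
    return bool(digit and digit.group() == "1")
```

Splitting the answer into (claim, cited chunk) pairs (roughly, sentence boundaries plus the trailing bracketed IDs) is the remaining plumbing; any pair that scores 0 gets flagged rather than silently shipped.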
Three takeaways:
- Cite-correctness is the bottleneck. The model cites confidently, but the cited chunk doesn't always literally contain the claim. Fix candidate: a per-claim verification pass that re-runs the LLM on `(claim, chunk)` pairs and flags low-support claims. Future work.
- Refusal precision (91%) was the result of three iterations. The first version refused too readily on technical jargon ("how do I use SSR" → refused because the model wasn't sure SSR was covered, even though it is). Adding "if uncertain, prefer answering with conservative citations" to the prompt lifted precision from 78% to 91% without dropping recall.
- The factual answer mean of 4.18 is solid for this corpus; the worst tier is dominated by partial answers — questions covered by 2 chunks where the bot only retrieved 1. Improving retrieval (raising `top_k` from 8 to 12 and tuning RRF weights) lifted this to 4.32 in version 4.
The full benchmark script lives at apps/docs_assistant/bench.py; it runs in ~6 minutes against a local vLLM.
What we’d change in v2
After running the bot internally for a month, three changes for v2:
- A “show me the source” button. Users wanted to see the raw chunk that supported a claim, not just the URL. We’d render an expandable accordion under each citation showing the chunk text. ~20 lines of frontend code; biggest UX win we left on the table.
- Per-claim verification. Bring cite-correctness from 89% to 95+. Cost: one extra small LLM call per claim, ~$0.001 per response. Worth it.
- Conversation memory. v1 is single-shot; users wanted follow-ups ("why?", "show me an example"). Adding the agent loop from /ship/10 over the same retrieve-and-cite primitive turns the bot into a multi-turn one. Roughly a day of work, including the eval expansions to cover follow-up coherence.
The thing we wouldn't change: the refusal discipline. Several users initially asked us to "make the bot more helpful" by softening refusals. We didn't. Six weeks in, those same users were the ones citing the bot's reliability as the reason they used it. Trust is the moat for docs Q&A. Don't trade it for engagement metrics.
Try this — predict the eval delta
Mental experiments to play forward on this stack:
- Drop the cross-encoder rerank from /ship/08. Predict: cite-correctness drops 2–4 points (89% → ~86%) because the dense retriever's near-misses bubble back to top-1. Cite-coverage barely moves (the LLM still cites whatever's in front of it). The rerank earns its compute by sharpening which chunks the model gets — see the Reranker Lab demo live.
- Swap MiniLM-L6 (384-d) for `text-embedding-3-large` (3072-d). Predict: ~3 points of cite-correctness, because the embedder distinguishes more nuanced phrasings, at ~10× the embedding cost. Worth it for high-stakes corpora; overkill for a docs bot where the cross-encoder rerank carries most of the lift. Try the Embedding Playground to feel how sentence-transformer similarity scores cluster.
- Swap MdxChunker for the naive `RecursiveCharChunker` (no MDX awareness). Predict: retrieval@1 drops 5–8 points because chunks now contain partial JSX tags and split inside `<Aside>` blocks. The Chunking Strategies demo shows the same trade-off live: same text, different chunker, different downstream retrieval.
- Remove the citation validator (just trust the LLM's `[docs:...]` markers). Predict: cite-correctness drops to ~70% because the model invents IDs roughly 5% of the time, and those invented citations ship to users as broken links. Refusal precision is unaffected — refusals don't depend on the validator.
- Triple the corpus (Astro + React + Next.js docs together). Predict: retrieval@1 drops 5–10 points because the embedding space gets denser and BM25's IDF terms dilute. Counter: train a cross-encoder fine-tuned on YOUR domain — the biggest single lift against the failure modes you actually see.
Cross-references
Demos that exercise the underlying pieces:
- RAG Visualizer demo — dense vs BM25 vs hybrid on a small real corpus, with all three score columns side by side
- Reranker Lab demo — bi-encoder vs cross-encoder rerank on real precomputed scores; shows the rank-shuffle this case study depends on
- Embedding Playground — real MiniLM-L6 embeddings + cosine math, the engine under both this case study’s retrieval and the citation match
- Chunking Strategies demo — try the demo’s free-text input with an Astro doc to see what each chunker does
Code-side companions in /ship:
- /ship/06 — RAG fundamentals (chunking) — the chunker base classes the MdxChunker extends
- /ship/07 — Vector store — the embedding + storage layer
- /ship/08 — Retrieval — BM25 + dense + rerank under the hood
- /ship/13 — Evaluation in production — the eval-pipeline patterns this study extends
What this case study taught vs /ship
What /ship taught (and you reused):
- The 3-stage retrieval pipeline (BM25 + dense + rerank)
- The FastAPI service shape, including auth, JSON logging, trace IDs
- The eval harness and the prod-eval pipeline
- The cost-tuning levers (used here: the prompt cache for repeat queries)
What this case study added on top:
- MDX-aware chunking — collapsing components, preserving anchors
- Citation-first prompting + structural validation — refuse when IDs invalid
- Refusal evals as first-class — three-bucket test set, separate metrics
- Trust-over-coverage as a product principle — not just code
That ratio (~70% reuse, ~30% new work) is normal. A real product is mostly stack reuse plus a small layer of product-specific glue. Keeping that glue clean is the case-study skill.
Next
Case study 02 is a code-review agent — reads a PR diff, runs the test suite, leaves inline comments, produces a verdict. Where the docs assistant exercised the retrieval pipeline, the code-review agent exercises the agent loop from /ship/10: tools, observation, planning, multi-turn reasoning. Different muscle, same /ship foundations.