Cross-Stage Projects
Bigger worked projects spanning multiple stages. The kind of thing you can read end-to-end and copy as a starting point for your own work.
These are sketches, not turn-key code — adapt to your stack.
Project 1 — RAG over your own notes (Stages 5, 8, 9, 13)
A knowledge-search app over personal notes / docs.
Architecture:
notes/*.md
↓ chunk (recursive char splitter, ~512 tokens, 50 overlap)
↓ embed (bge-large-en-v1.5)
↓ store (pgvector or ChromaDB)
↓ retrieve (hybrid: dense + BM25, RRF fuse)
↓ rerank (bge-reranker-v2-m3, top 50 → top 5)
↓ generate (Claude Haiku, prompt with strict citation)
↓ verify (citation exists in retrieved set)
↓ output: answer + sources
See Stage 9 solutions for the working code.
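Most of the answer quality comes from the retrieval step. As a taste of the hybrid step, here is a minimal reciprocal-rank-fusion (RRF) sketch for merging the dense and BM25 rankings; the chunk IDs and the k constant are illustrative, not prescriptive.

from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 50) -> list[str]:
    # Each ranking contributes 1 / (k + rank) per chunk; higher fused score = better.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# dense_ids / bm25_ids: chunk IDs ranked by each retriever (illustrative values)
dense_ids = ["c12", "c3", "c48"]
bm25_ids = ["c3", "c7", "c12"]
top_50 = rrf_fuse([dense_ids, bm25_ids])  # feed these to the reranker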
What to add for production:
- Observability with Langfuse (Stage 13).
- 50-query golden set with recall@10 + faithfulness measurements (recall sketch after this list).
- Faithfulness LLM-judge runs on a 1% sample of prod traffic.
- Prompt cache for the system prompt + tool definitions.
- Per-user rate limiting + cost cap.
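For the golden-set item, recall@10 is a few lines once you settle on a data shape; the retrieve() signature and its list-of-IDs return value below are assumptions to adapt.

def recall_at_k(cases: list[dict], retrieve, k: int = 10) -> float:
    # cases: [{"query": ..., "relevant_ids": [...]}, ...]
    # retrieve(query, top_k) is assumed to return a list of chunk IDs.
    per_query = []
    for case in cases:
        got = set(retrieve(case["query"], top_k=k))
        relevant = set(case["relevant_ids"])
        per_query.append(len(got & relevant) / len(relevant))
    return sum(per_query) / len(per_query)

# print(f"recall@10: {recall_at_k(golden_cases, retrieve):.2%}")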
Project 2 — Multi-hop research agent (Stages 8, 9, 11)
Agent that answers questions requiring multiple lookups.
# See stage-11.md for the loop. Add these tools:
tools = [
    web_search,   # broad search
    read_url,     # extract a page
    search_kb,    # search your private RAG
    finalize,     # explicit terminator with answer + citations
]
# System prompt:
SYSTEM = """You are a research assistant. Use web_search and read_url to find
information. Use search_kb when a question is about internal knowledge.
Always cite sources in your final answer using [1], [2], etc., listing them
at the bottom. If sources disagree, note that. If you can't find the answer
after 3 tries, say so honestly.
Use the finalize tool when you have a complete answer."""
Variations:
- Add a planner that runs first (Stage 11 — planning).
- Run multiple queries in parallel with asyncio.gather (sketch after this list).
- Output structured JSON for downstream consumption.
- Connect to Slack as a bot.
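The parallel-queries variation is mostly asyncio.gather; web_search below is a stand-in for your real async search client, not a specific library.

import asyncio

async def web_search(query: str) -> list[dict]:
    # Stand-in: replace with your real async search client.
    return []

async def parallel_search(queries: list[str]) -> dict[str, list[dict]]:
    # One gather instead of one agent turn per query.
    results = await asyncio.gather(*(web_search(q) for q in queries))
    return dict(zip(queries, results))

# asyncio.run(parallel_search(["RLHF vs DPO", "DPO original paper"]))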
Project 3 — Vertical agent: PR reviewer (Stages 8, 9, 11, 14)
Automated code review on a GitHub PR.
Inputs:
- PR diff (from gh CLI)
- Repo context (CLAUDE.md / README / conventions)
- Files touched (full bodies for context)
Steps:
1. Plan: identify what the PR is doing.
2. Read: pull each touched file in full.
3. Inspect: look for bugs, style violations, missing tests.
4. Run: lint + type-check + tests in a sandbox.
5. Suggest: post comments on the PR.
Tools:
- gh CLI for fetching the diff and posting comments
- Repository file reader (with size cap)
- Shell tool (sandboxed) for running tests
- MCP server for static analysis
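A thin subprocess wrapper over the gh CLI covers the first two tools; the helper names are placeholders and error handling is omitted.

import subprocess

def _gh(args: list[str]) -> str:
    return subprocess.run(["gh", *args], capture_output=True, text=True, check=True).stdout

def fetch_pr_diff(pr_number: int) -> str:
    # Unified diff for the PR in the current repo.
    return _gh(["pr", "diff", str(pr_number)])

def post_pr_comment(pr_number: int, body: str) -> None:
    # Top-level PR comment; keep this behind the confirmation gate.
    _gh(["pr", "comment", str(pr_number), "--body", body])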
Patterns from Stage 14 — text-to-code:
- Verifier loop: run tests; if failing, ask the model to investigate.
- Confirmation gate before posting comments (manual approval initially).
- Per-PR token budget.
- Always read the full file, not just the diff, so the model has real context for the review.
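Tying the verifier loop and the confirmation gate together, reusing the gh wrappers above; run_tests() and ask_model() are hypothetical helpers and the PR number is illustrative.

def review_pr(pr_number: int) -> str:
    diff = fetch_pr_diff(pr_number)
    passed, log = run_tests()  # hypothetical: lint + type-check + tests, run in the sandbox
    prompt = f"Review this diff for bugs, style issues, and missing tests.\n\n{diff}"
    if not passed:
        # Verifier signal: include the failure instead of trusting the diff at face value.
        prompt += f"\n\nThe test suite is failing. Investigate this log as part of the review:\n{log}"
    return ask_model(prompt)  # hypothetical LLM call

review = review_pr(1234)
print(review)
if input("Post this review to the PR? [y/N] ").strip().lower() == "y":  # confirmation gate
    post_pr_comment(1234, review)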
Project 4 — Domain-tuned Q&A bot (Stages 9, 10)
A chatbot specialized for a domain via RAG + LoRA.
Phase 1: RAG baseline
- Build a RAG over the domain corpus.
- Build a 100-query eval set with expected answers.
- Measure end-to-end correctness on the baseline.
Phase 2: Synthetic data for fine-tuning
- For each chunk, have an LLM generate 3 (question, answer) pairs.
- Filter: keep pairs where the answer is supported by the chunk.
- Result: ~10k high-quality (q, a) pairs.
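A sketch of the Phase 2 generate-and-filter step, assuming an llm() helper that returns text and a chunks list from the Phase 1 corpus; both prompts are starting points, not tested ones.

import json

def generate_pairs(chunk: str, n: int = 3) -> list[dict]:
    prompt = (
        f"Passage:\n\n{chunk}\n\n"
        f"Write {n} question/answer pairs that this passage fully answers. "
        'Return a JSON list of {"question": ..., "answer": ...} objects.'
    )
    return json.loads(llm(prompt))

def is_supported(pair: dict, chunk: str) -> bool:
    # Cheap LLM-judge filter: keep only pairs the chunk actually supports.
    verdict = llm(
        f"Passage:\n{chunk}\n\nQ: {pair['question']}\nA: {pair['answer']}\n\n"
        "Is the answer fully supported by the passage? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

dataset = [p for chunk in chunks for p in generate_pairs(chunk) if is_supported(p, chunk)]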
Phase 3: LoRA fine-tune
- LoRA-fine-tune a 7B-class model (e.g. Qwen2.5-7B-Instruct) on the synthetic data.
- Use TRL or Axolotl. ~$10–$30 of compute on RunPod / Modal.
- Hold out 10% for eval.
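The rough shape of the fine-tune with TRL + PEFT; argument names move between TRL versions and the hyperparameters are ballpark, so check the current TRL docs rather than treating this as canonical. It also assumes the pairs have already been formatted into the chat/text format the trainer expects.

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="synthetic_qa_train.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    train_dataset=train_ds,
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="domain-qa-lora",
        num_train_epochs=2,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
)
trainer.train()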
Phase 4: Compare
- Frontier model + RAG (baseline).
- Fine-tuned 7B + RAG.
- Fine-tuned 7B alone (no RAG).
Often: fine-tune + RAG matches or beats frontier + RAG at roughly 10× lower inference cost.
Things to watch:
- Quality of synthetic pairs is everything.
- Mix in some general instruction data to avoid catastrophic forgetting.
- Eval must be held out from synthesis.
- Inference cost: serve fine-tuned model with vLLM.
Project 5 — Voice agent for scheduling (Stages 11, 12, 13)
A voice-driven scheduling assistant.
Pipeline:
user voice
↓ Whisper Large v3 Turbo (streaming ASR)
↓ VAD detects end-of-utterance
↓ Claude with tools: get_calendar, propose_time, send_invite
↓ ElevenLabs streaming TTS
↓ user hears reply
Key engineering:
- Sub-700 ms end-to-end latency at p95
- Interruption handling (cut TTS when user speaks)
- Tool use with confirmation for actions (sending invites)
- Fallback to typed input on ASR failure
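The confirmation requirement is easier to enforce in code than in the prompt. A minimal gate, with the tool names and the confirm() callback as illustrative stand-ins:

SAFE_TOOLS = {"get_calendar", "propose_time"}  # read-only, no confirmation needed

def execute_tool(name: str, args: dict, confirm) -> dict:
    # confirm() asks the user (by voice or text) and returns True/False.
    if name not in SAFE_TOOLS and not confirm(f"About to run {name} with {args}. Go ahead?"):
        return {"status": "cancelled", "reason": "user declined"}
    return TOOL_IMPLS[name](**args)  # TOOL_IMPLS: your name -> function registry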
Stack candidates (early 2026):
- ASR: Whisper Large v3 Turbo (self-hosted) or Deepgram (managed).
- LLM: Claude Sonnet 4.6 with prompt caching.
- TTS: ElevenLabs (streaming).
- Glue: Vapi, Retell, or custom WebSocket server.
Or: OpenAI Realtime / GPT-4o for end-to-end voice without separate ASR/TTS — lower latency, less control.
Project 6 — Daily news brief (Stages 9, 11, 13)
A scheduled agent that produces a personalized daily briefing.
Daily 6am job:
1. For each topic the user follows:
- search_news(topic, since=yesterday)
- rank by importance (LLM judge)
- keep top 3
2. For each kept article:
- read_url
- summarize (2 sentences)
3. Compile briefing as markdown
4. Email or Slack
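The whole job fits in one function; search_news, rank_importance, read_url, summarize, and the scheduling comment are all placeholders for whatever you plug in.

def daily_brief(topics: list[str], since: str) -> str:
    sections = []
    for topic in topics:
        articles = search_news(topic, since=since)
        top3 = sorted(articles, key=rank_importance, reverse=True)[:3]  # LLM-judge score
        bullets = [f"- {a['title']}: {summarize(read_url(a['url']))}" for a in top3]
        sections.append(f"## {topic}\n" + "\n".join(bullets))
    return "# Daily brief\n\n" + "\n\n".join(sections)

# Run from cron (or your scheduler) at 06:00, then email or post to Slack.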
Personalization:
- User follows topics: { "AI safety", "Anthropic", "open-source LLMs" }
- User preferences: brief style, length cap, "skip these sources"
Patterns:
- Scheduled background agent (no real-time interaction).
- Output is a file, not a chat — different UX.
- Cost is bounded by topics × frequency; predictable.
- Easy to share: publish it as a static daily HTML page.
Project 7 — On-device summarizer (Stages 7, 12, 13)
A summarization tool running fully on a Mac via MLX.
Stack:
- mlx-lm for inference (Apple Silicon)
- 4-bit quantized Llama-3.2-3B or Phi-4
- Local web UI (FastAPI + simple HTML)
- No network calls — fully offline
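The inference side is a few lines with mlx-lm; the model ID is one of the mlx-community 4-bit conversions, and the exact generate() parameters can vary with the mlx-lm version.

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

def summarize(text: str) -> str:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": f"Summarize in 3 bullet points:\n\n{text}"}],
        add_generation_prompt=True,
        tokenize=False,
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=300)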
Use cases:
- Summarize a long email
- Extract action items from a meeting transcript
- Bullet-point a long article
Why this matters:
- Privacy: no data leaves the device.
- Cost: zero per-call.
- Latency: instant.
- Demonstrates the on-device frontier (Stage 7 — frontier architectures).
Project 8 — Eval harness (Stages 9, 13)
Infrastructure project: build a reusable LLM eval harness.
Features:
- Datasets: load (q, expected, metadata) tuples from JSON/CSV.
- Runners: invoke any LLM (LiteLLM-style abstraction).
- Metrics: exact match, F1, semantic similarity, LLM-judge.
- Reporting: per-case + aggregate; HTML output; W&B integration.
- CI: run on PR; block if regression beyond threshold.
- History: track metric trends over time.
Look at promptfoo, DeepEval, and Inspect for inspiration. Building one from scratch teaches what evals actually require.
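A stripped-down skeleton of the core loop, with exact match as the only metric; run_model and the JSON layout are the seams you would later swap for a LiteLLM runner, an LLM-judge metric, and an HTML report.

import json
from typing import Callable

def load_cases(path: str) -> list[dict]:
    # Each case: {"question": ..., "expected": ..., "metadata": {...}}
    return json.loads(open(path).read())

def exact_match(output: str, expected: str) -> float:
    return float(output.strip().lower() == expected.strip().lower())

def run_eval(cases: list[dict], run_model: Callable[[str], str]) -> dict:
    per_case = [
        {"question": c["question"], "score": exact_match(run_model(c["question"]), c["expected"])}
        for c in cases
    ]
    return {"per_case": per_case, "mean": sum(r["score"] for r in per_case) / len(per_case)}

# report = run_eval(load_cases("golden.json"), run_model=my_model_fn)
# In CI: fail the build if report["mean"] regresses beyond your threshold.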
Bonus: open-source it.
Choose your project
For each project above, ask:
- Does it match my career direction (Track A / B / C)?
- Is it small enough to ship in 2–4 weeks?
- Is it useful enough to actually use after I ship it?
Pick one. Ship it. Write up what you learned. That’s the most important learning step in the entire path.