Cross-Stage Projects
Bigger worked projects spanning multiple stages. The kind of thing you can read end-to-end and copy as a starting point for your own work.
These are sketches, not turn-key code — adapt to your stack.
Project 1 — RAG over your own notes (Stages 5, 8, 9, 13)
A knowledge-search app over personal notes / docs.
Architecture:
notes/*.md
↓ chunk (recursive char splitter, ~512 tokens, 50 overlap)
↓ embed (bge-large-en-v1.5)
↓ store (pgvector or ChromaDB)
↓ retrieve (hybrid: dense + BM25, RRF fuse)
↓ rerank (bge-reranker-v2-m3, top 50 → top 5)
↓ generate (Claude Haiku, prompt with strict citation)
↓ verify (citation exists in retrieved set)
↓ output: answer + sources
See Stage 9 solutions for the working code.
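Most of the answer quality comes from the retrieval step. As a taste of the hybrid step, here is a minimal reciprocal-rank-fusion (RRF) sketch for merging the dense and BM25 rankings; the chunk IDs and the k constant are illustrative, not prescriptive.

from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 50) -> list[str]:
    # Each ranking contributes 1 / (k + rank) per chunk; higher fused score = better.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# dense_ids / bm25_ids: chunk IDs ranked by each retriever (illustrative values)
dense_ids = ["c12", "c3", "c48"]
bm25_ids = ["c3", "c7", "c12"]
top_50 = rrf_fuse([dense_ids, bm25_ids])  # feed these to the reranker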
What to add for production:
- Observability with Langfuse (Stage 13).
- 50-query golden set with recall@10 + faithfulness measurements (recall sketch after this list).
- Faithfulness LLM-judge runs on a 1% sample of prod traffic.
- Prompt cache for the system prompt + tool definitions.
- Per-user rate limiting + cost cap.
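For the golden-set item, recall@10 is a few lines once you settle on a data shape; the retrieve() signature and its list-of-IDs return value below are assumptions to adapt.

def recall_at_k(cases: list[dict], retrieve, k: int = 10) -> float:
    # cases: [{"query": ..., "relevant_ids": [...]}, ...]
    # retrieve(query, top_k) is assumed to return a list of chunk IDs.
    per_query = []
    for case in cases:
        got = set(retrieve(case["query"], top_k=k))
        relevant = set(case["relevant_ids"])
        per_query.append(len(got & relevant) / len(relevant))
    return sum(per_query) / len(per_query)

# print(f"recall@10: {recall_at_k(golden_cases, retrieve):.2%}")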
Project 2 — Multi-hop research agent (Stages 8, 9, 11)
Agent that answers questions requiring multiple lookups.
# See stage-11.md for the loop. Add these tools:
tools = [
    web_search,   # broad search
    read_url,     # extract a page
    search_kb,    # search your private RAG
    finalize,     # explicit terminator with answer + citations
]
# System prompt:
SYSTEM = """You are a research assistant. Use web_search and read_url to find
information. Use search_kb when a question is about internal knowledge.
Always cite sources in your final answer using [1], [2], etc., listing them
at the bottom. If sources disagree, note that. If you can't find the answer
after 3 tries, say so honestly.
Use the finalize tool when you have a complete answer."""
Variations:
- Add a planner that runs first (Stage 11 — planning).
- Run multiple queries in parallel with asyncio.gather (sketch after this list).
- Output structured JSON for downstream consumption.
- Connect to Slack as a bot.
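The parallel-queries variation is mostly asyncio.gather; web_search below is a stand-in for your real async search client, not a specific library.

import asyncio

async def web_search(query: str) -> list[dict]:
    # Stand-in: replace with your real async search client.
    return []

async def parallel_search(queries: list[str]) -> dict[str, list[dict]]:
    # One gather instead of one agent turn per query.
    results = await asyncio.gather(*(web_search(q) for q in queries))
    return dict(zip(queries, results))

# asyncio.run(parallel_search(["RLHF vs DPO", "DPO original paper"]))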
Project 3 — Vertical agent: PR reviewer (Stages 8, 9, 11, 14)
Automated code review on a GitHub PR.
Inputs:
- PR diff (from gh CLI)
- Repo context (CLAUDE.md / README / conventions)
- Files touched (full bodies for context)
Steps:
1. Plan: identify what the PR is doing.
2. Read: pull each touched file in full.
3. Inspect: look for bugs, style violations, missing tests.
4. Run: lint + type-check + tests in a sandbox.
5. Suggest: post comments on the PR.
Tools:
- gh CLI for fetching the diff and posting comments
- Repository file reader (with size cap)
- Shell tool (sandboxed) for running tests
- MCP server for static analysis
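A thin subprocess wrapper over the gh CLI covers the first two tools; the helper names are placeholders and error handling is omitted.

import subprocess

def _gh(args: list[str]) -> str:
    return subprocess.run(["gh", *args], capture_output=True, text=True, check=True).stdout

def fetch_pr_diff(pr_number: int) -> str:
    # Unified diff for the PR in the current repo.
    return _gh(["pr", "diff", str(pr_number)])

def post_pr_comment(pr_number: int, body: str) -> None:
    # Top-level PR comment; keep this behind the confirmation gate.
    _gh(["pr", "comment", str(pr_number), "--body", body])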
Patterns from Stage 14 — text-to-code:
- Verifier loop: run tests; if failing, ask the model to investigate.
- Confirmation gate before posting comments (manual approval initially).
- Per-PR token budget.
- Always read the full file, not just the diff, so the model has real context for the review.
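Tying the verifier loop and the confirmation gate together, reusing the gh wrappers above; run_tests() and ask_model() are hypothetical helpers and the PR number is illustrative.

def review_pr(pr_number: int) -> str:
    diff = fetch_pr_diff(pr_number)
    passed, log = run_tests()  # hypothetical: lint + type-check + tests, run in the sandbox
    prompt = f"Review this diff for bugs, style issues, and missing tests.\n\n{diff}"
    if not passed:
        # Verifier signal: include the failure instead of trusting the diff at face value.
        prompt += f"\n\nThe test suite is failing. Investigate this log as part of the review:\n{log}"
    return ask_model(prompt)  # hypothetical LLM call

review = review_pr(1234)
print(review)
if input("Post this review to the PR? [y/N] ").strip().lower() == "y":  # confirmation gate
    post_pr_comment(1234, review)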
Project 4 — Domain-tuned Q&A bot (Stages 9, 10)
A chatbot specialized for a domain via RAG + LoRA.
Phase 1: RAG baseline
- Build a RAG over the domain corpus.
- Build a 100-query eval set with expected answers.
- Measure end-to-end correctness on the baseline.
Phase 2: Synthetic data for fine-tuning
- For each chunk, have an LLM generate 3 (question, answer) pairs.
- Filter: keep pairs where the answer is supported by the chunk.
- Result: ~10k high-quality (q, a) pairs.
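A sketch of the Phase 2 generate-and-filter step, assuming an llm() helper that returns text and a chunks list from the Phase 1 corpus; both prompts are starting points, not tested ones.

import json

def generate_pairs(chunk: str, n: int = 3) -> list[dict]:
    prompt = (
        f"Passage:\n\n{chunk}\n\n"
        f"Write {n} question/answer pairs that this passage fully answers. "
        'Return a JSON list of {"question": ..., "answer": ...} objects.'
    )
    return json.loads(llm(prompt))

def is_supported(pair: dict, chunk: str) -> bool:
    # Cheap LLM-judge filter: keep only pairs the chunk actually supports.
    verdict = llm(
        f"Passage:\n{chunk}\n\nQ: {pair['question']}\nA: {pair['answer']}\n\n"
        "Is the answer fully supported by the passage? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

dataset = [p for chunk in chunks for p in generate_pairs(chunk) if is_supported(p, chunk)]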
Phase 3: LoRA fine-tune
- LoRA-fine-tune a 7B-class model (e.g. Qwen2.5-7B-Instruct) on the synthetic data.
- Use TRL or Axolotl. ~$10–$30 of compute on RunPod / Modal.
- Hold out 10% for eval.
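The rough shape of the fine-tune with TRL + PEFT; argument names move between TRL versions and the hyperparameters are ballpark, so check the current TRL docs rather than treating this as canonical. It also assumes the pairs have already been formatted into the chat/text format the trainer expects.

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="synthetic_qa_train.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    train_dataset=train_ds,
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="domain-qa-lora",
        num_train_epochs=2,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
)
trainer.train()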
Phase 4: Compare
- Frontier model + RAG (baseline).
- Fine-tuned 7B + RAG.
- Fine-tuned 7B alone (no RAG).
Often: fine-tune + RAG matches or beats frontier + RAG at roughly 10× lower inference cost.
Things to watch:
- Quality of synthetic pairs is everything.
- Mix in some general instruction data to avoid catastrophic forgetting.
- Eval must be held out from synthesis.
- Inference cost: serve fine-tuned model with vLLM.
Project 5 — Voice agent for scheduling (Stages 11, 12, 13)
A voice-driven scheduling assistant.
Pipeline:
user voice
↓ Whisper Large v3 Turbo (streaming ASR)
↓ VAD detects end-of-utterance
↓ Claude with tools: get_calendar, propose_time, send_invite
↓ ElevenLabs streaming TTS
↓ user hears reply
Key engineering:
- Sub-700 ms end-to-end latency at p95
- Interruption handling (cut TTS when user speaks)
- Tool use with confirmation for actions (sending invites)
- Fallback to typed input on ASR failure
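The confirmation requirement is easier to enforce in code than in the prompt. A minimal gate, with the tool names and the confirm() callback as illustrative stand-ins:

SAFE_TOOLS = {"get_calendar", "propose_time"}  # read-only, no confirmation needed

def execute_tool(name: str, args: dict, confirm) -> dict:
    # confirm() asks the user (by voice or text) and returns True/False.
    if name not in SAFE_TOOLS and not confirm(f"About to run {name} with {args}. Go ahead?"):
        return {"status": "cancelled", "reason": "user declined"}
    return TOOL_IMPLS[name](**args)  # TOOL_IMPLS: your name -> function registry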
Stack candidates (early 2026):
- ASR: Whisper Large v3 Turbo (self-hosted) or Deepgram (managed).
- LLM: Claude Sonnet 4.6 with prompt caching.
- TTS: ElevenLabs (streaming).
- Glue: Vapi, Retell, or custom WebSocket server.
Or: OpenAI Realtime / GPT-4o for end-to-end voice without separate ASR/TTS — lower latency, less control.
Project 6 — Daily news brief (Stages 9, 11, 13)
A scheduled agent that produces a personalized daily briefing.
Daily 6am job:
1. For each topic the user follows:
- search_news(topic, since=yesterday)
- rank by importance (LLM judge)
- keep top 3
2. For each kept article:
- read_url
- summarize (2 sentences)
3. Compile briefing as markdown
4. Email or Slack
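The whole job fits in one function; search_news, rank_importance, read_url, summarize, and the scheduling comment are all placeholders for whatever you plug in.

def daily_brief(topics: list[str], since: str) -> str:
    sections = []
    for topic in topics:
        articles = search_news(topic, since=since)
        top3 = sorted(articles, key=rank_importance, reverse=True)[:3]  # LLM-judge score
        bullets = [f"- {a['title']}: {summarize(read_url(a['url']))}" for a in top3]
        sections.append(f"## {topic}\n" + "\n".join(bullets))
    return "# Daily brief\n\n" + "\n\n".join(sections)

# Run from cron (or your scheduler) at 06:00, then email or post to Slack.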
Personalization:
- User follows topics: { "AI safety", "Anthropic", "open-source LLMs" }
- User preferences: brief style, length cap, "skip these sources"
Patterns:
- Scheduled background agent (no real-time interaction).
- Output is a file, not a chat — different UX.
- Cost is bounded by topics × frequency; predictable.
- Easy to share: publish it as a static daily HTML page.
Project 7 — On-device summarizer (Stages 7, 12, 13)
A summarization tool running fully on a Mac via MLX.
Stack:
- mlx-lm for inference (Apple Silicon)
- 4-bit quantized Llama-3.2-3B or Phi-4
- Local web UI (FastAPI + simple HTML)
- No network calls — fully offline
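The inference side is a few lines with mlx-lm; the model ID is one of the mlx-community 4-bit conversions, and the exact generate() parameters can vary with the mlx-lm version.

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

def summarize(text: str) -> str:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": f"Summarize in 3 bullet points:\n\n{text}"}],
        add_generation_prompt=True,
        tokenize=False,
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=300)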
Use cases:
- Summarize a long email
- Extract action items from a meeting transcript
- Bullet-point a long article
Why this matters:
- Privacy: no data leaves the device.
- Cost: zero per-call.
- Latency: instant.
- Demonstrates the on-device frontier (Stage 7 — frontier architectures).
Project 8 — Eval harness (Stages 9, 13)
Infrastructure project: build a reusable LLM eval harness.
Features:
- Datasets: load (q, expected, metadata) tuples from JSON/CSV.
- Runners: invoke any LLM (LiteLLM-style abstraction).
- Metrics: exact match, F1, semantic similarity, LLM-judge.
- Reporting: per-case + aggregate; HTML output; W&B integration.
- CI: run on PR; block if regression beyond threshold.
- History: track metric trends over time.
Look at promptfoo, DeepEval, and Inspect for inspiration. Building one from scratch teaches what evals actually require.
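A stripped-down skeleton of the core loop, with exact match as the only metric; run_model and the JSON layout are the seams you would later swap for a LiteLLM runner, an LLM-judge metric, and an HTML report.

import json
from typing import Callable

def load_cases(path: str) -> list[dict]:
    # Each case: {"question": ..., "expected": ..., "metadata": {...}}
    return json.loads(open(path).read())

def exact_match(output: str, expected: str) -> float:
    return float(output.strip().lower() == expected.strip().lower())

def run_eval(cases: list[dict], run_model: Callable[[str], str]) -> dict:
    per_case = [
        {"question": c["question"], "score": exact_match(run_model(c["question"]), c["expected"])}
        for c in cases
    ]
    return {"per_case": per_case, "mean": sum(r["score"] for r in per_case) / len(per_case)}

# report = run_eval(load_cases("golden.json"), run_model=my_model_fn)
# In CI: fail the build if report["mean"] regresses beyond your threshold.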
Bonus: open-source it.
Choose your project
For each project above, ask:
- Does it match my career direction (Track A / B / C)?
- Is it small enough to ship in 2–4 weeks?
- Is it useful enough to actually use after I ship it?
Pick one. Ship it. Write up what you learned. That’s the most important learning step in the entire path.