Track B — ML Engineer → LLM Specialist

For someone with classical ML / deep-learning experience who wants to specialize in LLMs and modern AI engineering. You know what backprop is. You’ve trained CNNs or RNNs. You can read papers. What you need is the LLM-specific stack — architecturally, operationally, and culturally.

Time: 12–18 weeks at ~10 hours/week. Endpoint: you can train, fine-tune, deploy, and operate LLM-based systems at production quality, and contribute meaningfully on a research-leaning team.


Skim, don’t re-read

You probably already know most of:

  • Stage 1 (math foundations) — quick refresh on cross-entropy, KL divergence.
  • Stage 2 (ML fundamentals) — confirm modern LLM eval differences (LLM-as-judge, perplexity, etc.).
  • Stage 3 (neural networks) — confirm modern optimizer defaults (AdamW, warmup + cosine).

Skim the READMEs and exercises; spot-check anything that’s been a while.


Week-by-week

Weeks 1–2 — From RNNs to transformers, mechanically

Read in depth:

Build:

  • A 6-layer GPT in <300 lines of PyTorch (or take Karpathy’s nanoGPT and modify it).
  • Train on TinyShakespeare. Generate convincing fake Shakespeare.
  • Replace LayerNorm with RMSNorm. Replace learned positional with RoPE. Add KV caching to the inference path. Verify each change doesn’t break quality.
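
The LayerNorm→RMSNorm swap is small enough to sanity-check by hand. A minimal plain-Python sketch of the RMSNorm formula (the PyTorch module version just vectorizes this over the last dimension); `g` is the learned per-dimension gain and `eps` the usual numerical-stability term:

```python
import math

def rms_norm(x, g, eps=1e-6):
    """RMSNorm: rescale x by the root-mean-square of its elements.

    Unlike LayerNorm there is no mean subtraction and no bias term,
    only a learned per-dimension gain g.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g_i * v / rms for g_i, v in zip(g, x)]

# With unit gain, the output has RMS ~= 1 regardless of input scale.
out = rms_norm([3.0, -4.0], g=[1.0, 1.0])
```

A good unit test for your nanoGPT change: feed the same activations through both norms and confirm output RMS is ~1 with RMSNorm while the loss curve stays comparable.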

Goal: “I can draw the transformer block on a whiteboard from memory and explain every component.”

Week 3 — Modern LLM landscape

Read:

Build:

  • Plot training-FLOPs vs validation loss for your nanoGPT at 3 sizes. Eyeball your own scaling law.
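
For the x-axis of that plot, the standard approximation from the scaling-law papers is ~6 FLOPs per parameter per token (2 for the forward pass, 4 for backward). A one-liner you can reuse across your three model sizes:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute: ~6 FLOPs per parameter per token
    (2 forward + 4 backward), the approximation used in the
    scaling-law literature. Ignores attention-vs-MLP details."""
    return 6.0 * n_params * n_tokens

# e.g. a 10M-parameter nanoGPT trained on 300M tokens:
flops = train_flops(10e6, 300e6)  # 1.8e16 FLOPs
```

Plot `train_flops` on a log x-axis against final validation loss; the three points should fall roughly on a line.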
  • Run the same hard prompt through a non-reasoning model and a reasoning model. Compare cost, latency, quality.

Goal: “I can articulate what’s different about a 2026 frontier LLM vs a 2022 GPT-3.”

Week 4 — Prompting and serving

Read:

Build:

  • Set up vLLM serving a Llama-3.x or Qwen3 model on your own GPU.
  • Compare tokens-per-second throughput to a hosted API serving the same model.
  • Implement prompt caching.
  • Implement basic two-tier model routing.
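
The routing tier can start as a cheap heuristic in front of two endpoints. A sketch, where the model names and the length/keyword heuristic are purely illustrative (in practice you might use a small classifier or the small model's own confidence):

```python
# Hypothetical two-tier router: a cheap heuristic decides whether a
# request goes to a small self-hosted model or a large hosted one.
HARD_MARKERS = ("prove", "step by step", "debug", "why")

def route(prompt: str) -> str:
    looks_hard = len(prompt) > 500 or any(
        m in prompt.lower() for m in HARD_MARKERS
    )
    return "large-hosted-model" if looks_hard else "small-local-model"

route("What's the capital of France?")  # -> "small-local-model"
route("Debug this race condition in my scheduler")  # -> "large-hosted-model"
```

Log the routing decision with every request so you can later measure how often the cheap tier would have sufficed.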

Goal: “I can self-host a model and serve it efficiently.”

Weeks 5–6 — RAG with depth

Read:

Build:

  • A production-grade RAG over a meaningful corpus (5k+ docs).
  • Hybrid retrieval + reranking.
  • Build a 100-query golden set.
  • Measure retrieval recall@10 and precision, plus answer faithfulness with an LLM-as-judge.
  • Add: HyDE or query decomposition (advanced retrieval). Compare metrics with/without.
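
For the hybrid-retrieval step, Reciprocal Rank Fusion is the usual way to merge a lexical (BM25) ranking with a dense ranking without tuning score weights. A self-contained sketch; `k=60` is the conventional default from the original RRF paper:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists (e.g. BM25 + dense)
    by summing 1 / (k + rank) for each document across lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_hits, dense_hits])
# Documents ranked highly by both lists ("d1", "d3") rise to the top.
```

Run your 100-query golden set through BM25-only, dense-only, and fused retrieval; recall@10 for the fused list is the number to beat before adding a reranker.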

Goal: “I can build a RAG that beats a baseline by a measurable margin and prove it.”

Weeks 7–8 — Fine-tuning, deeply

Read:

Build:

  • LoRA-fine-tune a 7B model on a 1k-example domain dataset (TRL or Axolotl).
  • Hold out a test set; eval against the base model.
  • DPO-fine-tune the same model on synthetic preference pairs.
  • (Optional) Embedding fine-tune for a domain retrieval task; measure recall improvement.
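
Part of "estimate compute" is knowing how few parameters LoRA actually trains: instead of updating a full d_out × d_in weight, it learns a d_out × r matrix B and an r × d_in matrix A. A quick calculation (the 4096 dimension and r=16 are illustrative, typical of 7B-scale attention projections):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: a d_out x r
    matrix B plus an r x d_in matrix A."""
    return r * (d_in + d_out)

full = 4096 * 4096                    # full update of one projection
lora = lora_params(4096, 4096, r=16)  # 131,072 params, ~0.8% of full
```

This is why LoRA fits on a single consumer GPU: optimizer states only need to be kept for the adapter parameters, not the frozen base weights.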

Goal: “I can pick a fine-tuning method given a use case, estimate compute, and execute.”

Week 9 — Agents

Read:

Build:

  • A multi-tool agent (search, read, code execution, your domain tools).
  • Build a verifier loop pattern (e.g. for code: edit → test → fix → repeat).
  • (Optional) Multi-agent: a planner agent + an executor agent.
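
The verifier loop reduces to a small control skeleton you can wrap around any model call. A sketch where `propose` stands in for the LLM call and `verify` for running tests; both stubs below are toy stand-ins, and the feedback string is what makes the loop converge:

```python
# Minimal verifier-loop skeleton: propose, check, feed errors back.
def verifier_loop(task, propose, verify, max_iters=5):
    feedback = None
    for _ in range(max_iters):
        candidate = propose(task, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
    raise RuntimeError("verifier budget exhausted")

# Toy demo: the "model" only answers correctly after seeing feedback.
def propose(task, feedback):
    return 42 if feedback else 0

def verify(candidate):
    return (candidate == 42, "expected 42")

result = verifier_loop("answer", propose, verify)  # -> 42 on iteration 2
```

The robustness work is all in `verify`: it should catch tool crashes and timeouts and turn them into feedback strings, rather than letting exceptions kill the loop.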

Goal: “I can build a verifier-looped agent that’s robust to tool failures.”

Week 10 — Multimodal

Read:

Build:

  • A cross-modal search: embed images and text in shared space; retrieve.
  • Use a VLM (Qwen-VL or Claude vision) for document QA on PDFs.
  • (Optional) LoRA-fine-tune a VLM for a domain task.
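
The cross-modal search reduces to nearest-neighbor lookup by cosine similarity once images and text share an embedding space (as with CLIP-style models). A toy sketch; the 2-D vectors and filenames are made up, and in practice the embeddings come from the VLM's encoders:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy image embeddings in the shared space.
images = {"cat.jpg": [0.9, 0.1], "car.jpg": [0.1, 0.9]}
query_emb = [0.8, 0.2]  # pretend embedding of "a photo of a cat"

best = max(images, key=lambda name: cosine(query_emb, images[name]))
# best == "cat.jpg"
```

At corpus scale you would swap the `max` over a dict for an ANN index (e.g. FAISS), but the retrieval logic is unchanged.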

Goal: “Multimodal isn’t scary anymore.”

Weeks 11–12 — Production discipline

Read:

Build:

  • Wire production-grade observability around your RAG and agent (Langfuse / Phoenix).
  • Add input + output guardrails (Llama Guard, schema validation, citation verification).
  • Build a regression eval pipeline that runs in CI.
  • Set cost / latency / quality alerts.
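
The CI regression gate can be as simple as comparing current eval metrics to a stored baseline and failing the build on any significant drop. A sketch; the metric names, baseline values, and tolerance are illustrative:

```python
# Hypothetical CI gate: fail if any metric regresses past tolerance.
BASELINE = {"recall_at_10": 0.82, "faithfulness": 0.90}
TOLERANCE = 0.02

def regression_gate(current: dict) -> list:
    """Return the metrics that dropped more than TOLERANCE below baseline."""
    return [m for m, base in BASELINE.items()
            if current.get(m, 0.0) < base - TOLERANCE]

failures = regression_gate({"recall_at_10": 0.83, "faithfulness": 0.85})
# failures == ["faithfulness"]  -> exit nonzero in CI
```

Commit the baseline alongside the eval set and update it deliberately, so a silent quality regression can never merge without a human acknowledging it.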

Goal: “I could put my system in front of real users tomorrow.”

Weeks 13–18 — Specialization + ship

Pick a deep-dive direction:

  • Reasoning RL: dive into GRPO, build a reasoning fine-tune for math or code with verifiable rewards.
  • Long context: implement YaRN or LongRoPE; train a long-context fine-tune.
  • MoE: build a small Mixtral-style model from scratch.
  • Inference optimization: contribute to vLLM / SGLang / TGI.
  • Multimodal frontier: train a small VLM end-to-end; reproduce a paper.
  • Agent eval: build domain-specific agent benchmarks; write an eval framework.
  • Vertical: pick legal / medical / finance / code; go deep.

Ship something public:

  • Reproduce a paper and write up.
  • Open-source a model fine-tune on HuggingFace with model card.
  • Contribute meaningfully to an open-source project.
  • Publish a series of technical blog posts.

Goal at end: people in your network know you as "the X person" for your chosen specialization.


Reading habits to build during the path

  • Two papers a week related to your specialization.
  • Run code from one paper a month.
  • Write up one learning a month publicly.

These compound far more than any single project.


Foundational papers to read along the way

In rough chronological order, read at least the abstract + figures of:

  • “Attention Is All You Need” (Vaswani 2017).
  • “BERT” (Devlin 2018).
  • “GPT-3 / Language Models are Few-Shot Learners” (Brown 2020).
  • “Scaling Laws for Neural Language Models” (Kaplan 2020).
  • “InstructGPT” (Ouyang 2022).
  • “Chain-of-Thought Prompting” (Wei 2022).
  • “Chinchilla / Training Compute-Optimal Large Language Models” (Hoffmann 2022).
  • “LLaMA / LLaMA-2” (Touvron 2023).
  • “DPO” (Rafailov 2023).
  • “Mamba” (Gu, Dao 2023) — state-space alternative.
  • “DeepSeek-V3 / R1” technical reports (2024–2025).
  • LLaMA-3, Qwen-3, Phi-4 reports.
  • Latest Anthropic / OpenAI / Google research blog posts.

You don’t need to read them all in detail. Skim, find the two or three that matter for your specialization, and read those deeply.


What “done with Track B” looks like

You can:

  • Reproduce a transformer training run from scratch.
  • Pick the right fine-tuning method, estimate compute and data, execute, evaluate.
  • Operate self-hosted inference at production quality.
  • Build agentic systems with verifier loops.
  • Read frontier papers and judge what matters.
  • Defend specific design choices against pushback.

From here, the next moves are:

  • Apply to applied research or research engineer roles at top labs or AI-native companies.
  • Contribute to open-source frontier infra (vLLM, sglang, llama.cpp, TRL, Axolotl).
  • Build domain-specific products with research-grade depth.

See also