Track B — ML Engineer → LLM Specialist

For someone with classical ML / deep-learning experience who wants to specialize in LLMs and modern AI engineering. You know what backprop is. You’ve trained CNNs or RNNs. You can read papers. What you need is the LLM-specific stack — architecturally, operationally, and culturally.

Time: 12–18 weeks at ~10 hours/week. Endpoint: you can train, fine-tune, deploy, and operate LLM-based systems at production quality, and contribute meaningfully on a research-leaning team.


Skim, don’t re-read

You probably already know most of:

  • Stage 1 (math foundations) — quick refresh on cross-entropy, KL divergence.
  • Stage 2 (ML fundamentals) — confirm modern LLM eval differences (LLM-as-judge, perplexity, etc.).
  • Stage 3 (neural networks) — confirm modern optimizer defaults (AdamW, warmup + cosine).

Skim the READMEs and exercises; spot-check anything that’s been a while.


Week-by-week

Weeks 1–2 — From RNNs to transformers, mechanically

Read in depth:

Build:

  • A 6-layer GPT in <300 lines of PyTorch (or take Karpathy’s nanoGPT and modify it).
  • Train on TinyShakespeare. Generate convincing fake Shakespeare.
  • Replace LayerNorm with RMSNorm. Replace learned positional with RoPE. Add KV caching to the inference path. Verify each change doesn’t break quality.
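
The LayerNorm→RMSNorm swap is small enough to sanity-check by hand. A minimal plain-Python sketch of the RMSNorm formula (the PyTorch module version just vectorizes this over the last dimension); `g` is the learned per-dimension gain and `eps` the usual numerical-stability term:

```python
import math

def rms_norm(x, g, eps=1e-6):
    """RMSNorm: rescale x by the root-mean-square of its elements.

    Unlike LayerNorm there is no mean subtraction and no bias term,
    only a learned per-dimension gain g.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g_i * v / rms for g_i, v in zip(g, x)]

# With unit gain, the output has RMS ~= 1 regardless of input scale.
out = rms_norm([3.0, -4.0], g=[1.0, 1.0])
```

A good unit test for your nanoGPT change: feed the same activations through both norms and confirm output RMS is ~1 with RMSNorm while the loss curve stays comparable.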

Goal: “I can draw the transformer block on a whiteboard from memory and explain every component.”

Week 3 — Modern LLM landscape

Read:

Build:

  • Plot training-FLOPs vs validation loss for your nanoGPT at 3 sizes. Eyeball your own scaling law.
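
For the x-axis of that plot, the standard approximation from the scaling-law papers is ~6 FLOPs per parameter per token (2 for the forward pass, 4 for backward). A one-liner you can reuse across your three model sizes:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute: ~6 FLOPs per parameter per token
    (2 forward + 4 backward), the approximation used in the
    scaling-law literature. Ignores attention-vs-MLP details."""
    return 6.0 * n_params * n_tokens

# e.g. a 10M-parameter nanoGPT trained on 300M tokens:
flops = train_flops(10e6, 300e6)  # 1.8e16 FLOPs
```

Plot `train_flops` on a log x-axis against final validation loss; the three points should fall roughly on a line.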
  • Run the same hard prompt through a non-reasoning model and a reasoning model. Compare cost, latency, quality.

Goal: “I can articulate what’s different about a 2026 frontier LLM vs a 2022 GPT-3.”

Week 4 — Prompting and serving

Read:

Build:

  • Set up vLLM serving a Llama-3.x or Qwen3 model on your own GPU.
  • Compare tokens-per-second throughput to a hosted API serving the same model.
  • Implement prompt caching.
  • Implement basic two-tier model routing.
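
The routing tier can start as a cheap heuristic in front of two endpoints. A sketch, where the model names and the length/keyword heuristic are purely illustrative (in practice you might use a small classifier or the small model's own confidence):

```python
# Hypothetical two-tier router: a cheap heuristic decides whether a
# request goes to a small self-hosted model or a large hosted one.
HARD_MARKERS = ("prove", "step by step", "debug", "why")

def route(prompt: str) -> str:
    looks_hard = len(prompt) > 500 or any(
        m in prompt.lower() for m in HARD_MARKERS
    )
    return "large-hosted-model" if looks_hard else "small-local-model"

route("What's the capital of France?")  # -> "small-local-model"
route("Debug this race condition in my scheduler")  # -> "large-hosted-model"
```

Log the routing decision with every request so you can later measure how often the cheap tier would have sufficed.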

Goal: “I can self-host a model and serve it efficiently.”

Weeks 5–6 — RAG with depth

Read:

Build:

  • A production-grade RAG over a meaningful corpus (5k+ docs).
  • Hybrid retrieval + reranking.
  • Build a 100-query golden set.
  • Measure retrieval recall@10 and precision, plus answer faithfulness with an LLM-as-judge.
  • Add: HyDE or query decomposition (advanced retrieval). Compare metrics with/without.
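
For the hybrid-retrieval step, Reciprocal Rank Fusion is the usual way to merge a lexical (BM25) ranking with a dense ranking without tuning score weights. A self-contained sketch; `k=60` is the conventional default from the original RRF paper:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists (e.g. BM25 + dense)
    by summing 1 / (k + rank) for each document across lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_hits, dense_hits])
# Documents ranked highly by both lists ("d1", "d3") rise to the top.
```

Run your 100-query golden set through BM25-only, dense-only, and fused retrieval; recall@10 for the fused list is the number to beat before adding a reranker.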

Goal: “I can build a RAG that beats a baseline by a measurable margin and prove it.”

Weeks 7–8 — Fine-tuning, deeply

Read:

Build:

  • LoRA-fine-tune a 7B model on a 1k-example domain dataset (TRL or Axolotl).
  • Hold out a test set; eval against the base model.
  • DPO-fine-tune the same model on synthetic preference pairs.
  • (Optional) Embedding fine-tune for a domain retrieval task; measure recall improvement.
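
Part of "estimate compute" is knowing how few parameters LoRA actually trains: instead of updating a full d_out × d_in weight, it learns a d_out × r matrix B and an r × d_in matrix A. A quick calculation (the 4096 dimension and r=16 are illustrative, typical of 7B-scale attention projections):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: a d_out x r
    matrix B plus an r x d_in matrix A."""
    return r * (d_in + d_out)

full = 4096 * 4096                    # full update of one projection
lora = lora_params(4096, 4096, r=16)  # 131,072 params, ~0.8% of full
```

This is why LoRA fits on a single consumer GPU: optimizer states only need to be kept for the adapter parameters, not the frozen base weights.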

Goal: “I can pick a fine-tuning method given a use case, estimate compute, and execute.”

Week 9 — Agents

Read:

Build:

  • A multi-tool agent (search, read, code execution, your domain tools).
  • Build a verifier loop pattern (e.g. for code: edit → test → fix → repeat).
  • (Optional) Multi-agent: a planner agent + an executor agent.
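
The verifier loop reduces to a small control skeleton you can wrap around any model call. A sketch where `propose` stands in for the LLM call and `verify` for running tests; both stubs below are toy stand-ins, and the feedback string is what makes the loop converge:

```python
# Minimal verifier-loop skeleton: propose, check, feed errors back.
def verifier_loop(task, propose, verify, max_iters=5):
    feedback = None
    for _ in range(max_iters):
        candidate = propose(task, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
    raise RuntimeError("verifier budget exhausted")

# Toy demo: the "model" only answers correctly after seeing feedback.
def propose(task, feedback):
    return 42 if feedback else 0

def verify(candidate):
    return (candidate == 42, "expected 42")

result = verifier_loop("answer", propose, verify)  # -> 42 on iteration 2
```

The robustness work is all in `verify`: it should catch tool crashes and timeouts and turn them into feedback strings, rather than letting exceptions kill the loop.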

Goal: “I can build a verifier-looped agent that’s robust to tool failures.”

Week 10 — Multimodal

Read:

Build:

  • A cross-modal search: embed images and text in shared space; retrieve.
  • Use a VLM (Qwen-VL or Claude vision) for document QA on PDFs.
  • (Optional) LoRA-fine-tune a VLM for a domain task.
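
The cross-modal search reduces to nearest-neighbor lookup by cosine similarity once images and text share an embedding space (as with CLIP-style models). A toy sketch; the 2-D vectors and filenames are made up, and in practice the embeddings come from the VLM's encoders:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy image embeddings in the shared space.
images = {"cat.jpg": [0.9, 0.1], "car.jpg": [0.1, 0.9]}
query_emb = [0.8, 0.2]  # pretend embedding of "a photo of a cat"

best = max(images, key=lambda name: cosine(query_emb, images[name]))
# best == "cat.jpg"
```

At corpus scale you would swap the `max` over a dict for an ANN index (e.g. FAISS), but the retrieval logic is unchanged.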

Goal: “Multimodal isn’t scary anymore.”

Weeks 11–12 — Production discipline

Read:

Build:

  • Wire production-grade observability around your RAG and agent (Langfuse / Phoenix).
  • Add input + output guardrails (Llama Guard, schema validation, citation verification).
  • Build a regression eval pipeline that runs in CI.
  • Set cost / latency / quality alerts.
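
The CI regression gate can be as simple as comparing current eval metrics to a stored baseline and failing the build on any significant drop. A sketch; the metric names, baseline values, and tolerance are illustrative:

```python
# Hypothetical CI gate: fail if any metric regresses past tolerance.
BASELINE = {"recall_at_10": 0.82, "faithfulness": 0.90}
TOLERANCE = 0.02

def regression_gate(current: dict) -> list:
    """Return the metrics that dropped more than TOLERANCE below baseline."""
    return [m for m, base in BASELINE.items()
            if current.get(m, 0.0) < base - TOLERANCE]

failures = regression_gate({"recall_at_10": 0.83, "faithfulness": 0.85})
# failures == ["faithfulness"]  -> exit nonzero in CI
```

Commit the baseline alongside the eval set and update it deliberately, so a silent quality regression can never merge without a human acknowledging it.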

Goal: “I could put my system in front of real users tomorrow.”

Weeks 13–18 — Specialization + ship

Pick a deep-dive direction:

  • Reasoning RL: dive into GRPO, build a reasoning fine-tune for math or code with verifiable rewards.
  • Long context: implement YaRN or LongRoPE; train a long-context fine-tune.
  • MoE: build a small Mixtral-style model from scratch.
  • Inference optimization: contribute to vLLM / SGLang / TGI.
  • Multimodal frontier: train a small VLM end-to-end; reproduce a paper.
  • Agent eval: build domain-specific agent benchmarks; write an eval framework.
  • Vertical: pick legal / medical / finance / code; go deep.

Ship something public:

  • Reproduce a paper and write up.
  • Open-source a model fine-tune on HuggingFace with model card.
  • Contribute meaningfully to an open-source project.
  • Publish a series of technical blog posts.

Goal at end: people in your network know you as "the X person" for your chosen specialization.


Reading habits to build during the path

  • Two papers a week related to your specialization.
  • Run code from one paper a month.
  • Write up one learning a month publicly.

These compound far more than any single project.


Foundational papers to read along the way

In rough chronological order, read at least the abstract + figures of:

  • “Attention Is All You Need” (Vaswani 2017).
  • “BERT” (Devlin 2018).
  • “GPT-3 / Language Models are Few-Shot Learners” (Brown 2020).
  • “Scaling Laws for Neural Language Models” (Kaplan 2020).
  • “InstructGPT” (Ouyang 2022).
  • “Chain-of-Thought Prompting” (Wei 2022).
  • “Chinchilla / Training Compute-Optimal Large Language Models” (Hoffmann 2022).
  • “LLaMA / LLaMA-2” (Touvron 2023).
  • “DPO” (Rafailov 2023).
  • “Mamba” (Gu, Dao 2023) — state-space alternative.
  • “DeepSeek-V3 / R1” technical reports (2024–2025).
  • LLaMA-3, Qwen-3, Phi-4 reports.
  • Latest Anthropic / OpenAI / Google research blog posts.

You don’t need to read them all in detail. Skim, find the two or three that matter for your specialization, and read those deeply.


What “done with Track B” looks like

You can:

  • Reproduce a transformer training run from scratch.
  • Pick the right fine-tuning method, estimate compute and data, execute, evaluate.
  • Operate self-hosted inference at production quality.
  • Build agentic systems with verifier loops.
  • Read frontier papers and judge what matters.
  • Defend specific design choices against pushback.

From here, the next moves are:

  • Apply to applied research or research engineer roles at top labs or AI-native companies.
  • Contribute to open-source frontier infra (vLLM, sglang, llama.cpp, TRL, Axolotl).
  • Build domain-specific products with research-grade depth.

See also