step 01 · ship · foundations
Pick your base model
Llama-3 vs Qwen-2.5 vs Mistral vs Phi-3 — what differs, what doesn't, which one we'll use for the rest of the curriculum.
Search HuggingFace today and you'll find thousands of OSS LLMs. Most of them don't matter. About a dozen do — and within those, four families cover ~95% of what serious deployments use in 2026: Llama-3 (Meta), Qwen-2.5 (Alibaba), Mistral (Mistral AI), Phi-3 (Microsoft).
This article is the five-minute decision tree. By the end you’ll have picked the model the rest of the curriculum runs against.
Why this matters more than people admit
Three trade-offs the choice quietly determines:
- Your inference cost floor. Cost scales with active parameter count: a 70B model costs ~10× as much per token as a 7B model, whatever stack you serve it on (back-of-envelope sketch after this list). If your application has a margin floor, your model size has a ceiling.
- Your quality ceiling. Quality scales with parameter count too, though sub-linearly. You can RAG and tool-call a 7B model into shipping; you can’t do the same to a 1B model on most real tasks.
- Your ecosystem support. vLLM, llama.cpp, ONNX Runtime, transformers.js — each backend adds support for new models at its own pace. Llama-3 has every backend on day one. A new fine-tune of an obscure base might not have GGUF quantizations for a month. We'll feel this in step 14.
Pick the model thoughtfully. It’s the single decision that compounds across every later step.
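If the ~10× figure sounds hand-wavy, here's the standard rule of thumb behind it: dense decoding costs roughly 2 FLOPs per active weight per token, so per-token compute tracks the parameter ratio. A sketch — it deliberately ignores attention overhead, batching, and memory-bandwidth effects, so treat it as a floor estimate, not a price sheet:

```python
# Rule of thumb: one forward pass costs ~2 FLOPs per active parameter
# per token (one multiply + one add per weight). Real serving cost also
# depends on batching and memory bandwidth — this is a floor estimate.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

ratio = flops_per_token(70e9) / flops_per_token(7e9)
print(f"70B vs 7B: ~{ratio:.0f}x the compute per token")  # ~10x
```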
The four families
The 2026 leaderboard isn't that contested. These are the four families that show up:
Llama-3 (Meta)
The default choice. Sizes: 1B, 3B, 8B, 70B, 405B. Llama-3.1 added a 128K context window and improved tool use; Llama-3.2 added vision and edge-friendly small models.
- Strengths: broadest ecosystem, best tooling, very strong English-language quality
- Weaknesses: weaker on non-Latin languages than Qwen, slightly weaker on code than Qwen-Coder
- License: Llama Community License (mostly permissive but has revenue-cap clause for >700M MAU; you don’t care)
Qwen-2.5 (Alibaba)
The closest competitor at most scales. Sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B, plus specialized Qwen-Coder and Qwen-Math fine-tunes.
- Strengths: strongest non-English capability, best math/code in the OSS class, very large context (Qwen-2.5-7B-Instruct: 128K)
- Weaknesses: smaller community, fewer pre-built integrations, occasional Chinese-language artifacts in English output
- License: Apache 2.0 for most variants — the most permissive
Mistral
Mistral has shipped bases at a steady cadence: Mistral 7B, Mixtral 8×7B (MoE), Mistral Nemo (12B), Mistral Large. Slightly behind Llama and Qwen on raw benchmarks now, but Mixtral's MoE architecture is the cheapest path to "70B-class quality at 13B-class active params" (arithmetic sketch below).
- Strengths: Mixtral MoE inference economics, strong European-language quality
- Weaknesses: model release pace has slowed in 2025–2026, some recent releases are API-only (not OSS)
- License: Apache 2.0 for the OSS variants
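Where the "13B-class active params" figure comes from: every token runs the shared attention and embedding weights, but the router activates only 2 of Mixtral's 8 expert FFNs per layer. The split below uses approximate, publicly quoted numbers for Mixtral-8×7B (~47B stored, ~13B active) — treat them as illustrative, not exact:

```python
# Approximate Mixtral-8x7B parameter split (rough numbers, for intuition):
shared_params  = 1.3e9   # attention, embeddings, norms — run for every token
expert_params  = 5.7e9   # one expert's FFN stack, summed across layers
n_experts      = 8
active_experts = 2       # top-2 routing per token

stored = shared_params + n_experts * expert_params        # what VRAM must hold
active = shared_params + active_experts * expert_params   # what each token pays

print(f"stored: ~{stored/1e9:.0f}B, active per token: ~{active/1e9:.0f}B")
# -> stored: ~47B, active per token: ~13B
```

So you pay ~47B-class memory but only ~13B-class compute per token — that's the inference-economics pitch.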
Phi-3 (Microsoft)
The “small but punching above weight” family. Phi-3-mini (3.8B), Phi-3-small (7B), Phi-3-medium (14B). Trained on synthetic high-quality data; benchmarks above their weight class but with a noticeable “training data was generated by GPT-4” feel.
- Strengths: best quality per parameter at small scales, fits in resource-constrained environments
- Weaknesses: less broad knowledge than equivalent-sized Llama/Qwen, output style can feel synthetic
- License: MIT
The four axes you actually decide on
Forget the leaderboards. Pick by what you’re actually optimizing for:
1. Hardware budget
What model fits in the memory you have, with the quantization you can tolerate?
| RAM / VRAM | Comfortable | Tight fit | Don’t try |
|---|---|---|---|
| 8 GB | 1B–3B | 7B (Q4) | 13B+ |
| 16 GB | 7B–8B | 13B (Q4) | 30B+ |
| 24 GB (RTX 4090) | 13B | 30B (Q4) | 70B |
| 48 GB (A6000) | 30B | 70B (Q4) | 405B |
| 80 GB+ (A100/H100) | 70B | 70B (FP16) | 405B (no quant) |
“Q4” means 4-bit quantization. We covered why quantization works in /build step 14; the Quantization Lab demo is the visual primer. Q4_K_M is the production default — minimal quality loss, ~4× memory savings.
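You can reproduce the table above with back-of-envelope math: weights take roughly params × bits-per-weight ÷ 8 bytes, plus runtime overhead. In the sketch below, the ~4.5 effective bits for Q4_K_M (it mixes 4- and 6-bit blocks) and the 1.2× overhead factor for KV cache and buffers are loose assumptions, not measured constants:

```python
# Rough VRAM estimate: weights plus a fudge factor for KV cache and buffers.
def est_vram_gb(params_billions: float, bits_per_weight: float,
                overhead: float = 1.2) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

print(f"8B  @ Q4_K_M: ~{est_vram_gb(8, 4.5):.1f} GB")   # ~5.4 GB — fine on 16 GB
print(f"8B  @ FP16:   ~{est_vram_gb(8, 16):.1f} GB")    # ~19 GB — not on 16 GB
print(f"70B @ Q4_K_M: ~{est_vram_gb(70, 4.5):.1f} GB")  # ~47 GB — the A6000 tight fit
```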
2. License compatibility
If you’re shipping a commercial product, read the license. The landscape:
- Apache 2.0 / MIT (Qwen-2.5, Mistral OSS, Phi-3): no restrictions
- Llama Community License (Llama-3): unrestricted unless you have >700M MAU. You don’t.
- Custom non-commercial (some research models): can’t ship in a paid product
If your lawyers care, default to Qwen or Mistral. If you don’t have lawyers, Llama is fine.
3. Language and domain
- English only, general use: Llama-3-8B is the default-default
- Multilingual (non-Latin scripts): Qwen-2.5 is the right choice
- Code-heavy task: Qwen-Coder-2.5 (or DeepSeek-Coder; not in the four families above but a strong pick)
- Math-heavy task: Qwen-2.5-Math
- Edge / mobile: Phi-3-mini or Llama-3.2-1B
4. Ecosystem maturity
How well-supported is the model across the tools you’ll use? Specifically:
- vLLM support (matters for step 03): Llama, Qwen, Mistral all day-one. Phi-3 supported but lags.
- GGUF quantization (matters if you’re staying on llama.cpp/Ollama): same as above
- Tool-calling fine-tunes: Llama-3.1 has the best community support; Qwen-2.5 close behind
- HuggingFace community examples: Llama dominates, with 5–10× more examples per model (a rough way to probe this yourself follows this list)
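One crude but useful probe of ecosystem maturity: count how many Hub repos mention a base model plus "GGUF". This sketch uses huggingface_hub's public search; the counts are a proxy (repo naming conventions vary), and the query strings are assumptions, not canonical names:

```python
# Count GGUF repos per family as a rough ecosystem-maturity proxy.
from huggingface_hub import HfApi

api = HfApi()
for family in ["Llama-3.1-8B", "Qwen2.5-7B", "Mistral-7B", "Phi-3-mini"]:
    repos = list(api.list_models(search=f"{family} GGUF", limit=100))
    print(f"{family}: {len(repos)} GGUF repos found (capped at 100)")
```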
The decision matrix
| You want… | Pick |
|---|---|
| Default for the rest of this curriculum | Llama-3.1-8B-Instruct |
| Smallest viable model on a laptop | Llama-3.2-3B or Qwen-2.5-3B |
| Best quality at 7B scale, multilingual | Qwen-2.5-7B-Instruct |
| MoE / efficient large model | Mixtral-8×7B-Instruct |
| Code-specialized | Qwen-Coder-2.5-7B |
| Math-specialized | Qwen-2.5-Math-7B |
| Apache-2.0 license required | Qwen-2.5 or Mistral OSS |
| Most permissive small model | Phi-3-mini |
What this curriculum uses
We pick Llama-3.1-8B-Instruct at Q4_K_M quantization for the rest of /ship. Three reasons:
- It fits comfortably on a 16 GB laptop at Q4. Most readers have one.
- It has the broadest tooling — vLLM, Ollama, llama.cpp, transformers.js all support it on day one of release.
- Its quality is “good enough to ship” at this size for most non-trivial tasks. Smaller (3B, mini) sometimes works; larger is overkill for a tutorial.
Wherever this curriculum says "Llama-3.1-8B" you can substitute Qwen-2.5-7B-Instruct and the steps work unchanged. The OpenAI-compatible API in step 02 means the model the inference engine serves never leaks into your application code — it's one string in one config (sketch below). Pick the model once, then forget about it for ten steps.
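What "never leaks into your application code" looks like in practice — a minimal sketch against Ollama's OpenAI-compatible endpoint. The base_url, placeholder API key, and model tags below assume Ollama defaults; vLLM exposes the same interface on its own port:

```python
from openai import OpenAI

# Ollama ignores the API key, but the client library requires one.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

MODEL = "llama3.1:8b"  # swap to "qwen2.5:7b" — nothing else changes

resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "In one sentence: why pick 8B over 70B?"}],
)
print(resp.choices[0].message.content)
```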
Common pitfalls when picking
Choosing by raw benchmark score. MMLU and HellaSwag scores correlate with quality but don’t determine it. A model that scores 1% higher on MMLU but is twice as expensive to serve is a worse choice for most production tasks. Pick by use-case fit, not leaderboard rank.
Choosing the latest “frontier” OSS without checking ecosystem support. A model released last week often doesn’t have GGUF, AWQ, or vLLM support yet. Wait two weeks; the ecosystem catches up.
Choosing too large. Most production tasks don’t need 70B. RAG + a 7B model often beats a 70B without RAG, at 10% the cost. We’ll see this in step 06.
Choosing too small to fit your task. A 1B model can’t reliably follow complex instructions. If your task involves multi-step reasoning, tool use, or careful constraint-following, the floor is 7B. Don’t try to make a 1B model do agent work.
Cross-references
- Scaling Laws Calculator demo — Chinchilla optimal token counts at each parameter size; gives you a sense of what “well-trained” looks like for each size class
- Quantization Lab demo — what Q4 actually does to weight distributions; useful for understanding why Q4_K_M is the production default
- Tokenizer Surgery demo — different model families have different vocab sizes, which affects how many tokens your prompts cost (sketch below)
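To see the vocab-size effect directly, tokenize the same prompt with two families' tokenizers. A sketch using transformers — the repo IDs are the usual Hub ones (the Llama repos are gated, so accept the license and log in first), and exact counts will vary by prompt:

```python
from transformers import AutoTokenizer

prompt = "Summarize the quarterly report and list three action items."

for repo in ["meta-llama/Llama-3.1-8B-Instruct", "Qwen/Qwen2.5-7B-Instruct"]:
    tok = AutoTokenizer.from_pretrained(repo)
    n = len(tok(prompt)["input_ids"])
    print(f"{repo}: vocab_size={tok.vocab_size}, prompt tokens={n}")
```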
Next
Step 02 — which you may have already read — pulls Llama-3.1-8B with Ollama, gets you token streaming, and writes the first piece of your Python client. After that we graduate to vLLM (step 03) for production-grade throughput.