step 01 · ship · foundations
Pick your base model
Llama-3 vs Qwen-2.5 vs Mistral vs Phi-3 — what differs, what doesn't, which one we'll use for the rest of the curriculum.
Search HuggingFace today and you'll find thousands of OSS LLMs. Most of them don't matter. About a dozen do — and within those, four families cover ~95% of what serious deployments use in 2026: Llama-3 (Meta), Qwen-2.5 (Alibaba), Mistral (Mistral AI), Phi-3 (Microsoft).
This article is the five-minute decision tree. By the end you’ll have picked the model the rest of the curriculum runs against.
Why this matters more than people admit
Three trade-offs the choice quietly determines:
- Your inference cost floor. Cost scales with active parameter count: a 70B model costs ~10× as much per token as a 7B model, whatever stack you serve it on (back-of-envelope sketch after this list). If your application has a margin floor, your model size has a ceiling.
- Your quality ceiling. Quality scales with parameter count too, though sub-linearly. You can RAG and tool-call a 7B model into shipping; you can’t do the same to a 1B model on most real tasks.
- Your ecosystem support. vLLM, llama.cpp, ONNX Runtime, transformers.js — each backend adds support for new models at its own pace. Llama-3 has every backend on day one. A new fine-tune of an obscure base might not have GGUF quantizations for a month. We'll feel this in step 14.
Pick the model thoughtfully. It’s the single decision that compounds across every later step.
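If the ~10× figure sounds hand-wavy, here's the standard rule of thumb behind it: dense decoding costs roughly 2 FLOPs per active weight per token, so per-token compute tracks the parameter ratio. A sketch — it deliberately ignores attention overhead, batching, and memory-bandwidth effects, so treat it as a floor estimate, not a price sheet:

```python
# Rule of thumb: one forward pass costs ~2 FLOPs per active parameter
# per token (one multiply + one add per weight). Real serving cost also
# depends on batching and memory bandwidth — this is a floor estimate.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

ratio = flops_per_token(70e9) / flops_per_token(7e9)
print(f"70B vs 7B: ~{ratio:.0f}x the compute per token")  # ~10x
```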
The four families
The 2026 leaderboard isn't that contested. These are the four families that show up:
Llama-3 (Meta)
The default choice. Sizes: 1B, 3B, 8B, 70B, 405B. Llama-3.1 added a 128K context window and improved tool use; Llama-3.2 added vision and edge-friendly small models.
- Strengths: broadest ecosystem, best tooling, very strong English-language quality
- Weaknesses: weaker on non-Latin languages than Qwen, slightly weaker on code than Qwen-Coder
- License: Llama Community License (mostly permissive but has revenue-cap clause for >700M MAU; you don’t care)
Qwen-2.5 (Alibaba)
The closest competitor at most scales. Sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B, plus specialized Qwen-Coder and Qwen-Math fine-tunes.
- Strengths: strongest non-English capability, best math/code in the OSS class, very large context (Qwen-2.5-7B-Instruct: 128K)
- Weaknesses: smaller community, fewer pre-built integrations, occasional Chinese-language artifacts in English output
- License: Apache 2.0 for most variants — the most permissive
Mistral
Mistral has shipped bases at a steady cadence: Mistral 7B, Mixtral 8×7B (MoE), Mistral Nemo (12B), Mistral Large. Slightly behind Llama and Qwen on raw benchmarks now, but Mixtral's MoE architecture is the cheapest path to "70B-class quality at 13B-class active params" (arithmetic sketch below).
- Strengths: Mixtral MoE inference economics, strong European-language quality
- Weaknesses: model release pace has slowed in 2025–2026, some recent releases are API-only (not OSS)
- License: Apache 2.0 for the OSS variants
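Where the "13B-class active params" figure comes from: every token runs the shared attention and embedding weights, but the router activates only 2 of Mixtral's 8 expert FFNs per layer. The split below uses approximate, publicly quoted numbers for Mixtral-8×7B (~47B stored, ~13B active) — treat them as illustrative, not exact:

```python
# Approximate Mixtral-8x7B parameter split (rough numbers, for intuition):
shared_params  = 1.3e9   # attention, embeddings, norms — run for every token
expert_params  = 5.7e9   # one expert's FFN stack, summed across layers
n_experts      = 8
active_experts = 2       # top-2 routing per token

stored = shared_params + n_experts * expert_params        # what VRAM must hold
active = shared_params + active_experts * expert_params   # what each token pays

print(f"stored: ~{stored/1e9:.0f}B, active per token: ~{active/1e9:.0f}B")
# -> stored: ~47B, active per token: ~13B
```

So you pay ~47B-class memory but only ~13B-class compute per token — that's the inference-economics pitch.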
Phi-3 (Microsoft)
The “small but punching above weight” family. Phi-3-mini (3.8B), Phi-3-small (7B), Phi-3-medium (14B). Trained on synthetic high-quality data; benchmarks above their weight class but with a noticeable “training data was generated by GPT-4” feel.
- Strengths: best quality per parameter at small scales, fits in resource-constrained environments
- Weaknesses: less broad knowledge than equivalent-sized Llama/Qwen, output style can feel synthetic
- License: MIT
The four axes you actually decide on
Forget the leaderboards. Pick by what you’re actually optimizing for:
1. Hardware budget
What model fits in the memory you have, with the quantization you can tolerate?
| RAM / VRAM | Comfortable | Tight fit | Don’t try |
|---|---|---|---|
| 8 GB | 1B–3B | 7B (Q4) | 13B+ |
| 16 GB | 7B–8B | 13B (Q4) | 30B+ |
| 24 GB (RTX 4090) | 13B | 30B (Q4) | 70B |
| 48 GB (A6000) | 30B | 70B (Q4) | 405B |
| 80 GB+ (A100/H100) | 70B | 70B (FP16) | 405B (no quant) |
“Q4” means 4-bit quantization. We covered why quantization works in /build step 14; the Quantization Lab demo is the visual primer. Q4_K_M is the production default — minimal quality loss, ~4× memory savings.
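You can reproduce the table above with back-of-envelope math: weights take roughly params × bits-per-weight ÷ 8 bytes, plus runtime overhead. In the sketch below, the ~4.5 effective bits for Q4_K_M (it mixes 4- and 6-bit blocks) and the 1.2× overhead factor for KV cache and buffers are loose assumptions, not measured constants:

```python
# Rough VRAM estimate: weights plus a fudge factor for KV cache and buffers.
def est_vram_gb(params_billions: float, bits_per_weight: float,
                overhead: float = 1.2) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

print(f"8B  @ Q4_K_M: ~{est_vram_gb(8, 4.5):.1f} GB")   # ~5.4 GB — fine on 16 GB
print(f"8B  @ FP16:   ~{est_vram_gb(8, 16):.1f} GB")    # ~19 GB — not on 16 GB
print(f"70B @ Q4_K_M: ~{est_vram_gb(70, 4.5):.1f} GB")  # ~47 GB — the A6000 tight fit
```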
2. License compatibility
If you’re shipping a commercial product, read the license. The landscape:
- Apache 2.0 / MIT (Qwen-2.5, Mistral OSS, Phi-3): no restrictions
- Llama Community License (Llama-3): unrestricted unless you have >700M MAU. You don’t.
- Custom non-commercial (some research models): can’t ship in a paid product
If your lawyers care, default to Qwen or Mistral. If you don’t have lawyers, Llama is fine.
3. Language and domain
- English only, general use: Llama-3-8B is the default-default
- Multilingual (non-Latin scripts): Qwen-2.5 is the right choice
- Code-heavy task: Qwen-Coder-2.5 (or DeepSeek-Coder; not in the four families above but a strong pick)
- Math-heavy task: Qwen-2.5-Math
- Edge / mobile: Phi-3-mini or Llama-3.2-1B
4. Ecosystem maturity
How well-supported is the model across the tools you’ll use? Specifically:
- vLLM support (matters for step 03): Llama, Qwen, Mistral all day-one. Phi-3 supported but lags.
- GGUF quantization (matters if you’re staying on llama.cpp/Ollama): same as above
- Tool-calling fine-tunes: Llama-3.1 has the best community support; Qwen-2.5 close behind
- HuggingFace community examples: Llama dominates, with 5–10× more examples per model (a rough way to probe this yourself follows this list)
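One crude but useful probe of ecosystem maturity: count how many Hub repos mention a base model plus "GGUF". This sketch uses huggingface_hub's public search; the counts are a proxy (repo naming conventions vary), and the query strings are assumptions, not canonical names:

```python
# Count GGUF repos per family as a rough ecosystem-maturity proxy.
from huggingface_hub import HfApi

api = HfApi()
for family in ["Llama-3.1-8B", "Qwen2.5-7B", "Mistral-7B", "Phi-3-mini"]:
    repos = list(api.list_models(search=f"{family} GGUF", limit=100))
    print(f"{family}: {len(repos)} GGUF repos found (capped at 100)")
```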
The decision matrix
| You want… | Pick |
|---|---|
| Default for the rest of this curriculum | Llama-3.1-8B-Instruct |
| Smallest viable model on a laptop | Llama-3.2-3B or Qwen-2.5-3B |
| Best quality at 7B scale, multilingual | Qwen-2.5-7B-Instruct |
| MoE / efficient large model | Mixtral-8×7B-Instruct |
| Code-specialized | Qwen-Coder-2.5-7B |
| Math-specialized | Qwen-2.5-Math-7B |
| Apache-2.0 license required | Qwen-2.5 or Mistral OSS |
| Most permissive small model | Phi-3-mini |
What this curriculum uses
We pick Llama-3.1-8B-Instruct at Q4_K_M quantization for the rest of /ship. Three reasons:
- It fits comfortably on a 16 GB laptop at Q4. Most readers have one.
- It has the broadest tooling — vLLM, Ollama, llama.cpp, transformers.js all support it on day one of release.
- Its quality is “good enough to ship” at this size for most non-trivial tasks. Smaller (3B, mini) sometimes works; larger is overkill for a tutorial.
Wherever this curriculum says "Llama-3.1-8B" you can substitute Qwen-2.5-7B-Instruct and the steps work unchanged. The OpenAI-compatible API in step 02 means the model the inference engine serves never leaks into your application code — it's one string in one config (sketch below). Pick the model once, then forget about it for ten steps.
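What "never leaks into your application code" looks like in practice — a minimal sketch against Ollama's OpenAI-compatible endpoint. The base_url, placeholder API key, and model tags below assume Ollama defaults; vLLM exposes the same interface on its own port:

```python
from openai import OpenAI

# Ollama ignores the API key, but the client library requires one.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

MODEL = "llama3.1:8b"  # swap to "qwen2.5:7b" — nothing else changes

resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "In one sentence: why pick 8B over 70B?"}],
)
print(resp.choices[0].message.content)
```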
Common pitfalls when picking
Choosing by raw benchmark score. MMLU and HellaSwag scores correlate with quality but don’t determine it. A model that scores 1% higher on MMLU but is twice as expensive to serve is a worse choice for most production tasks. Pick by use-case fit, not leaderboard rank.
Choosing the latest “frontier” OSS without checking ecosystem support. A model released last week often doesn’t have GGUF, AWQ, or vLLM support yet. Wait two weeks; the ecosystem catches up.
Choosing too large. Most production tasks don’t need 70B. RAG + a 7B model often beats a 70B without RAG, at 10% the cost. We’ll see this in step 06.
Choosing too small to fit your task. A 1B model can’t reliably follow complex instructions. If your task involves multi-step reasoning, tool use, or careful constraint-following, the floor is 7B. Don’t try to make a 1B model do agent work.
Cross-references
- Scaling Laws Calculator demo — Chinchilla optimal token counts at each parameter size; gives you a sense of what “well-trained” looks like for each size class
- Quantization Lab demo — what Q4 actually does to weight distributions; useful for understanding why Q4_K_M is the production default
- Tokenizer Surgery demo — different model families have different vocab sizes, which affects how many tokens your prompts cost (sketch below)
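To see the vocab-size effect directly, tokenize the same prompt with two families' tokenizers. A sketch using transformers — the repo IDs are the usual Hub ones (the Llama repos are gated, so accept the license and log in first), and exact counts will vary by prompt:

```python
from transformers import AutoTokenizer

prompt = "Summarize the quarterly report and list three action items."

for repo in ["meta-llama/Llama-3.1-8B-Instruct", "Qwen/Qwen2.5-7B-Instruct"]:
    tok = AutoTokenizer.from_pretrained(repo)
    n = len(tok(prompt)["input_ids"])
    print(f"{repo}: vocab_size={tok.vocab_size}, prompt tokens={n}")
```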
Next
Step 02 — which you may have already read — pulls Llama-3.1-8B with Ollama, gets you token streaming, and writes the first piece of your Python client. After that we graduate to vLLM (step 03) for production-grade throughput.