production-stack foundations · 02 / 17 · 18 min read · 25 min hands-on

step 02 · ship · foundations

Run a model locally with Ollama

From zero to your first generation in five minutes. Then we'll graduate to vLLM in step 03.

model inference

Ollama is the easiest way to run an OSS LLM locally. You install it once, pull a model, and you have an OpenAI-compatible API listening on your machine. It’s not the fastest way to serve models — vLLM beats it for throughput and is what we’ll switch to in step 03 — but it’s the fastest way to get tokens out today.

By the end of this step, you’ll have a working model running on your laptop, a streaming endpoint, and a Python client we’ll grow into the real service across the rest of foundations.

Why Ollama as the easy path

Three properties make Ollama the right “first model running” tool:

  1. One command to install. curl -fsSL https://ollama.com/install.sh | sh on Linux; macOS and Windows get one-click installers. It bundles llama.cpp (the C++ inference engine), GGUF quantized weights, and a small HTTP server in one daemon.

  2. OpenAI-compatible API by default. Ollama exposes /v1/chat/completions matching the OpenAI schema, so everything you write today against Ollama works, unmodified, against vLLM, against OpenAI, against Anthropic via a gateway, against any provider that respects that schema. Your application code doesn’t change when you swap inference engines. This is the single most important reason to start here; a short sketch after this list shows it in practice.

  3. Sane defaults. It picks a reasonable quantization (Q4_K_M for most models — the balance between size and quality), allocates GPU memory if you have a GPU, falls back to CPU if you don’t, and it just works. You don’t tune anything to get a first generation.

The trade-off: it’s slower than vLLM and doesn’t batch concurrent requests as well. For a single-user development setup, the gap doesn’t matter. For production-grade throughput we’ll switch in step 03.
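
To make the portability concrete before we install anything, here’s a minimal sketch using the official openai Python package pointed at the local server (it assumes a model has already been pulled, which we do two sections down). The api_key value is a placeholder: Ollama ignores it, but the SDK requires one.

# Minimal sketch: the official `openai` SDK talking to Ollama's
# OpenAI-compatible endpoint. The api_key is a dummy value Ollama ignores.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Reply with the word OK."}],
)
print(resp.choices[0].message.content)

Drop the base_url and the same code talks to api.openai.com; point it at a vLLM server in step 03 and it keeps working. (Below we build our own thin httpx client instead of depending on the SDK, but the contract is identical.)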

Install Ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On macOS, download the app from ollama.com (or brew install ollama). On Windows, download the installer — it’s a click-through.

Whichever route you take, you end up with the ollama CLI and a background server (a systemd service on Linux; on macOS the desktop app keeps the server running from the menu bar). Verify:

ollama --version

Expected output (version may vary):

ollama version is 0.4.x

If the command isn’t found, restart your shell — the installer adds Ollama to your PATH but the change won’t apply to the current session.

Pull a model

We picked Llama-3.1-8B in step 01. Ollama uses a model:tag namespace; pull it with:

ollama pull llama3.1:8b

That downloads ~5 GB of weights (the Q4_K_M quantization Ollama defaults to). Expect 1–10 minutes depending on your connection.

You can list everything you’ve pulled:

ollama list

Output:

NAME            ID              SIZE      MODIFIED
llama3.1:8b     46e0c10c039e    4.9 GB    2 minutes ago

First generation: shell

Test the install with the simplest possible interaction:

ollama run llama3.1:8b "Write a one-sentence summary of how transformers work."

You’ll see tokens stream in, then a newline. Something like:

Transformers are deep learning architectures that use self-attention
mechanisms to process sequential data, allowing each input element
to dynamically weigh the importance of every other element in
parallel rather than sequentially.

If you got that (or a paraphrase — the model is non-deterministic by default), the install works. You’re now serving an 8B-parameter language model from your laptop.

The first response is slower than subsequent ones — Ollama has to load the model into memory. After that, you’re at conversational speeds (tens of tokens per second on a recent laptop CPU, hundreds on GPU).

First generation: HTTP

The CLI is fine for testing; for our application we want HTTP. Ollama listens on localhost:11434 by default. Hit the OpenAI-compatible endpoint:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "user", "content": "Reply with the word OK."}
    ],
    "stream": false
  }'

Response shape (truncated):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "llama3.1:8b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "OK"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 1,
    "total_tokens": 15
  }
}

That JSON shape is the OpenAI spec, byte-for-byte. The same shape comes back from api.openai.com, from api.anthropic.com (via their compatibility layer), from any vLLM instance, from Together AI, from Groq, from Fireworks. Your client code is portable from day one.
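
Because the shape is stable, it’s safe to write small utilities against it. A sketch (the helper name is ours, purely illustrative) that pulls out the fields we’ll care about later for cost tracking:

# Sketch: extract the reply text and token counts from a chat.completion
# response dict. The helper name is illustrative, not part of the client below.
def summarize_completion(resp: dict) -> dict:
    usage = resp.get("usage", {})
    return {
        "text": resp["choices"][0]["message"]["content"],
        "finish_reason": resp["choices"][0].get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
    }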

Streaming

Set "stream": true and the server emits Server-Sent Events:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Count from 1 to 5."}],
    "stream": true
  }'

You’ll see lines like:

data: {"id":"...","choices":[{"delta":{"content":"1"}}]}
data: {"id":"...","choices":[{"delta":{"content":", "}}]}
data: {"id":"...","choices":[{"delta":{"content":"2"}}]}
...
data: [DONE]

That’s the SSE stream format. Every chat-style UI you’ve used (ChatGPT, Claude, Cursor) consumes a stream like this and renders tokens as they arrive. We’ll consume it from Python in a moment.

A Python client

Create the file:

# stack/ollama_client.py
from __future__ import annotations
import json
from typing import Iterator
import httpx


OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def chat(
    messages: list[dict],
    model: str = "llama3.1:8b",
    temperature: float = 0.7,
    max_tokens: int | None = None,
) -> str:
    """Single-shot chat. Blocks until the full response arrives.

    Returns the assistant's reply text. Raises httpx.HTTPStatusError on non-2xx responses.
    """
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "temperature": temperature,
    }
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens

    with httpx.Client(timeout=120.0) as client:
        r = client.post(OLLAMA_URL, json=payload)
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]


def stream(
    messages: list[dict],
    model: str = "llama3.1:8b",
    temperature: float = 0.7,
) -> Iterator[str]:
    """Streaming chat. Yields token-string chunks as they arrive."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "temperature": temperature,
    }
    with httpx.Client(timeout=None) as client:
        with client.stream("POST", OLLAMA_URL, json=payload) as r:
            r.raise_for_status()
            for line in r.iter_lines():
                # SSE format: "data: {...}" per line, blank lines between events.
                if not line or not line.startswith("data: "):
                    continue
                data = line[len("data: "):]
                if data == "[DONE]":
                    break
                obj = json.loads(data)
                # Each chunk has a delta with optional `content`.
                delta = obj["choices"][0].get("delta", {})
                if "content" in delta:
                    yield delta["content"]

The two functions cover the API contract for the rest of the curriculum: chat() for the simple case, stream() for token-at-a-time. Both speak the OpenAI schema, so swapping OLLAMA_URL for an OpenAI base URL works without changing anything else.
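
One small change worth making before step 03 is to read the base URL from the environment, so the engine swap is configuration rather than a code edit. A sketch (the LLM_BASE_URL variable name is our choice, not an Ollama or vLLM convention):

# Sketch: replace the hard-coded constant at the top of stack/ollama_client.py.
# LLM_BASE_URL is an arbitrary name we chose; vLLM's OpenAI-compatible server
# defaults to http://localhost:8000/v1.
import os

BASE_URL = os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1")
OLLAMA_URL = f"{BASE_URL}/chat/completions"

With that in place, export LLM_BASE_URL=http://localhost:8000/v1 points the same chat() and stream() functions at vLLM without touching the rest of the file.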

Sanity check

Add a __main__ block at the bottom:

# stack/ollama_client.py (bottom of file)
if __name__ == "__main__":
    import sys

    messages = [
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "What's 2 + 2?"},
    ]

    print("─── blocking ───")
    print(chat(messages, max_tokens=20))

    print("\n─── streaming ───")
    for chunk in stream(messages):
        sys.stdout.write(chunk)
        sys.stdout.flush()
    print()

Run it:

uv run python -m stack.ollama_client

Expected output (the model is non-deterministic; you’ll see a paraphrase):

─── blocking ───
2 + 2 = 4.

─── streaming ───
2 + 2 = 4.

The blocking version returns once the model has finished generating. The streaming version prints tokens as they arrive — you’ll see "2", " ", "+", " ", "2", " ", "=", " ", "4", "." land at human-readable speed.

If both worked, you have a working LLM service running on your machine and a Python client that talks to it. You’ve shipped the first half of the foundations release.
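
If you’d rather have that check live somewhere more permanent than a __main__ block, a minimal pytest sketch works too. The file path and test names are ours, and the tests skip themselves when Ollama isn’t listening:

# tests/test_ollama_client.py: a sketch; the path and test names are illustrative.
import httpx
import pytest

from stack.ollama_client import chat, stream


def _ollama_up() -> bool:
    # The bare root endpoint answers "Ollama is running" when the daemon is up.
    try:
        httpx.get("http://localhost:11434", timeout=2.0)
        return True
    except httpx.HTTPError:
        return False


pytestmark = pytest.mark.skipif(
    not _ollama_up(), reason="Ollama is not running on localhost:11434"
)


def test_chat_returns_text():
    reply = chat([{"role": "user", "content": "Reply with the word OK."}], max_tokens=5)
    assert isinstance(reply, str) and reply.strip()


def test_stream_yields_chunks():
    chunks = list(stream([{"role": "user", "content": "Reply with the word OK."}]))
    assert chunks and all(isinstance(c, str) for c in chunks)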

What latency to expect

Rough numbers on common hardware, for the 8B model at Q4_K_M quantization:

Hardware                    Tokens/sec   First-token latency
MacBook Pro M2 (CPU+GPU)    30–50        ~500 ms
MacBook Air M1 (CPU+GPU)    15–25        ~1 s
Linux laptop, no GPU        5–10         ~2 s
RTX 4090 desktop            80–120       ~200 ms
A100 (cloud)                200+         ~150 ms

If you’re way outside this range, something’s wrong — usually the model fell back to CPU when it should be on GPU. Check ollama ps to see what’s loaded and on what device.
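
To see where your machine lands in that table, a rough benchmark built on the stream() function above is enough. A sketch (the prompt and script location are arbitrary; chunk counts stand in for token counts, which is approximate but fine for comparison):

# Rough benchmark sketch: time-to-first-token and tokens/sec via stream().
# Counts streamed chunks as a proxy for tokens.
import time

from stack.ollama_client import stream

messages = [{"role": "user", "content": "Write three sentences about GPUs."}]

start = time.perf_counter()
first_token_at = None
chunks = 0
for _ in stream(messages):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    chunks += 1
end = time.perf_counter()

if first_token_at is None:
    print("no tokens received")
else:
    print(f"first token after {(first_token_at - start) * 1000:.0f} ms")
    elapsed = end - first_token_at
    if elapsed > 0:
        print(f"~{chunks / elapsed:.0f} tokens/sec over {chunks} chunks")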

Stop the daemon when you’re done

Ollama keeps running in the background after install. To stop the model and free memory:

ollama stop llama3.1:8b

To stop the daemon entirely (rarely needed):

# macOS (Homebrew install)
brew services stop ollama
# macOS (app install): quit Ollama from the menu bar

# Linux
sudo systemctl stop ollama

To restart, run the matching start command (brew services start ollama, sudo systemctl start ollama) or relaunch the macOS app.

Cross-references

  • Sampling Knobs demo — the temperature, top-p, top-k knobs we passed in chat() are the same ones the demo lets you slide on real GPT-2 logits. Same operations, different model.
  • Cost & Latency Calculator demo — the tokens/sec numbers above plug into this calculator’s “self-hosted” mode.
  • Tokenizer Surgery demo — the prompt_tokens field in the response is what BPE produced; the demo shows how.

What we did and didn’t do

What we did:

  • Installed Ollama and pulled an 8B model
  • Hit both the chat and streaming endpoints with curl
  • Wrote a Python client (chat() + stream()) we’ll grow into the production service
  • Sanity-checked end-to-end with a tiny test
  • Established that the OpenAI-compatible schema is what we’ll standardize on

What we didn’t:

  • Tune sampling. We used temperature=0.7 everywhere. Step 04 (eval) will look at how much sampling matters for evaluation; step 11 (multi-agent) will revisit it for application use.
  • Use the native Ollama API. Ollama also has its own, non-OpenAI-spec API at /api/generate and /api/chat. It’s slightly more flexible (it lets you override Modelfile parameters per request), but we stick with the OpenAI-compat path for portability.
  • Quantize differently. Q4_K_M is Ollama’s default and it’s a good default. If you want to compare Q8 (slower, slightly higher quality) or Q3_K (faster, lower quality), ollama pull llama3.1:8b-instruct-q8_0 and similar tags work.
  • Containerize. We’re running Ollama natively. For deployment (step 15) we’ll wrap it (and vLLM) in containers; for development the native install is faster.

Next

Step 03 swaps Ollama for vLLM, the production-grade inference engine. Same model, ~3× the throughput, proper continuous batching, paged KV cache. The Python client we wrote here keeps working unchanged — same OpenAI-compatible schema, just a different base URL. That’s the payoff for picking the portable abstraction in step 02.