production-stack · building · step 07/17 · 22 min read · 30 min hands-on


Embeddings + vector store

Pick an embedding model, persist with sqlite-vec, retrieve at 5 ms. Plus: why you probably don't need a dedicated vector DB.


You have chunks. Now you need a way to find the right ones quickly when a user asks a question. That’s what embeddings + a vector store do: convert each chunk to a fixed-length vector, store it, and at query time find the chunks whose vectors are closest to the query’s vector.

This step picks the embedding model, picks the vector store, and ships both behind a clean interface. The opinionated take: sqlite-vec is the right vector store for ~80% of production deployments. We’ll defend that, then build it.

Pick an embedding model

The space looks more confusing than it is. In 2026, three categories cover virtually all use cases:

Local sentence-transformers (the default)

sentence-transformers/all-MiniLM-L6-v2 is the workhorse. 384 dimensions, ~25 MB on disk, a few thousand sentences per second on a laptop CPU. Quality is good enough for most retrieval. This is what we use.

For higher quality, BAAI/bge-large-en-v1.5 (1024 dimensions) or Snowflake/snowflake-arctic-embed-m (768 dimensions) are the modern leaderboard champions. ~10× slower, somewhat better recall.

API embeddings

OpenAI’s text-embedding-3-small (1536 dim) and text-embedding-3-large (3072 dim). Voyage AI’s voyage-3 and voyage-code-2. Cohere embed-v3.

When to use: if you’re already paying for an API and don’t want to manage local inference. Quality is competitive with the best OSS, but you pay per token (~$0.02 per 1M) and you’re round-tripping every query through the network. For self-hosted RAG, this defeats the point.
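To make the cost concrete, a back-of-envelope estimate (hypothetical corpus size; the ~$0.02/1M-token rate is the one quoted above):

```python
# Hypothetical numbers: 100k chunks at ~300 tokens each, embedded once.
n_chunks = 100_000
tokens_per_chunk = 300
usd_per_million_tokens = 0.02   # assumed text-embedding-3-small rate

total_tokens = n_chunks * tokens_per_chunk             # 30M tokens
cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
print(f"one-time indexing: ${cost_usd:.2f}")           # → one-time indexing: $0.60
```

Indexing is cheap; the recurring cost is embedding every query (plus the network round-trip), which is what the local model avoids.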

Specialized models

Qwen-Embedding, BGE-M3 (multilingual + multi-vector), nomic-embed-text (long context). Each shines at a niche; for general English retrieval, MiniLM-L6 is hard to beat at its size.

We use all-MiniLM-L6-v2 for the rest of the curriculum. 384 dimensions, fast, good enough, no API calls.

Pick a vector store

Your options, ranked by “how much you should reach for this in 2026”:

sqlite-vec (the default)

A SQLite extension that adds vector search. The whole vector store is one .db file. No server, no daemon, no config. Embeddable in any Python application. Exact (brute-force) KNN search at hundreds of QPS on a single machine; it scales comfortably to ~10M vectors before you need a true ANN index.

Use when: anything under ~10M chunks, single-machine deployments, applications where simplicity wins. Default starting point.

pgvector

PostgreSQL extension. If you already have Postgres, this is free: your vectors live alongside your application data and you transact across both. Comparable performance to sqlite-vec at these scales, plus true ANN indexes (HNSW, IVFFlat) when you need them.

Use when: you already have a Postgres database, you want one storage system for everything.

Chroma / Qdrant / Weaviate

Dedicated vector databases. More features (per-namespace permissions, metadata filtering, multi-tenant isolation). Run as a server.

Use when: very high query volume, multi-tenant, or you genuinely need features the SQL-extensions don’t have.

Managed services

Managed cloud vector services. Pay-per-use, no operational overhead.

Use when: you’re already on the corresponding cloud and team prefers managed services.

The unfashionable truth: for most production RAG, the right choice is sqlite-vec or pgvector. Dedicated vector DBs are operational overhead you don’t need until you’re past 10M+ chunks or have specific multi-tenant requirements.

We use sqlite-vec.

Setup

uv add sentence-transformers sqlite-vec numpy

Three packages, no GPU required, fully offline once models are downloaded.

# stack/embed.py
from __future__ import annotations
import json
import sqlite3
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable
import numpy as np
import sqlite_vec
from sentence_transformers import SentenceTransformer

from stack.chunk import Chunk


# Default embedding model. 384-d, ~25 MB on disk, thousands of sentences/sec on CPU.
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
EMBED_DIM = 384


class Embedder:
    """Lazy-loaded sentence-transformers wrapper.

    The first call to .embed() loads the model (~3 sec). Subsequent
    calls are fast. Thread-safe for read-only inference.
    """

    def __init__(self, model_name: str = EMBED_MODEL_NAME):
        self.model_name = model_name
        self._model: SentenceTransformer | None = None

    @property
    def model(self) -> SentenceTransformer:
        if self._model is None:
            self._model = SentenceTransformer(self.model_name)
        return self._model

    def embed(self, texts: list[str], normalize: bool = True) -> np.ndarray:
        """Embed a list of strings; returns (n, EMBED_DIM) float32 array.

        normalize=True (default) projects to unit length so cosine
        similarity reduces to dot product, which is what sqlite-vec
        scores against under the hood.
        """
        return self.model.encode(
            texts,
            normalize_embeddings=normalize,
            show_progress_bar=False,
        ).astype(np.float32)


@dataclass
class Hit:
    chunk_id: int
    score: float
    text: str
    metadata: dict

Two pieces: a lazy-loaded embedder, and a Hit dataclass for retrieval results. Now the store.

The vector store

# stack/embed.py (continuing)
class VectorStore:
    """SQLite-backed vector store using the sqlite-vec extension.

    Stores chunks + their embeddings in two tables:
      • chunks        — id, text, metadata (json), source
      • chunk_vec     — id (FK to chunks.id), embedding (vector blob)

    Lookup: cosine similarity (via dot product on normalized vectors).
    """

    def __init__(self, db_path: str | Path = "data/chunks.db"):
        self.db_path = Path(db_path)
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        self._conn = self._connect()
        self._init_schema()

    def _connect(self) -> sqlite3.Connection:
        conn = sqlite3.connect(self.db_path)
        conn.enable_load_extension(True)
        sqlite_vec.load(conn)
        conn.enable_load_extension(False)
        return conn

    def _init_schema(self) -> None:
        cur = self._conn.cursor()
        cur.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                text TEXT NOT NULL,
                metadata TEXT NOT NULL,
                source TEXT
            )
        """)
        # vec0 is sqlite-vec's virtual table for vector indexes.
        cur.execute(f"""
            CREATE VIRTUAL TABLE IF NOT EXISTS chunk_vec USING vec0(
                embedding float[{EMBED_DIM}]
            )
        """)
        self._conn.commit()

    def add(self, chunks: list[Chunk], embeddings: np.ndarray) -> list[int]:
        """Insert chunks + their embeddings; return the new IDs."""
        assert len(chunks) == embeddings.shape[0]
        cur = self._conn.cursor()
        ids: list[int] = []
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                "INSERT INTO chunks (text, metadata, source) VALUES (?, ?, ?)",
                (chunk.text, json.dumps(chunk.metadata), chunk.metadata.get("source")),
            )
            chunk_id = cur.lastrowid
            cur.execute(
                "INSERT INTO chunk_vec (rowid, embedding) VALUES (?, ?)",
                (chunk_id, emb.tobytes()),
            )
            ids.append(chunk_id)
        self._conn.commit()
        return ids

    def search(self, query_embedding: np.ndarray, k: int = 5) -> list[Hit]:
        """Return top-k chunks by cosine similarity to the query."""
        cur = self._conn.cursor()
        cur.execute(
            """
            SELECT
                c.id, c.text, c.metadata, v.distance
            FROM chunk_vec AS v
            JOIN chunks    AS c ON c.id = v.rowid
            WHERE v.embedding MATCH ?
              AND k = ?
            ORDER BY v.distance ASC
            """,
            (query_embedding.astype(np.float32).tobytes(), k),
        )
        results: list[Hit] = []
        for chunk_id, text, metadata_json, distance in cur.fetchall():
            # sqlite-vec returns L2 distance on the (normalized) vectors.
            # For unit-length vectors, similarity = 1 - 0.5 * L2².
            score = 1.0 - 0.5 * (distance ** 2)
            results.append(Hit(
                chunk_id=chunk_id,
                score=score,
                text=text,
                metadata=json.loads(metadata_json),
            ))
        return results

    def count(self) -> int:
        cur = self._conn.cursor()
        cur.execute("SELECT COUNT(*) FROM chunks")
        return cur.fetchone()[0]

    def close(self) -> None:
        self._conn.close()

Key design choices:

  • Two tables, one row per chunk. chunks holds the text + metadata; chunk_vec holds the vector. They share rowid (chunk_id). One join at query time.
  • Cosine via L2. sqlite-vec returns L2 distance. For unit-length (normalized) vectors, cosine(a, b) = 1 - 0.5 · L2(a, b)². We convert at the boundary so callers see a cosine score (at most 1, higher is better; in practice ∈ [0, 1] for natural-language embeddings).
  • Metadata as JSON in a TEXT column. Cheap and flexible. If you need queryable filters (e.g. WHERE source = 'foo.md') you’d add per-key columns; for the curriculum, JSON is fine.
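The cosine-via-L2 identity is easy to verify numerically; a quick numpy-only check at the embedder’s dimensionality, no model needed:

```python
import numpy as np

# Two random unit vectors, same dimensionality as the embeddings (384).
rng = np.random.default_rng(0)
a = rng.normal(size=384).astype(np.float32)
b = rng.normal(size=384).astype(np.float32)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)                    # dot product = cosine on unit vectors
l2 = float(np.linalg.norm(a - b))        # what sqlite-vec hands back
recovered = 1.0 - 0.5 * l2**2            # the conversion used in search()
assert abs(cosine - recovered) < 1e-5
```

The identity follows from expanding L2(a, b)² = |a|² + |b|² - 2·a·b = 2 - 2·cos(a, b) on unit vectors.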

Pipeline: chunk → embed → store

A small index_documents() helper ties the pieces together:

# stack/embed.py (continuing)
def index_documents(
    docs: list[tuple[str, dict]],   # (text, metadata) pairs
    chunker,                         # any Chunker from step 06
    store: VectorStore,
    embedder: Embedder,
    batch_size: int = 64,
) -> int:
    """Run the full chunk → embed → store pipeline. Returns # chunks indexed."""
    all_chunks: list[Chunk] = []
    for text, metadata in docs:
        all_chunks.extend(chunker.split(text, metadata))

    print(f"  produced {len(all_chunks)} chunks")

    # Batched embedding. encode() has fixed per-call overhead that batching
    # amortizes, and batching avoids holding every vector in memory at once.
    n_indexed = 0
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i : i + batch_size]
        texts = [c.text for c in batch]
        embeddings = embedder.embed(texts)
        store.add(batch, embeddings)
        n_indexed += len(batch)
    return n_indexed

Sanity check

# stack/embed.py (bottom)
if __name__ == "__main__":
    import time
    from stack.chunk import MarkdownChunker, SAMPLE_DOC

    # Set up a clean store (delete first if you want to re-run idempotently).
    store_path = Path("data/chunks.db")
    if store_path.exists():
        store_path.unlink()

    store = VectorStore(store_path)
    embedder = Embedder()
    chunker = MarkdownChunker(max_tokens=200)

    # Index the sample doc.
    print("─── indexing ───")
    docs = [(SAMPLE_DOC, {"source": "db-docs.md"})]
    n = index_documents(docs, chunker, store, embedder)
    print(f"  indexed: {n} chunks")
    print(f"  total in store: {store.count()}")

    # Run a few queries.
    print("\n─── retrieval ───")
    queries = [
        "How do I install PostgreSQL?",
        "What's the format for the connection string?",
        "How do I run database migrations?",
    ]
    for query in queries:
        t0 = time.perf_counter()
        q_emb = embedder.embed([query])[0]
        hits = store.search(q_emb, k=2)
        elapsed_ms = (time.perf_counter() - t0) * 1000

        print(f"\n  Q: {query}")
        print(f"     [{elapsed_ms:.1f} ms]")
        for hit in hits:
            preview = hit.text[:60].replace("\n", " ⏎ ")
            headings = " > ".join(hit.metadata.get("headings", []))
            print(f"     {hit.score:.3f}  {headings}")
            print(f"            {preview}…")

    store.close()

Run:

uv run python -m stack.embed

Expected output (timings vary, scores depend on the embedder; output approximate):

─── indexing ───
  produced 4 chunks
  indexed: 4 chunks
  total in store: 4

─── retrieval ───

  Q: How do I install PostgreSQL?
     [4.7 ms]
     0.624  Setting Up the Database > Installing the Server
            ## Installing the Server  ⏎  ⏎ We use PostgreSQL 16. On macO…
     0.418  Setting Up the Database
            # Setting Up the Database  ⏎  ⏎ Before you can run the appli…

  Q: What's the format for the connection string?
     [3.2 ms]
     0.681  Setting Up the Database > Connection Strings
            ## Connection Strings  ⏎  ⏎ The application reads `DATABASE_…
     0.342  Setting Up the Database > Installing the Server
            ## Installing the Server  ⏎  ⏎ We use PostgreSQL 16. On macO…

  Q: How do I run database migrations?
     [3.4 ms]
     0.706  Setting Up the Database > Creating the Schema
            ## Creating the Schema  ⏎  ⏎ Run the migrations from the pro…
     0.395  Setting Up the Database > Connection Strings
            ## Connection Strings  ⏎  ⏎ The application reads `DATABASE_…

What to notice:

  • Each query returns the right top-1. “Install PostgreSQL?” → “Installing the Server” section. “Connection string?” → “Connection Strings” section. Dense embedding retrieval works.
  • The score gap between #1 and #2 is meaningful. First hit is ~0.6+, second drops to ~0.4. That gap is the model’s confidence; if the gap collapses, retrieval is weak.
  • Retrieval is ~3–5 ms per query at this corpus size. At 50K chunks, expect 10–30 ms; at 1M, 30–100 ms. sqlite-vec stays fast through the range that matters.
  • Heading metadata is preserved. When the retriever returns a hit, the chunk’s place in the document hierarchy comes along for free. The next step uses it.

Performance numbers worth remembering

Rough operating points for MiniLM-L6 + sqlite-vec on a laptop CPU:

Corpus size    DB size    Index build time    Query latency
1K chunks      ~5 MB      ~3 sec              ~1 ms
100K chunks    ~500 MB    ~5 min              ~5 ms
1M chunks      ~5 GB      ~50 min             ~20 ms
10M chunks     ~50 GB     ~8 hr               ~80 ms

These are real numbers (no GPU). At 10M+ chunks you start wanting a real ANN index (HNSW with proper tuning, FAISS, or moving to a dedicated vector DB). Below that, sqlite-vec is fine.
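The DB-size column is dominated by raw float32 vector bytes; a floor estimate is simple arithmetic (chunk text, JSON metadata, and SQLite overhead account for the remainder):

```python
# Floor estimate: raw embedding bytes only, at 384 float32 dims per chunk.
dim = 384
bytes_per_vector = dim * 4   # float32 → 1536 bytes per chunk

for n in (1_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} chunks ≈ {n * bytes_per_vector / 1e6:>8,.0f} MB raw vectors")
```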


What we did and didn’t do

What we did:

  • Lazy-loaded sentence-transformers/all-MiniLM-L6-v2 for embeddings
  • Stood up a sqlite-vec backed store with proper schema, batched indexing, cosine retrieval
  • Indexed real chunks from step 06’s chunker; retrieval works at ~5 ms
  • Preserved chunk metadata (including heading hierarchy) through the pipeline

What we didn’t:

  • Implement BM25 retrieval. That’s step 08 — keyword retrieval to complement dense retrieval, both ranked together via reciprocal rank fusion.
  • Add metadata filtering. Real production wants WHERE source LIKE 'docs/%' or WHERE created_at > ?. The schema already has the source column ready; we’ll add typed filter support in step 08.
  • Implement late chunking (mentioned in step 06). Requires a long-context embedding model; we use the standard short-context one. Worth revisiting once you have the basic pipeline working.
  • Quantize the embeddings. A future optimization: store int8 vectors (4× smaller) instead of float32. sqlite-vec supports this via int8[384] instead of float[384]. Worth it once your corpus passes ~5M chunks.
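A sketch of what int8 quantization buys, using an assumed symmetric linear (absmax) scheme; sqlite-vec’s internal int8 handling may differ, so this only illustrates the size/error trade-off:

```python
import numpy as np

# One normalized 384-d float32 vector, like our MiniLM embeddings.
rng = np.random.default_rng(0)
vec = rng.normal(size=384).astype(np.float32)
vec /= np.linalg.norm(vec)

# Absmax quantization: map the largest-magnitude component to ±127.
scale = float(np.abs(vec).max()) / 127.0
q = np.round(vec / scale).astype(np.int8)        # 384 bytes vs 1536
restored = q.astype(np.float32) * scale

assert vec.nbytes == 4 * q.nbytes                # 4× smaller, as claimed
assert float(np.abs(vec - restored).max()) <= scale  # rounding error ≤ 0.5·scale
```

The per-vector scale has to be stored alongside the int8 payload, but 4 extra bytes against a 4× reduction is a rounding error.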

Next

Step 08 is retrieval done right. Dense retrieval (what we just built) finds semantic matches but misses exact keywords. BM25 (a lexical method) finds keywords but misses paraphrases. Both together — combined via reciprocal rank fusion and refined with a cross-encoder reranker — is the production standard. We’ll add both, ship the full retrieval pipeline, and measure recall.