step 07 · ship · building
Embeddings + vector store
Pick an embedding model, persist with sqlite-vec, retrieve at 5 ms. Plus: why you probably don't need a dedicated vector DB.
You have chunks. Now you need a way to find the right ones quickly when a user asks a question. That’s what embeddings + a vector store do: convert each chunk to a fixed-length vector, store it, and at query time find the chunks whose vectors are closest to the query’s vector.
This step picks the embedding model, picks the vector store, and ships both behind a clean interface. The opinionated take: sqlite-vec is the right vector store for ~80% of production deployments. We’ll defend that, then build it.
Pick an embedding model
The space looks more confusing than it is. In 2026, three categories cover virtually all use cases:
Local sentence-transformers (the default)
sentence-transformers/all-MiniLM-L6-v2 is the workhorse. 384 dimensions, ~25 MB on disk, a few hundred sentences per second on a laptop CPU. Quality is good enough for most retrieval. This is what we use.
For higher quality, BAAI/bge-large-en-v1.5 (1024 dimensions) or Snowflake/snowflake-arctic-embed-m (768 dimensions) are the modern leaderboard champions. ~10× slower, somewhat better recall.
API embeddings
OpenAI’s text-embedding-3-small (1536 dim) and text-embedding-3-large (3072 dim). Voyage AI’s voyage-3 and voyage-code-2. Cohere embed-v3.
When to use: if you’re already paying for an API and don’t want to manage local inference. Quality is competitive with the best OSS but you pay per-call (~$0.02 per 1M tokens) and you’re round-tripping every query through the network. For self-hosted RAG, this defeats the point.
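If you do go this route, the call is a few lines. A sketch using OpenAI's Python client (assumes the openai package is installed and OPENAI_API_KEY is set in your environment):

# Sketch only: embedding via the OpenAI API instead of a local model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I install PostgreSQL?"],
)
vector = resp.data[0].embedding  # list of 1536 floats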
Specialized models
Qwen-Embedding, BGE-M3 (multilingual + multi-vector), nomic-embed-text (long context). Each shines at a niche; for general English retrieval, MiniLM-L6 is hard to beat at its size.
We use all-MiniLM-L6-v2 for the rest of the curriculum. 384 dimensions, fast, good enough, no API calls.
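Before wiring it into a store, here is a minimal sketch of what the model gives you: embed two paraphrases and compare them (the phrases are arbitrary examples).

# Minimal sketch: two paraphrases should land close together in embedding space.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = model.encode(
    ["How do I reset my password?", "What are the steps to change my login credentials?"],
    normalize_embeddings=True,
)
print(vecs.shape)                # (2, 384)
print(float(vecs[0] @ vecs[1]))  # cosine similarity (dot product of unit vectors)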
Pick a vector store
Your options, ranked by “how much you should reach for this in 2026”:
sqlite-vec (the default)
A SQLite extension that adds vector search. The whole vector store is one .db file. No server, no daemon, no config. Embeddable in any Python application. Single-machine vector search at hundreds of QPS, scaling comfortably to ~10M vectors.
Use when: anything under ~10M chunks, single-machine deployments, applications where simplicity wins. Default starting point.
pgvector
PostgreSQL extension. If you already have Postgres, this is free — your vectors live alongside your application data and you transact across both. Comparable search performance to sqlite-vec, plus mature ANN indexes (HNSW, IVFFlat) when your corpus grows.
Use when: you already have a Postgres database, you want one storage system for everything.
Chroma / Qdrant / Weaviate
Dedicated vector databases. More features (per-namespace permissions, metadata filtering, multi-tenant isolation). Run as a server.
Use when: very high query volume, multi-tenant, or you genuinely need features the SQL extensions don’t have.
Pinecone / Turbopuffer / Vertex Vector Search
Managed cloud vector services. Pay-per-use, no operational overhead.
Use when: you’re already on the corresponding cloud and your team prefers managed services.
The unfashionable truth: for most production RAG, the right choice is sqlite-vec or pgvector. Dedicated vector DBs are operational overhead you don’t need until you’re past 10M+ chunks or have specific multi-tenant requirements.
We use sqlite-vec.
Setup
uv add sentence-transformers sqlite-vec numpy
Three packages, no GPU required, fully offline once models are downloaded.
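Before writing the module, you can confirm the extension loads with a throwaway check (vec_version() is a SQL function sqlite-vec registers):

# Throwaway check: load sqlite-vec into an in-memory database and print its version.
import sqlite3
import sqlite_vec

conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)
print(conn.execute("SELECT vec_version()").fetchone()[0])

With that working, the module itself: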
# stack/embed.py
from __future__ import annotations

import json
import sqlite3
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import sqlite_vec
from sentence_transformers import SentenceTransformer

from stack.chunk import Chunk

# Default embedding model. 384-d, ~25 MB on disk, a few hundred sentences/sec on CPU.
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
EMBED_DIM = 384


class Embedder:
    """Lazy-loaded sentence-transformers wrapper.

    The first call to .embed() loads the model (~3 sec). Subsequent
    calls are fast. Thread-safe for read-only inference.
    """

    def __init__(self, model_name: str = EMBED_MODEL_NAME):
        self.model_name = model_name
        self._model: SentenceTransformer | None = None

    @property
    def model(self) -> SentenceTransformer:
        if self._model is None:
            self._model = SentenceTransformer(self.model_name)
        return self._model

    def embed(self, texts: list[str], normalize: bool = True) -> np.ndarray:
        """Embed a list of strings; returns (n, EMBED_DIM) float32 array.

        normalize=True (default) projects to unit length so ranking by
        sqlite-vec's L2 distance matches ranking by cosine similarity
        (search() converts the distance back to a cosine score).
        """
        return self.model.encode(
            texts,
            normalize_embeddings=normalize,
            show_progress_bar=False,
        ).astype(np.float32)


@dataclass
class Hit:
    chunk_id: int
    score: float
    text: str
    metadata: dict
Two pieces: a lazy-loaded embedder, and a Hit dataclass for retrieval results. Now the store.
The vector store
# stack/embed.py (continuing)


class VectorStore:
    """SQLite-backed vector store using the sqlite-vec extension.

    Stores chunks + their embeddings in two tables:
      • chunks    — id, text, metadata (json), source
      • chunk_vec — id (FK to chunks.id), embedding (vector blob)

    Lookup: cosine similarity (L2 distance on normalized vectors,
    converted to a cosine score in search()).
    """

    def __init__(self, db_path: str | Path = "data/chunks.db"):
        self.db_path = Path(db_path)
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        self._conn = self._connect()
        self._init_schema()

    def _connect(self) -> sqlite3.Connection:
        conn = sqlite3.connect(self.db_path)
        conn.enable_load_extension(True)
        sqlite_vec.load(conn)
        conn.enable_load_extension(False)
        return conn

    def _init_schema(self) -> None:
        cur = self._conn.cursor()
        cur.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                text TEXT NOT NULL,
                metadata TEXT NOT NULL,
                source TEXT
            )
        """)
        # vec0 is sqlite-vec's virtual table for vector indexes.
        cur.execute(f"""
            CREATE VIRTUAL TABLE IF NOT EXISTS chunk_vec USING vec0(
                embedding float[{EMBED_DIM}]
            )
        """)
        self._conn.commit()

    def add(self, chunks: list[Chunk], embeddings: np.ndarray) -> list[int]:
        """Insert chunks + their embeddings; return the new IDs."""
        assert len(chunks) == embeddings.shape[0]
        cur = self._conn.cursor()
        ids: list[int] = []
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                "INSERT INTO chunks (text, metadata, source) VALUES (?, ?, ?)",
                (chunk.text, json.dumps(chunk.metadata), chunk.metadata.get("source")),
            )
            chunk_id = cur.lastrowid
            cur.execute(
                "INSERT INTO chunk_vec (rowid, embedding) VALUES (?, ?)",
                (chunk_id, emb.tobytes()),
            )
            ids.append(chunk_id)
        self._conn.commit()
        return ids

    def search(self, query_embedding: np.ndarray, k: int = 5) -> list[Hit]:
        """Return top-k chunks by cosine similarity to the query."""
        cur = self._conn.cursor()
        cur.execute(
            """
            SELECT
                c.id, c.text, c.metadata, v.distance
            FROM chunk_vec AS v
            JOIN chunks AS c ON c.id = v.rowid
            WHERE v.embedding MATCH ?
              AND k = ?
            ORDER BY v.distance ASC
            """,
            (query_embedding.astype(np.float32).tobytes(), k),
        )
        results: list[Hit] = []
        for chunk_id, text, metadata_json, distance in cur.fetchall():
            # sqlite-vec returns L2 distance on the (normalized) vectors.
            # For unit-length vectors, similarity = 1 - 0.5 * L2².
            score = 1.0 - 0.5 * (distance ** 2)
            results.append(Hit(
                chunk_id=chunk_id,
                score=score,
                text=text,
                metadata=json.loads(metadata_json),
            ))
        return results

    def count(self) -> int:
        cur = self._conn.cursor()
        cur.execute("SELECT COUNT(*) FROM chunks")
        return cur.fetchone()[0]

    def close(self) -> None:
        self._conn.close()
Key design choices:
- Two tables, one row per chunk. chunks holds the text + metadata; chunk_vec holds the vector. They share rowid (chunk_id). One join at query time.
- Cosine via L2. sqlite-vec returns L2 distance. For unit-length (normalized) vectors, cosine(a, b) = 1 − 0.5 · L2(a, b)². We convert at the boundary so callers see a similarity score where higher is better (1.0 means identical); a quick numeric check follows this list.
- Metadata as JSON in a TEXT column. Cheap and flexible. If you need queryable filters (e.g. WHERE source = 'foo.md') you’d add per-key columns; for the curriculum, JSON is fine.
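That identity is easy to check numerically; a standalone snippet with random unit vectors (illustrative only):

# Verify: for unit vectors, dot product == 1 - 0.5 * (L2 distance)^2.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384); a /= np.linalg.norm(a)
b = rng.normal(size=384); b /= np.linalg.norm(b)
l2 = np.linalg.norm(a - b)
print(float(a @ b), 1.0 - 0.5 * l2**2)  # same value, up to float error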
Pipeline: chunk → embed → store
A small index_documents() helper ties the pieces together:
# stack/embed.py (continuing)


def index_documents(
    docs: list[tuple[str, dict]],   # (text, metadata) pairs
    chunker,                        # any Chunker from step 06
    store: VectorStore,
    embedder: Embedder,
    batch_size: int = 64,
) -> int:
    """Run the full chunk → embed → store pipeline. Returns # chunks indexed."""
    all_chunks: list[Chunk] = []
    for text, metadata in docs:
        all_chunks.extend(chunker.split(text, metadata))
    print(f" produced {len(all_chunks)} chunks")

    # Batched embedding: encoding 64 texts per call amortizes the
    # per-call overhead (the model itself is loaded lazily, once).
    n_indexed = 0
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i : i + batch_size]
        texts = [c.text for c in batch]
        embeddings = embedder.embed(texts)
        store.add(batch, embeddings)
        n_indexed += len(batch)
    return n_indexed
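Calling it is one line per corpus. A hypothetical example — the docs/ glob is a placeholder, not something the curriculum repo ships:

# Hypothetical usage: index every markdown file under docs/ (paths are assumptions).
from pathlib import Path
from stack.chunk import MarkdownChunker

docs = [(p.read_text(), {"source": str(p)}) for p in Path("docs").rglob("*.md")]
n = index_documents(docs, MarkdownChunker(max_tokens=200), VectorStore(), Embedder())
print(f"indexed {n} chunks")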
Sanity check
# stack/embed.py (bottom)

if __name__ == "__main__":
    import time

    from stack.chunk import MarkdownChunker, SAMPLE_DOC

    # Set up a clean store (delete first if you want to re-run idempotently).
    store_path = Path("data/chunks.db")
    if store_path.exists():
        store_path.unlink()
    store = VectorStore(store_path)
    embedder = Embedder()
    chunker = MarkdownChunker(max_tokens=200)

    # Index the sample doc.
    print("─── indexing ───")
    docs = [(SAMPLE_DOC, {"source": "db-docs.md"})]
    n = index_documents(docs, chunker, store, embedder)
    print(f" indexed: {n} chunks")
    print(f" total in store: {store.count()}")

    # Run a few queries.
    print("\n─── retrieval ───")
    queries = [
        "How do I install PostgreSQL?",
        "What's the format for the connection string?",
        "How do I run database migrations?",
    ]
    for query in queries:
        t0 = time.perf_counter()
        q_emb = embedder.embed([query])[0]
        hits = store.search(q_emb, k=2)
        elapsed_ms = (time.perf_counter() - t0) * 1000

        print(f"\n Q: {query}")
        print(f"    [{elapsed_ms:.1f} ms]")
        for hit in hits:
            preview = hit.text[:60].replace("\n", " ⏎ ")
            headings = " > ".join(hit.metadata.get("headings", []))
            print(f"    {hit.score:.3f}  {headings}")
            print(f"           {preview}…")
    store.close()
Run:
uv run python -m stack.embed
Expected output (timings vary, scores depend on the embedder; output approximate):
─── indexing ───
produced 4 chunks
indexed: 4 chunks
total in store: 4
─── retrieval ───
Q: How do I install PostgreSQL?
[4.7 ms]
0.624 Setting Up the Database > Installing the Server
## Installing the Server ⏎ ⏎ We use PostgreSQL 16. On macO…
0.418 Setting Up the Database
# Setting Up the Database ⏎ ⏎ Before you can run the appli…
Q: What's the format for the connection string?
[3.2 ms]
0.681 Setting Up the Database > Connection Strings
## Connection Strings ⏎ ⏎ The application reads `DATABASE_…
0.342 Setting Up the Database > Installing the Server
## Installing the Server ⏎ ⏎ We use PostgreSQL 16. On macO…
Q: How do I run database migrations?
[3.4 ms]
0.706 Setting Up the Database > Creating the Schema
## Creating the Schema ⏎ ⏎ Run the migrations from the pro…
0.395 Setting Up the Database > Connection Strings
## Connection Strings ⏎ ⏎ The application reads `DATABASE_…
What to notice:
- Each query returns the right top-1. “Install PostgreSQL?” → “Installing the Server” section. “Connection string?” → “Connection Strings” section. Dense embedding retrieval works.
- The score gap between #1 and #2 is meaningful. The first hit is ~0.6+, the second drops to ~0.4. That gap is a rough confidence signal; if it collapses, retrieval is weak (a small helper sketch follows this list).
- Retrieval is ~3–5 ms per query at this corpus size. At 50K chunks, expect 10–30 ms; at 1M, 30–100 ms. sqlite-vec stays fast through the range that matters.
- Heading metadata is preserved. When the retriever returns a hit, the chunk’s place in the document hierarchy comes along for free. The next step uses it.
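If you want to turn the score-gap observation into something your application can check, a small hypothetical helper (not part of stack/embed.py) could look like this; any thresholding policy on top of it is up to you:

# Hypothetical helper: use the #1 vs #2 score gap as a weak confidence signal.
def score_gap(hits: list[Hit]) -> float:
    """Gap between the top two hits; a collapsed gap suggests weak retrieval."""
    if len(hits) < 2:
        return hits[0].score if hits else 0.0
    return hits[0].score - hits[1].score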
Performance numbers worth remembering
Rough operating points for MiniLM-L6 + sqlite-vec on a laptop CPU:
| Corpus size | DB size | Index build time | Query latency |
|---|---|---|---|
| 1K chunks | ~5 MB | ~3 sec | ~1 ms |
| 100K chunks | ~500 MB | ~5 min | ~5 ms |
| 1M chunks | ~5 GB | ~50 min | ~20 ms |
| 10M chunks | ~50 GB | ~8 hr | ~80 ms |
These are real numbers (no GPU). At 10M+ chunks you start wanting a real ANN index (HNSW with proper tuning, FAISS, or moving to a dedicated vector DB). Below that, sqlite-vec is fine.
Cross-references
- Embedding Playground demo — the same MiniLM-L6 model running in your browser; pick two phrases and watch cosine similarity update
- Chunking Strategies demo — the chunkers from step 06 visualized
- RAG Visualizer demo — preview of step 08; shows dense vs BM25 vs hybrid retrieval
What we did and didn’t do
What we did:
- Lazy-loaded sentence-transformers/all-MiniLM-L6-v2 for embeddings
- Stood up a sqlite-vec-backed store with proper schema, batched indexing, cosine retrieval
- Indexed real chunks from step 06’s chunker; retrieval works at ~5 ms
- Preserved chunk metadata (including heading hierarchy) through the pipeline
What we didn’t:
- Implement BM25 retrieval. That’s step 08 — keyword retrieval to complement dense retrieval, both ranked together via reciprocal rank fusion.
- Add metadata filtering. Real production wants WHERE source LIKE 'docs/%' or WHERE created_at > ?. The schema already has the source column ready; we’ll add typed filter support in step 08.
- Implement late chunking (mentioned in step 06). Requires a long-context embedding model; we use the standard short-context one. Worth revisiting once you have the basic pipeline working.
- Quantize the embeddings. A future optimization: store int8 vectors (4× smaller) instead of float32. sqlite-vec supports this via int8[384] instead of float[384]. Worth it once your corpus passes ~5M chunks. A schema-only sketch follows this list.
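For reference, the quantized variant is a one-line change to the vec0 DDL inside _init_schema() — a sketch only; you’d also need to quantize the embeddings to int8 before inserting, which isn’t shown here:

# Sketch: same vec0 virtual table, int8 storage instead of float32 (4× smaller on disk).
cur.execute(f"""
    CREATE VIRTUAL TABLE IF NOT EXISTS chunk_vec USING vec0(
        embedding int8[{EMBED_DIM}]
    )
""")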
Next
Step 08 is retrieval done right. Dense retrieval (what we just built) finds semantic matches but misses exact keywords. BM25 (a lexical method) finds keywords but misses paraphrases. The combination — both rankings fused via reciprocal rank fusion, then refined with a cross-encoder reranker — is the production standard. We’ll add both, ship the full retrieval pipeline, and measure recall.