production-stack building · 06 / 17 · 22 min read · 30 min hands-on

step 06 · ship · building

RAG, the production way

What chunking actually is, why naive splits fail in production, and the four strategies that hold up. The unglamorous step that makes or breaks every RAG system.


The Foundations release stood up your own LLM-as-a-service. The next three steps add retrieval — letting the model see content from your own documents at query time. That’s RAG (retrieval-augmented generation), and it’s the single most-deployed pattern in production AI.

This step is the unglamorous part: chunking. Splitting documents into pieces small enough to fit in the model’s context, large enough to carry meaning, semantically coherent enough for retrieval to find them. Sounds boring. It’s the single biggest determinant of RAG quality. A great retriever over bad chunks loses to a mediocre retriever over good chunks every time.

By the end you’ll have three chunking strategies implemented behind a common interface (plus a preview of a fourth), have run them on a real document, and have a clear sense of which to reach for when.

Why this matters more than people think

Three failure modes that all trace back to chunking:

  1. Retrieval finds nothing useful. The right answer was in your corpus, but the chunk containing it was split mid-sentence and the embedding lost coherence. Retrieval ranks it 11th when it should be 1st.

  2. Retrieval finds something useful but only a fragment. The chunk has the right keyword but is missing the surrounding context. The model sees “the answer is 42” without the question.

  3. Retrieval finds the right thing but the model can’t use it. Chunks are too long, you can only fit two in context, and the relevant one is buried in noise. The model misses it.

Each failure mode is fixable, but not by tuning the LLM. You fix them upstream, in the chunker.

The four chunking strategies

In rough order of “how much you’d reach for this in 2026”:

1. Recursive character (the default that mostly works)

Split on paragraph breaks first. If a chunk is still too big, split on sentences. If still too big, split on character count with overlap. This is what most production systems do; LangChain’s RecursiveCharacterTextSplitter is the canonical implementation.

Use when: general English text, mixed documents, you have no specific structure to exploit. Default starting point.

2. Sentence-aware

Split on sentences (using a real sentence segmenter, not regex), then merge sentences greedily into chunks of target size. Each chunk ends at a sentence boundary. Slightly slower than recursive-character but produces cleaner embeddings.

Use when: narrative prose, long-form articles, anything where mid-sentence splits would destroy retrievability.

3. Structural / semantic

Split based on document structure: markdown headings, HTML elements, code function boundaries. Each chunk is a coherent unit of the original document.

Use when: documentation, structured corpora, knowledge bases. Almost always wins over recursive-character if your input has structure.

4. Late chunking (the 2024 idea worth knowing)

Run the whole document through the embedding model as one long sequence first, so every token embedding sees the rest of the document, then pool those token embeddings into per-chunk vectors. Each chunk’s embedding has full-document context baked in. Requires an embedding model with a large context window.

Use when: you have access to long-context embedding models (Jina-v3, BGE-M3) and your retrieval is failing because chunks lost cross-references. Newer; not yet the default; worth knowing about.
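
To make the mechanics concrete ahead of step 07, here is a conceptual sketch under two assumptions: you already have per-token embeddings for the whole document, and you know each chunk's token span. The helper name and the numpy dependency are ours, not part of stack/chunk.py.

import numpy as np

def late_chunk_embeddings(
    token_embeddings: np.ndarray,        # (n_tokens, dim), from one pass over the whole document
    chunk_spans: list[tuple[int, int]],  # [(start_token, end_token), ...] for each chunk
) -> list[np.ndarray]:
    """Mean-pool each chunk's span of document-contextualized token embeddings,
    so every chunk vector inherits context from the rest of the document."""
    return [token_embeddings[start:end].mean(axis=0) for start, end in chunk_spans]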

We’ll implement strategies 1, 2, 3 fully. Strategy 4 we’ll mention again in step 07 since it requires the embedding stack.

Setup

Add the dependencies. We’ll use tiktoken for token-aware sizing (the actual constraint for context windows is tokens, not characters) and pysbd for sentence segmentation:

uv add tiktoken pysbd

Create the file:

# stack/chunk.py
from __future__ import annotations
import re
from dataclasses import dataclass
from typing import Protocol
import tiktoken
import pysbd


# We'll use the GPT-4 tokenizer for chunk-size budgeting. The exact
# tokenizer doesn't matter much for chunking — we use it as a proxy
# for context cost, not for the actual encoding the model uses.
_enc = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    return len(_enc.encode(text))


@dataclass
class Chunk:
    text: str
    metadata: dict   # source, position, headings, etc.

    @property
    def n_tokens(self) -> int:
        return count_tokens(self.text)


class Chunker(Protocol):
    """Common interface every chunker satisfies."""
    def split(self, text: str, metadata: dict | None = None) -> list[Chunk]: ...

The Chunker protocol is the single contract every strategy below satisfies — input is a document, output is a list of Chunk objects with text and metadata.
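
Anything that satisfies the protocol plugs into the same pipeline. As a small illustration (the chunk_corpus helper is ours, for demonstration only; nothing later depends on it):

# Illustrative helper: any object with a matching .split() method works here.
def chunk_corpus(docs: dict[str, str], chunker: Chunker) -> list[Chunk]:
    """Chunk a {source_name: text} mapping with whichever strategy you pass in."""
    out: list[Chunk] = []
    for source, text in docs.items():
        out.extend(chunker.split(text, metadata={"source": source}))
    return out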

Strategy 1: Recursive character

# stack/chunk.py (continuing)
class RecursiveCharChunker:
    """Split on increasingly fine-grained boundaries until under target size.

    Order: paragraph (\\n\\n) → newline → sentence-ish → space.
    Falls through to the next splitter only if the chunk is still too big.

    Stays under `target_tokens` per chunk in expectation, with `overlap_tokens`
    of context shared between adjacent chunks.
    """

    SEPARATORS = ["\n\n", "\n", ". ", " "]

    def __init__(self, target_tokens: int = 400, overlap_tokens: int = 50):
        self.target_tokens = target_tokens
        self.overlap_tokens = overlap_tokens

    def split(self, text: str, metadata: dict | None = None) -> list[Chunk]:
        meta = metadata or {}
        # Recursive descent: try the biggest splitter first, then smaller.
        chunks_text = self._recurse(text, self.SEPARATORS)
        return [Chunk(text=t, metadata=meta) for t in self._add_overlap(chunks_text)]

    def _recurse(self, text: str, separators: list[str]) -> list[str]:
        if count_tokens(text) <= self.target_tokens:
            return [text]
        if not separators:
            # Fall through: hard-cut by token count.
            return self._hard_split(text)
        sep, *rest = separators
        parts = text.split(sep)
        result: list[str] = []
        buf = ""
        for part in parts:
            candidate = buf + (sep if buf else "") + part
            if count_tokens(candidate) <= self.target_tokens:
                buf = candidate
            else:
                if buf:
                    result.append(buf)
                # If the part itself is still too big, recurse into smaller seps.
                if count_tokens(part) > self.target_tokens:
                    result.extend(self._recurse(part, rest))
                    buf = ""
                else:
                    buf = part
        if buf:
            result.append(buf)
        return result

    def _hard_split(self, text: str) -> list[str]:
        """Last resort: split on character index by approximate target."""
        chars_per_token = max(1, len(text) // count_tokens(text))
        target_chars = self.target_tokens * chars_per_token
        return [text[i : i + target_chars] for i in range(0, len(text), target_chars)]

    def _add_overlap(self, chunks: list[str]) -> list[str]:
        """Prepend the tail of chunk N to chunk N+1, in token units."""
        if self.overlap_tokens <= 0 or len(chunks) <= 1:
            return chunks
        out = [chunks[0]]
        for prev, curr in zip(chunks, chunks[1:]):
            tail_tokens = _enc.encode(prev)[-self.overlap_tokens:]
            tail = _enc.decode(tail_tokens)
            out.append(tail + " " + curr)
        return out

Two things worth highlighting:

The recursive descent. Try \n\n (paragraphs) first. If a paragraph is still too big, split it on \n (lines). If a line is too big, split on sentence-ish (. ). If a sentence is too big, split on space. Each level preserves more semantic structure than the next; we only go finer when forced.

Token-aware sizing. Targets are in tokens, not characters. A 500-character chunk in English is ~120 tokens; in Chinese it might be 500 tokens. Always size by tokens — that’s what eats your context budget.
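
You can see the mismatch with a quick scratch check (exact counts depend on the text, so treat the ratios as ballpark, not gospel):

# Scratch check, not part of the module: characters vs. tokens under cl100k_base.
english = "The database migration completed successfully. " * 10
chinese = "数据库迁移已成功完成。" * 40
print(len(english), count_tokens(english))  # typically several characters per token
print(len(chinese), count_tokens(chinese))  # typically around one token per character, sometimes more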

Strategy 2: Sentence-aware

# stack/chunk.py (continuing)
class SentenceChunker:
    """Split on sentence boundaries using pysbd, merge greedily to target size.

    Cleaner than RecursiveChar for narrative prose: every chunk ends at a
    period (or equivalent) so embeddings stay coherent.
    """

    def __init__(self, target_tokens: int = 400, overlap_sentences: int = 1):
        self.target_tokens = target_tokens
        self.overlap_sentences = overlap_sentences
        self._segmenter = pysbd.Segmenter(language="en", clean=False)

    def split(self, text: str, metadata: dict | None = None) -> list[Chunk]:
        sentences = self._segmenter.segment(text)
        chunks: list[Chunk] = []
        meta = metadata or {}
        buf: list[str] = []
        buf_tokens = 0
        for sent in sentences:
            sent_tokens = count_tokens(sent)
            if buf_tokens + sent_tokens > self.target_tokens and buf:
                chunks.append(Chunk(text=" ".join(buf), metadata=meta))
                # Carry over `overlap_sentences` for context continuity.
                buf = buf[-self.overlap_sentences:] if self.overlap_sentences > 0 else []
                buf_tokens = sum(count_tokens(s) for s in buf)
            buf.append(sent)
            buf_tokens += sent_tokens
        if buf:
            chunks.append(Chunk(text=" ".join(buf), metadata=meta))
        return chunks

Cleaner failure mode than RecursiveChar: the worst case is a chunk that ends one sentence too soon, not a chunk that ends mid-clause. For narrative prose this almost always wins.

Strategy 3: Markdown structural

# stack/chunk.py (continuing)
class MarkdownChunker:
    """Split by Markdown heading hierarchy. Each chunk preserves the
    chain of parent headings as metadata, so retrieval can reconstruct
    document context."""

    HEADING_RE = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)

    def __init__(self, max_tokens: int = 600):
        self.max_tokens = max_tokens
        # If a heading section overflows, fall back to a sentence chunker.
        self._fallback = SentenceChunker(target_tokens=max_tokens)

    def split(self, text: str, metadata: dict | None = None) -> list[Chunk]:
        meta = metadata or {}
        sections = self._split_by_headings(text)
        chunks: list[Chunk] = []
        for section in sections:
            section_meta = {**meta, "headings": section["headings"]}
            if count_tokens(section["text"]) <= self.max_tokens:
                chunks.append(Chunk(text=section["text"], metadata=section_meta))
            else:
                # Fall back to sentence chunking; preserve the heading metadata.
                for sub in self._fallback.split(section["text"], section_meta):
                    chunks.append(sub)
        return chunks

    def _split_by_headings(self, text: str) -> list[dict]:
        """Walk the doc, build sections rooted at each heading. Each section
        carries its full heading-chain (e.g. ['Setup', 'Database', 'Migrations'])."""
        sections: list[dict] = []
        current_text: list[str] = []
        heading_stack: list[tuple[int, str]] = []  # (depth, title)

        for line in text.splitlines():
            m = self.HEADING_RE.match(line)
            if m:
                # Flush the current section before starting the new one.
                if current_text:
                    sections.append({
                        "text": "\n".join(current_text).strip(),
                        "headings": [t for _, t in heading_stack],
                    })
                    current_text = []
                depth = len(m.group(1))
                title = m.group(2).strip()
                # Pop deeper-or-equal headings off the stack; push this one.
                heading_stack = [(d, t) for d, t in heading_stack if d < depth]
                heading_stack.append((depth, title))
                current_text.append(line)
            else:
                current_text.append(line)

        if current_text:
            sections.append({
                "text": "\n".join(current_text).strip(),
                "headings": [t for _, t in heading_stack],
            })
        return [s for s in sections if s["text"]]

The structural chunker has two big wins:

  1. Each chunk respects document structure — a section about “API authentication” stays together; you don’t get a chunk that’s the tail of API docs and the head of Database docs.
  2. Heading metadata is retained — the chunk carries headings: ["Setup", "Database", "Migrations"] so a downstream layer (the prompt template, or a re-rank scorer) can use the document’s organizational context, as sketched below.

Most documentation corpora benefit from this strategy more than any other.
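
A minimal sketch of what “use the heading metadata downstream” can look like: prefix a retrieved chunk with its heading chain before it goes into the prompt. The function name is ours; the real prompt template arrives later in the series.

def format_chunk_for_prompt(chunk: Chunk) -> str:
    """Prefix the chunk with its heading breadcrumb so the model sees where
    in the source document the text came from."""
    headings = chunk.metadata.get("headings", [])
    breadcrumb = " > ".join(headings) if headings else chunk.metadata.get("source", "")
    return f"[{breadcrumb}]\n{chunk.text}"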

Sanity check

Add a __main__ block:

# stack/chunk.py (bottom)
SAMPLE_DOC = """\
# Setting Up the Database

Before you can run the application, you need to configure the database.
This involves three steps: installing the database server, creating
the schema, and configuring connection strings.

## Installing the Server

We use PostgreSQL 16. On macOS:

brew install postgresql@16


On Ubuntu, use the official APT repository.

## Creating the Schema

Run the migrations from the project root:

uv run alembic upgrade head


This creates all tables, indexes, and enum types.

## Connection Strings

The application reads `DATABASE_URL` from the environment. Format:

postgresql://user:pass@host:port/dbname


For local development, use `postgresql://postgres@localhost/myapp`.
"""


if __name__ == "__main__":
    chunkers = {
        "RecursiveChar (target=120)": RecursiveCharChunker(target_tokens=120, overlap_tokens=20),
        "Sentence (target=120)":      SentenceChunker(target_tokens=120, overlap_sentences=1),
        "Markdown (max=200)":         MarkdownChunker(max_tokens=200),
    }

    for name, chunker in chunkers.items():
        chunks = chunker.split(SAMPLE_DOC, metadata={"source": "db-docs.md"})
        print(f"\n── {name} ──")
        print(f"  {len(chunks)} chunks total")
        for i, c in enumerate(chunks):
            preview = c.text[:60].replace("\n", " ⏎ ")
            print(f"  [{i}] {c.n_tokens:3d} tok · {preview}…")
            if "headings" in c.metadata:
                print(f"       headings: {c.metadata['headings']}")

Run:

uv run python -m stack.chunk

Expected output:

── RecursiveChar (target=120) ──
  3 chunks total
  [0] 119 tok · # Setting Up the Database  ⏎  ⏎ Before you can run …
  [1] 116 tok · n Ubuntu, use the official APT repository.  ⏎  ⏎ ## …
  [2]  79 tok · ation reads `DATABASE_URL` from the environment. Form…

── Sentence (target=120) ──
  3 chunks total
  [0] 110 tok · # Setting Up the Database Before you can run the appl…
  [1] 117 tok · This involves three steps: installing the database se…
  [2]  84 tok · The application reads `DATABASE_URL` from the environ…

── Markdown (max=200) ──
  4 chunks total
  [0]  47 tok · # Setting Up the Database  ⏎  ⏎ Before you can run th…
       headings: ['Setting Up the Database']
  [1]  46 tok · ## Installing the Server  ⏎  ⏎ We use PostgreSQL 16. …
       headings: ['Setting Up the Database', 'Installing the Server']
  [2]  39 tok · ## Creating the Schema  ⏎  ⏎ Run the migrations from …
       headings: ['Setting Up the Database', 'Creating the Schema']
  [3]  62 tok · ## Connection Strings  ⏎  ⏎ The application reads `DA…
       headings: ['Setting Up the Database', 'Connection Strings']

What to notice:

  • RecursiveChar splits mid-section and even mid-word. Chunk [1] starts with “n Ubuntu, use the official APT repository” — the previous chunk ate the opening “O” of “On” and the overlap glue smeared the boundary. This is the failure mode we predicted.
  • Sentence-aware never starts mid-sentence. Each chunk’s first token is a sentence start. Embeddings stay clean.
  • Markdown chunker produces semantic units. Each chunk is one section of the doc, with its heading hierarchy attached. A retrieval layer can later filter by “headings includes ‘Connection Strings’” or use the headings as additional signal.

For this kind of structured doc, Markdown wins decisively. For unstructured prose, Sentence wins. The recursive-character fallback is what you reach for when neither structure nor clear sentence boundaries are available.

What “production-grade chunking” looks like

The four strategies above are the basics. These are the next-level moves a serious team would add, in order of impact:

  1. Atomic fenced regions (don’t split code blocks, tables, or JSON snippets; sketched after this list)
  2. Heading propagation in metadata (already shown in the Markdown chunker)
  3. Per-document chunker selection (route by file type — markdown chunker for .md, code chunker for .py)
  4. Cross-reference resolution (rewrite “as discussed earlier” so the chunk is self-contained)
  5. Late chunking (mentioned above; needs long-context embeddings)

Most production RAG teams do (1)–(3). Few do (4) or (5). Diminishing returns past (3) for most use cases.
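
As a sketch of (1): pull fenced code blocks out as standalone chunks and run only the prose between them through an inner chunker. The names are ours and the regex handles only simple triple-backtick fences, so treat it as a starting point rather than a robust parser.

# Illustrative sketch, reusing the Chunk/Chunker definitions above.
FENCE_RE = re.compile(r"```.*?```", re.DOTALL)

def split_protecting_fences(text: str, inner: Chunker, metadata: dict | None = None) -> list[Chunk]:
    """Emit each fenced block as its own atomic chunk; chunk the surrounding prose normally."""
    meta = metadata or {}
    chunks: list[Chunk] = []
    last = 0
    for m in FENCE_RE.finditer(text):
        prose = text[last:m.start()]
        if prose.strip():
            chunks.extend(inner.split(prose, meta))
        chunks.append(Chunk(text=m.group(0), metadata={**meta, "kind": "code"}))
        last = m.end()
    tail = text[last:]
    if tail.strip():
        chunks.extend(inner.split(tail, meta))
    return chunks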


What we did and didn’t do

What we did:

  • Three chunking strategies behind a common Chunker protocol
  • Token-aware sizing (the actual context-budget constraint)
  • Heading-hierarchy metadata on Markdown chunks for downstream prompting
  • A side-by-side comparison on a real document so the failure modes are visible

What we didn’t:

  • Implement late chunking. Sketched conceptually but not shipped — it needs the embedding model’s full context, which we don’t have until step 07.
  • Handle multi-modal documents. PDFs with figures, tables, equations need separate handling. Real production extracts these via PyMuPDF or unstructured.io and routes them through specialized chunkers. Out of scope.
  • Build a chunker registry / router. A real system tags each document by source type and dispatches to the right chunker. Easy to add once you’ve picked the strategies; we have only three so the dispatch is trivial.
  • Benchmark retrieval recall against each chunker. That’s step 08’s job — we’ll come back and re-rank these chunks with all three retrieval methods then.

Next

Step 07 takes the chunks we just produced and embeds them. We’ll pick an embedding model (sentence-transformers), pick a vector store (sqlite-vec — yes, really), persist embeddings to disk, and stand up the dense-retrieval half of the RAG pipeline. The lookup will be ~5 ms per query against a corpus of 50K chunks.