Chunking Strategies

Documents are long; embedding models have token limits; LLMs have context windows. Chunking is how you slice documents into retrievable units. It’s the most underrated lever in RAG.

The tradeoffs

A chunk that’s too small:

  • Loses context (a sentence without its paragraph).
  • Splits relevant facts across multiple chunks.
  • Forces the LLM to assemble fragments.

A chunk that’s too large:

  • Has too much irrelevant text (lower retrieval precision).
  • Hits embedding model token limits.
  • Bloats the prompt (cost, latency).
  • “Distracts” the LLM with off-topic content.

Strategy 1 — Fixed-size chunks

Slice every N tokens (or characters), with overlap.

import tiktoken

def fixed_chunks(text, size=512, overlap=50):
    enc = tiktoken.get_encoding("cl100k_base")  # match your embedding model's tokenizer
    tokens = enc.encode(text)
    step = size - overlap  # advance by less than a full chunk so windows overlap
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]

Pros:

  • Simple, fast, predictable.
  • Easy to reason about token counts.

Cons:

  • Splits sentences and paragraphs arbitrarily.
  • Ignores document structure entirely.

Often the right baseline. With a good embedding model and overlap, fixed chunking gets you ~80% of the way.

Typical sizes:

  • Short (256 tokens): higher precision, more chunks to retrieve.
  • Medium (512–1024): balanced default.
  • Long (2048+): fewer, more contextful chunks. Good for narrative-heavy text.

Overlap: 10–20% of chunk size. Prevents losing information at boundaries.

Strategy 2 — Recursive character splitting

LangChain popularized this. Split on a hierarchy of separators:

  1. Try to split on "\n\n" (paragraph).
  2. If chunks are still too large, split on "\n".
  3. Then ". ".
  4. Then space.
  5. Then character.

Stops as soon as chunks fit the size budget.

from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)

Pros:

  • Respects natural language structure better than fixed.
  • Easy to use.

Cons:

  • Still ignores document semantics.
  • Doesn’t know about headings, lists, tables.

A solid default for general text.

Strategy 3 — Sentence-based chunking

Split on sentence boundaries; group sentences up to a token budget.

import nltk

nltk.download("punkt", quiet=True)  # sentence-tokenizer models, first run only

def token_count(sentences):
    # Rough proxy: whitespace word count. Swap in your embedding model's
    # tokenizer for accurate budgets.
    return sum(len(s.split()) for s in sentences)

sentences = nltk.sent_tokenize(text)
chunks, current = [], []
for s in sentences:
    if current and token_count(current + [s]) > 500:
        chunks.append(" ".join(current))
        current = [s]
    else:
        current.append(s)
if current:
    chunks.append(" ".join(current))

Pros:

  • Sentence-clean boundaries.
  • Good for short, fact-dense text (papers, FAQs).

Cons:

  • Sentence segmentation isn’t reliable across languages.
  • Doesn’t respect higher-level structure.

Strategy 4 — Structural chunking

Use the document’s structure: headings, sections, lists, tables.

For Markdown:

  • Split on ## and ### headings.
  • Each section is one chunk; subdivide if too large.
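
A minimal sketch of the heading split, assuming plain Markdown input (the function name is illustrative; subdividing oversized sections is left out):

import re

def markdown_sections(md_text):
    # Split before every ## or ### heading; the lookahead keeps each heading
    # line attached to the section it opens.
    parts = re.split(r"(?m)^(?=#{2,3} )", md_text)
    return [p.strip() for p in parts if p.strip()]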

For HTML/PDF:

  • Use a layout-aware parser (e.g. PyMuPDF, Unstructured, LlamaParse).
  • Honor headings, tables, lists.

Pros:

  • Aligns chunks with semantic units a human would recognize.
  • Tables / code blocks stay intact.
  • Heading context is naturally preserved.

Cons:

  • Requires document-aware parsing.
  • Variable chunk sizes — some sections huge, some tiny.

Worth the effort for technical documentation, legal contracts, scientific papers, codebases.

Strategy 5 — Semantic chunking

Use an embedding model to detect topic shifts: split where consecutive sentences become semantically distant.

For each pair of adjacent sentences (s_i, s_{i+1}):
    similarity = cos(embed(s_i), embed(s_{i+1}))
    if similarity < threshold:
        split here

Pros:

  • Captures real topic boundaries.
  • Often produces high-quality chunks for diverse docs.

Cons:

  • Slow at index time (extra embedding calls).
  • Sensitive to threshold tuning.
  • Doesn’t always beat structural chunking on well-formatted docs.

Ranges: try thresholds 0.3–0.7; tune on a sample.
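
A runnable version of the pseudocode above, using sentence-transformers (the model name and the 0.5 default are illustrative, not recommendations):

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, threshold=0.5):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Vectors are normalized, so the dot product is cosine similarity.
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks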

Strategy 6 — Late chunking (2024)

Apply chunking after running the document through an embedding model that supports long contexts.

  1. Embed the whole document (or a large window) at once, producing token-level embeddings.
  2. Pool the token embeddings within each chunk's window to produce chunk embeddings.

Each chunk’s embedding now reflects the whole document context, not just the chunk itself. Significantly improves retrieval for documents where chunks lack standalone context.
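
A minimal mean-pooling sketch with Hugging Face transformers. The model name is a placeholder for any long-context encoder, and real implementations handle windowing and special tokens more carefully:

import torch
from transformers import AutoModel, AutoTokenizer

def late_chunk(text, spans, model_name="your-long-context-encoder"):
    # spans: (start_char, end_char) chunk boundaries within `text`
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tok(text, return_offsets_mapping=True, return_tensors="pt", truncation=True)
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        token_emb = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    # Mean-pool the context-aware token embeddings falling inside each span.
    return [
        token_emb[(offsets[:, 0] >= s) & (offsets[:, 1] <= e)].mean(dim=0)
        for s, e in spans
    ]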

Supported by Jina-embeddings-v3 and some research models. Likely the future direction.

Strategy 7 — Sliding window with summary headers

For each chunk, prepend a summary of the surrounding context:

[Section: Introduction]
[Document: API Reference v3]
<actual chunk text>

Now retrieval can match on the summary even if the literal text doesn’t contain the keywords.

A simple version: prepend the document title and section headings. A fancier version: generate an LLM summary as the prefix.
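
The simple version is nearly a one-liner (names illustrative):

def with_context_header(chunk_text, doc_title, section):
    # Prepend document/section context so retrieval can match on it too.
    return f"[Document: {doc_title}]\n[Section: {section}]\n{chunk_text}"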

Strategy 8 — Multi-resolution / hierarchical chunking

Index chunks at multiple sizes simultaneously:

  • Sentence-level for precise matching
  • Paragraph-level for context
  • Section-level for topical retrieval

Retrieve at the smallest size, return the surrounding larger chunk to the LLM (called “small-to-big” retrieval).

chunk_id_to_parent_id = {...}  # sentence-chunk id -> enclosing paragraph/section id
results = retrieve_at_sentence_level(query)  # match at the finest granularity
parent_ids = {chunk_id_to_parent_id[c.id] for c in results}  # dedupe shared parents
return [fetch_chunk(pid) for pid in parent_ids]  # hand the larger chunks to the LLM

Often a quality boost without a major architectural change.

Strategy 9 — Code-aware chunking

For code, syntactic structure matters. Use AST-based chunking:

  • Each function = one chunk.
  • Each class = one chunk (or split if huge).
  • Imports + module docstring = one chunk.

Tools: tree-sitter, language-specific parsers (e.g. ast for Python).

For long functions, split by logical block (with comments) but keep the signature with each chunk.
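
A minimal sketch for Python using the standard-library ast module (decorators and nested definitions are ignored):

import ast

def python_chunks(source):
    # One chunk per top-level function/class; everything before the first
    # definition (imports, module docstring) becomes a prelude chunk.
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, prelude_end = [], len(lines)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            prelude_end = min(prelude_end, node.lineno - 1)
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return ["\n".join(lines[:prelude_end])] + chunks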

Strategy 10 — Table and figure handling

Tables and figures break naive chunking:

  • A 200-row table chunked at 512 tokens loses meaning.
  • Captions on figures get separated from the figures.

Approaches:

  • Extract tables to dedicated representations (CSV, JSON), embed those.
  • Use a vision-language model to caption figures, embed captions.
  • Use specialized parsers (Unstructured, LlamaParse, AWS Textract) that preserve structure.

For tables, column-aware embedding (embed each row with its header) often works well.
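
A sketch of the row-with-header idea (names illustrative):

def table_row_chunks(header, rows):
    # Repeating the header makes each row self-describing even when
    # retrieved in isolation.
    head = " | ".join(header)
    return [f"{head}\n{' | '.join(row)}" for row in rows]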

Chunk metadata

Every chunk should carry metadata:

chunk = {
    "id": "doc123_chunk5",
    "text": "...",
    "metadata": {
        "doc_id": "doc123",
        "title": "API Reference",
        "section": "Authentication",
        "url": "...",
        "modified": "2026-01-15",
        "page": 14,
    }
}

Metadata enables:

  • Filtering (“only docs from 2025+”)
  • Citations
  • Re-ranking by recency, authority
  • Multi-tenant isolation

How to pick a chunking strategy

  1. Start with recursive character splitting + 512 tokens + 50 overlap.
  2. Inspect 10 random chunks. Do they make sense standalone? Are tables/lists intact?
  3. Build an eval set of queries with expected source docs.
  4. Iterate: try structural chunking if you have rich docs, semantic chunking if topics shift mid-document.
  5. Measure recall@10. Target ≥80% before doing anything fancy.
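
A sketch of the metric, assuming retrieve(query) returns ranked source-doc ids (both names are illustrative):

def recall_at_10(eval_set, retrieve):
    # eval_set: [(query, expected_doc_id), ...]
    hits = sum(expected in retrieve(q)[:10] for q, expected in eval_set)
    return hits / len(eval_set)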

Common pitfalls

  • Too much overlap wastes storage and retrieves duplicates.
  • No overlap misses facts at chunk boundaries.
  • Chunks bigger than embedding model limits silently truncate.
  • Embedding the chunk without metadata loses retrieval signals.
  • One-size-fits-all in a heterogeneous corpus — mixed text/code/tables benefit from per-type handling.
  • Re-chunking on every index update without versioning — chunk IDs drift, breaks downstream.

Watch it interactively

  • Chunking Strategies — paste any text, toggle recursive / sentence-aware / markdown-structural, drag chunk size + overlap. Predict before clicking: at chunk size 200 + overlap 0, you’ll see mid-sentence cuts and ~5% boundary loss; at 500 + 50 overlap, cuts respect sentences but storage rises 10%. The trade-off is concrete on screen.
