Where you cut a document changes everything
Bad chunking is the #1 cause of bad RAG. Same text, four strategies — see how the chunks differ in size, in shape, and in whether they preserve meaning at the boundaries.
The four strategies in one paragraph each
- Fixed by chars — the laziest default. Split every N characters with optional overlap. Doesn't respect words, sentences, or paragraphs. Useful only when you need guaranteed deterministic chunk sizes; bad otherwise.
- Fixed by tokens — the standard production choice. Token count maps to embedder input limit and to retrieval cost. Still a syntactic chop, but at least the unit is the right one.
- Sentence boundary — splits on `.`, `!`, or `?` followed by whitespace. Each chunk is a full sentence, so meaning is preserved. Sizes are uneven, so you usually merge small sentences and split huge ones.
- Paragraph — splits on blank lines. Each chunk is a coherent block of thought. Best when your source already has structure (Markdown, docs, blog posts). Useless on wall-of-text.
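All four strategies fit in a few lines each. This is an illustrative sketch, not the demo's actual implementation; in particular, the token splitter approximates tokens with whitespace words, whereas a real system would use the embedder's own tokenizer (e.g. tiktoken) so counts match its input limit.

```python
import re

def chunk_by_chars(text, size=280, overlap=40):
    # Slide a fixed window: each chunk starts size - overlap chars after
    # the previous one, so consecutive chunks share `overlap` characters.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def chunk_by_tokens(text, max_tokens=100):
    # Stand-in tokenizer: whitespace words. Swap in the embedder's
    # tokenizer in production so the count matches its limit.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

def chunk_by_sentences(text):
    # Split after ., !, or ? followed by whitespace; the lookbehind
    # keeps the punctuation attached to its sentence.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def chunk_by_paragraphs(text):
    # Blank lines delimit paragraphs; drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Note how only the last two respect meaning: the first two guarantee size, the last two guarantee coherence.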
Things to try
- Stick with the default RAG document. Switch between strategies — notice how `fixed by chars` at 280 / overlap 40 cuts mid-sentence at chunk boundaries (the cyan-highlighted overlap region helps you see it).
- Switch to `tokens` at 100. The boundaries shift because tokens are not equally-sized chars (e.g. `·Retrieval` is one token, and the leading `·` space marker is part of it).
- Paste a wall of text without paragraph breaks. Try `paragraph` mode — you'll get one giant chunk. That's the failure mode.
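You can reproduce the first experiment offline. A minimal sketch (the sample text and the 60-char window are made up for illustration) showing how a fixed-char cut lands mid-sentence while a sentence split keeps each thought whole:

```python
import re

text = ("Retrieval quality depends on chunk boundaries. "
        "A chunk that ends mid-sentence embeds a broken thought.")

# Fixed-char split: a 60-char window ignores sentence structure.
char_chunks = [text[i:i + 60] for i in range(0, len(text), 60)]

# Sentence split: boundaries fall only after ., !, or ?.
sent_chunks = re.split(r"(?<=[.!?])\s+", text)

print(repr(char_chunks[0]))  # window ends mid-sentence
print(repr(sent_chunks[0]))  # a complete sentence
```

The first chunk of the char split stops partway through the second sentence; the sentence split never does.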
Production patterns
Real systems hybridize: paragraph-aware splitting first, sentence-merge to a target token size, with character-level overlap at chunk boundaries to recover information lost to the cut. The "late chunking" pattern (Jina AI 2024) goes a step further — it embeds the FULL document in one pass and only then slices the resulting embeddings, getting context-aware chunk vectors without the boundary problem.
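The hybrid pattern above can be sketched as one function: paragraph split first, then merge sentences up to a token budget, then character-level overlap at the seams. This is a sketch under assumptions, not a reference implementation; token counts are approximated by whitespace words, so swap in your embedder's tokenizer for real use.

```python
import re

def hybrid_chunk(text, target_tokens=100, overlap_chars=40):
    # 1. Paragraph-aware: never merge across blank lines.
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        # 2. Sentence-merge up to a token budget (words ~ tokens here).
        current, count = [], 0
        for sent in re.split(r"(?<=[.!?])\s+", para):
            n = len(sent.split())
            if current and count + n > target_tokens:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(sent)
            count += n
        if current:
            chunks.append(" ".join(current))
    # 3. Character overlap: prepend the tail of the previous chunk
    #    to recover context lost at the cut.
    with_overlap = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        with_overlap.append(prev[-overlap_chars:] + " " + cur)
    return with_overlap
```

Each step fixes one failure mode: paragraphs keep topics together, the token budget keeps chunks embeddable, and overlap softens whatever cuts remain.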
Anchored to 09-rag/chunking-strategies from the learning path. Tokenization courtesy of Tokenizer Surgery.