demo

Where you cut a document changes everything

Bad chunking is the #1 cause of bad RAG. Same text, four strategies — see how the chunks differ in size, in shape, and in whether they preserve meaning at the boundaries.

The four strategies in one paragraph each

  • Fixed by chars — the laziest default. Split every N characters with optional overlap. Doesn't respect words, sentences, or paragraphs. Useful only when you need guaranteed deterministic chunk sizes; bad otherwise.
  • Fixed by tokens — the standard production choice. Token count maps directly to the embedder's input limit and to retrieval cost. Still a syntactic chop, but at least the unit is the right one.
  • Sentence boundary — splits on . ! ? followed by whitespace. Each chunk is a full sentence — meaning is preserved. Sizes are uneven, so you usually merge small sentences and split huge ones.
  • Paragraph — splits on blank lines. Each chunk is a coherent block of thought. Best when your source already has structure (Markdown, docs, blog posts). Useless on wall-of-text. Minimal sketches of these splitters follow this list.
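
The character, sentence, and paragraph splitters each fit in a few lines. A minimal Python sketch with hypothetical helper names, not the demo's actual implementation:

    import re

    def chunk_by_chars(text: str, size: int = 280, overlap: int = 40) -> list[str]:
        """Fixed by chars: slide a window of `size` characters, stepping size - overlap."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    def chunk_by_sentences(text: str) -> list[str]:
        """Sentence boundary: split on . ! ? followed by whitespace."""
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def chunk_by_paragraphs(text: str) -> list[str]:
        """Paragraph: split on blank lines."""
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

Only chunk_by_chars can end a chunk mid-word; the other two inherit whatever boundaries the text already has.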

Things to try

  1. Stick with the default RAG document. Switch between strategies — notice how fixed by chars at 280 / overlap 40 cuts mid-sentence at chunk boundaries (the cyan-highlighted overlap region makes the cuts easy to spot).
  2. Switch to tokens at 100. The boundaries shift because tokens are not equal-sized slices of characters (e.g. "·Retrieval" is a single token, and the leading space "·" is part of it). See the token sketch after this list.
  3. Paste a wall-of-text without paragraph breaks. Try paragraph mode — you'll get one giant chunk. That's the failure mode.
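
To see why token boundaries drift away from character boundaries, inspect the tokens directly. A sketch assuming the tiktoken library and its cl100k_base vocabulary; the demo's tokenizer (via Tokenizer Surgery) may differ:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def chunk_by_tokens(text: str, size: int = 100) -> list[str]:
        """Fixed by tokens: encode once, slice the token ids, decode each slice."""
        ids = enc.encode(text)
        return [enc.decode(ids[i:i + size]) for i in range(0, len(ids), size)]

    # Decoded tokens are uneven in character length, and a word's leading space
    # belongs to its token, which is why a 100-token chunk rarely ends on a
    # round character count.
    for tok_id in enc.encode("Retrieval Augmented Generation splits unevenly."):
        print(repr(enc.decode([tok_id])))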

Production patterns

Real systems hybridize: paragraph-aware splitting first, sentence-merge to a target token size, with character-level overlap at chunk boundaries to recover information lost to the cut. The "late chunking" pattern (Jina AI 2024) goes a step further — it embeds the FULL document in one pass and only then slices the resulting embeddings, getting context-aware chunk vectors without the boundary problem.
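
A sketch of that hybrid, reusing the hypothetical splitters above and tiktoken for token counting; real pipelines tune the targets and also split sentences that alone exceed the budget:

    import re
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def hybrid_chunks(text: str, target_tokens: int = 200, char_overlap: int = 40) -> list[str]:
        chunks: list[str] = []
        for para in re.split(r"\n\s*\n", text):                    # 1. paragraph-aware split
            buf = ""
            for sent in re.split(r"(?<=[.!?])\s+", para.strip()):  # 2. merge sentences...
                candidate = (buf + " " + sent).strip()
                if buf and len(enc.encode(candidate)) > target_tokens:
                    chunks.append(buf)                              # ...up to the token target
                    buf = sent
                else:
                    buf = candidate
            if buf:
                chunks.append(buf)
        # 3. character-level overlap: prepend the tail of the previous chunk
        return [chunks[i - 1][-char_overlap:] + " " + c if i else c
                for i, c in enumerate(chunks)]

The overlap is applied last so it survives both the paragraph and the sentence passes.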

Anchored to 09-rag/chunking-strategies from the learning path. Tokenization courtesy of Tokenizer Surgery.