Anti-pattern

Char-count chunking on prose

Splitting documents into fixed N-character windows for retrieval. It is cheap and ubiquitous, and it quietly destroys retrieval quality on real prose.

Taxonomy
anti_patterns.chunking
Severity
common
Symptom
Retrieved chunks start mid-sentence, end mid-clause, and miss the topic anchor at the top of the section. A reranker cannot recover what the chunker discarded.
Root cause
Char windows ignore sentence, paragraph, and section boundaries. The "topic sentence" lands in one chunk and its supporting evidence in the next, so neither chunk on its own answers the query.
Fix
Use a recursive or semantic splitter that respects paragraph and sentence boundaries. Add a small overlap (10-15%) and a header-aware prefix.
First documented
2023

Why it keeps happening

Every “your first RAG in 50 lines” tutorial ships with a 1000-char fixed splitter because it’s the smallest amount of code that compiles. People deploy that splitter and then debug retrieval for weeks.
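
For reference, the splitter in question usually looks something like the sketch below. This is an illustration, not code from any particular tutorial: `char_chunks` and the sample text are hypothetical, and the window is shrunk from 1000 to 50 characters so the damage shows up on a three-sentence string.

```python
def char_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-width splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample text, three short sentences.
doc = (
    "Fixed-width chunking is cheap to implement. "
    "It ignores sentence boundaries entirely. "
    "Retrieval quality suffers as a result."
)

chunks = char_chunks(doc, size=50)
for chunk in chunks:
    # The first chunk stops inside the word "ignores"; the second starts there.
    print(repr(chunk))
```

Nothing is lost in aggregate (the chunks concatenate back to the original), but no boundary lines up with a sentence, so each chunk carries a fragment of its neighbour's meaning.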

How to spot it

The fix

  1. Switch to a recursive splitter that tries paragraph → sentence → word in order, with a target size band (e.g. 512-1024 tokens).
  2. Add header-aware prefixing: prepend the enclosing section’s heading to every chunk derived from it.
  3. Use modest overlap (10-15%) so a topic sentence near a boundary appears in both adjacent chunks.
  4. For long-form structured docs, consider semantic chunking (embed candidate boundaries, split where local similarity dips).
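
Steps 1-3 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: all function names are hypothetical, whitespace-separated words stand in for tokens, headings are assumed to be Markdown-style `#` lines, and the sample document is invented. Step 4 (semantic chunking) needs an embedding model and is not covered here.

```python
import re

# Coarsest to finest boundary: paragraph, sentence, word.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def words(text: str) -> int:
    """Word count, used here as a rough stand-in for a token count (assumption)."""
    return len(text.split())


def split_recursive(text: str, max_words: int, level: int = 0) -> list[str]:
    """Split at the coarsest boundary that yields pieces of at most max_words."""
    if words(text) <= max_words or level == len(SEPARATORS):
        return [text.strip()]
    pieces: list[str] = []
    for part in re.split(SEPARATORS[level], text):
        if part.strip():
            pieces.extend(split_recursive(part, max_words, level + 1))
    return pieces


def pack(pieces: list[str], max_words: int, overlap_words: int) -> list[str]:
    """Greedily merge pieces into chunks, repeating trailing pieces as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        if current and sum(map(words, current)) + words(piece) > max_words:
            chunks.append(" ".join(current))
            # Carry whole trailing pieces forward, within the overlap budget.
            tail: list[str] = []
            for prev in reversed(current):
                if sum(map(words, tail)) + words(prev) > overlap_words:
                    break
                tail.insert(0, prev)
            # Drop overlap that would push the next chunk over the limit.
            while tail and sum(map(words, tail)) + words(piece) > max_words:
                tail.pop(0)
            current = tail
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


def sections(doc: str):
    """Yield (heading, body) pairs; assumes Markdown-style '#' headings."""
    heading, body = "", []
    for line in doc.splitlines():
        if line.startswith("#"):
            if body:
                yield heading, "\n".join(body)
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if body:
        yield heading, "\n".join(body)


def chunk_document(doc: str, max_words: int = 200, overlap_ratio: float = 0.15) -> list[str]:
    """Boundary-aware chunks, each prefixed with its enclosing section heading."""
    overlap_words = int(max_words * overlap_ratio)
    out: list[str] = []
    for heading, body in sections(doc):
        pieces = [p for p in split_recursive(body, max_words) if p]
        for chunk in pack(pieces, max_words, overlap_words):
            out.append(f"{heading}\n{chunk}" if heading else chunk)
    return out


# Hypothetical sample document; tiny limits so the behaviour is visible.
doc = """# Returns policy

Items can be returned within 30 days. A receipt is required. Refunds go to the original payment method.

Opened software cannot be returned. Gift cards are final sale.

# Shipping

Orders ship within two business days. Express shipping is available at checkout.
"""

chunks = chunk_document(doc, max_words=12, overlap_ratio=0.5)
for chunk in chunks:
    print(chunk, end="\n---\n")
```

The overlap here is made of whole sentences, not a raw character or word count, so the repeated text is itself readable. The exaggerated 50% ratio is only there to make the overlap visible on a toy input; with realistic chunk sizes the 10-15% range from step 3 applies.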