Anti-pattern

Char-count chunking on prose

Splitting documents into fixed N-character windows for retrieval. Cheap and ubiquitous, it quietly destroys retrieval quality on real prose.

Taxonomy: anti_patterns.chunking
Severity: common
Symptom: Retrieved chunks start mid-sentence, end mid-clause, and miss the topic anchor at the top of the section. A reranker can't recover what the chunker discarded.
Root cause: Char windows ignore sentence, paragraph, and section boundaries. The "topic sentence" lands in one chunk and its supporting evidence in the next, so neither chunk on its own answers the query.
Fix: Use a recursive or semantic splitter that respects paragraph and sentence boundaries. Add a small overlap (10-15%) and a header-aware prefix.
First documented: 2023

Why it keeps happening

Every “your first RAG in 50 lines” tutorial ships with a 1000-char fixed splitter because it’s the smallest amount of code that compiles. People deploy that splitter and then debug retrieval for weeks.

How to spot it

Eval queries that ask “what does the author conclude about X” return chunks that contain X but no conclusion.
A large share of chunks start with a lowercase letter or a comma, which means they begin mid-sentence.
Rerankers help on isolated queries but the overall metric barely moves, because the evidence the reranker needs simply isn’t in any single chunk.
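
The second check above is easy to automate. The following is a minimal sketch, not a standard metric: the function names (fixed_char_chunks, mid_sentence_start_rate) and the sample text are made up for illustration. It reproduces the fixed-window splitter and measures how many chunks open with a lowercase letter or a comma.

```python
def fixed_char_chunks(text: str, size: int) -> list[str]:
    """The anti-pattern itself: slice every `size` characters, ignoring boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def mid_sentence_start_rate(chunks: list[str]) -> float:
    """Fraction of chunks whose first non-space character is lowercase or a comma."""
    if not chunks:
        return 0.0
    bad = 0
    for chunk in chunks:
        stripped = chunk.lstrip()
        if stripped and (stripped[0].islower() or stripped[0] == ","):
            bad += 1
    return bad / len(chunks)


# Illustrative sample text (hypothetical, not from a real corpus).
sample = (
    "Fixed windows are cheap to implement. They also cut sentences in half, "
    "which hurts retrieval. The topic sentence lands in one chunk, and its "
    "evidence lands in the next."
)
chunks = fixed_char_chunks(sample, 40)
rate = mid_sentence_start_rate(chunks)
print(f"{rate:.0%} of {len(chunks)} chunks start mid-sentence")
```

Run it over the chunks already in your index. A rate well above zero on ordinary prose suggests the splitter is cutting through sentences; what counts as acceptable depends on the corpus, since headings and list items can legitimately start in lowercase.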

The fix

Switch to a recursive splitter that tries paragraph → sentence → word in order, with a target size band (e.g. 512-1024 tokens).
Add header-aware prefixing: prepend the enclosing section’s heading to every chunk derived from it.
Use modest overlap (10-15%) so a topic sentence near a boundary appears in both adjacent chunks.
For long-form structured docs, consider semantic chunking (embed candidate boundaries, split where local similarity dips).

Solved by: contextual-chunking, semantic-chunking patterns.
Often co-occurs with: ignoring document structure, single-retriever pipelines.
