Anti-pattern
Char-count chunking on prose
- Taxonomy
- anti_patterns.chunking
- Severity
- common
- Symptom
- Retrieved chunks start mid-sentence, end mid-clause, and miss the topic anchor at the top of the section. A reranker can't restore context that the chunker split away.
- Root cause
- Char windows ignore sentence, paragraph, and section boundaries. The "topic sentence" lands in one chunk and its supporting evidence in the next, so neither chunk on its own answers the query.
- Fix
- Use a recursive or semantic splitter that respects paragraph and sentence boundaries. Add a small overlap (10-15%) and a header-aware prefix.
- First documented
- 2023
Why it keeps happening
Every “your first RAG in 50 lines” tutorial ships with a 1000-char fixed splitter because it’s the smallest amount of code that runs. People deploy that splitter and then debug retrieval for weeks.
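For reference, the kind of splitter those tutorials ship looks roughly like the sketch below. The function name `fixed_char_chunks` and the sample text are made up for illustration, and the window is shrunk to 60 characters so the effect is visible on a three-sentence document.

```python
def fixed_char_chunks(text: str, size: int = 1000) -> list[str]:
    """Cut text into consecutive windows of `size` characters.

    Boundaries fall wherever the character count says, not where a
    sentence or paragraph ends.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample document, not taken from any real corpus.
doc = (
    "Caching reduces latency for repeated queries. "
    "However, stale entries can return outdated answers. "
    "We therefore conclude that cache lifetimes must be tuned per workload."
)

for chunk in fixed_char_chunks(doc, size=60):
    print(repr(chunk))
```

With this input the first window stops partway through the second sentence, and the later windows begin mid-sentence, which is the symptom described in the card above.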
How to spot it
- Eval queries that ask “what does the author conclude about X” return chunks that contain X but no conclusion.
- A large share of chunks start with a lowercase letter or a comma.
- Rerankers help on isolated queries but the overall metric barely moves, because the evidence the reranker needs simply isn’t in any single chunk.
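The second check above is cheap to automate. The following is a minimal sketch, assuming the chunks are already available as a list of strings; the function name `mid_sentence_start_rate` and the set of punctuation treated as suspicious are illustrative choices, not an established metric.

```python
def mid_sentence_start_rate(chunks: list[str]) -> float:
    """Fraction of non-empty chunks whose first visible character is a
    lowercase letter or one of , ; : )

    This is a heuristic: chunks that legitimately start lowercase
    (code identifiers, some list items) will be counted too.
    """
    starts = [c.lstrip()[0] for c in chunks if c.strip()]
    if not starts:
        return 0.0
    suspicious = sum(1 for ch in starts if ch.islower() or ch in ",;:)")
    return suspicious / len(starts)
```

Running this over a sample of the index before and after changing the splitter gives a rough signal of how often chunks begin mid-sentence. It says nothing about where chunks end, so it complements an eval set rather than replacing one.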
The fix
- Switch to a recursive splitter that tries paragraph → sentence → word in order, with a target size band (e.g. 512-1024 tokens).
- Add header-aware prefixing: prepend the enclosing section’s heading to every chunk derived from it.
- Use modest overlap (10-15%) so a topic sentence near a boundary appears in both adjacent chunks.
- For long-form structured docs, consider semantic chunking (embed candidate boundaries, split where local similarity dips).
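The first three items can be sketched together in a few dozen lines. This is an illustration under stated assumptions, not a reference implementation: size is measured in whitespace-separated words as a rough stand-in for tokens, sentences are detected with a simple punctuation regex, overlap is carried as whole trailing sentences, and all names (`split_recursive`, `tail_sentences`, `chunk_section`) and default values are hypothetical.

```python
import re

# Boundaries tried in order: paragraph, sentence, word. Each entry pairs a
# split pattern with the string used to rejoin pieces at that level.
LEVELS = [(r"\n\s*\n", "\n\n"), (r"(?<=[.!?])\s+", " "), (r"\s+", " ")]
SENTENCE = r"(?<=[.!?])\s+"


def n_words(text: str) -> int:
    """Whitespace word count, used here as a rough proxy for tokens."""
    return len(text.split())


def split_recursive(text: str, max_words: int, level: int = 0) -> list[str]:
    """Split into pieces of at most max_words, preferring coarse boundaries."""
    text = text.strip()
    if not text:
        return []
    if n_words(text) <= max_words or level >= len(LEVELS):
        return [text]
    pattern, joiner = LEVELS[level]
    chunks: list[str] = []
    current = ""
    for part in re.split(pattern, text):
        part = part.strip()
        if not part:
            continue
        candidate = current + joiner + part if current else part
        if n_words(candidate) <= max_words:
            current = candidate  # still fits: keep packing
            continue
        if current:
            chunks.append(current)
        if n_words(part) <= max_words:
            current = part
        else:
            # Too big on its own: retry with the next finer boundary.
            chunks.extend(split_recursive(part, max_words, level + 1))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def tail_sentences(chunk: str, budget: int) -> str:
    """Trailing whole sentences of chunk that fit within budget words."""
    kept: list[str] = []
    for sentence in reversed(re.split(SENTENCE, chunk)):
        if n_words(" ".join([sentence] + kept)) > budget:
            break
        kept.insert(0, sentence)
    return " ".join(kept)


def chunk_section(heading: str, body: str,
                  max_words: int = 400, overlap: float = 0.12) -> list[str]:
    """Chunk one section, adding sentence-level overlap and a heading prefix."""
    budget = int(max_words * overlap)
    pieces = split_recursive(body, max_words - budget)
    out = []
    for i, piece in enumerate(pieces):
        carry = tail_sentences(pieces[i - 1], budget) if i > 0 else ""
        text = f"{carry} {piece}" if carry else piece
        out.append(f"{heading}\n\n{text}")  # header-aware prefix
    return out
```

Calling `chunk_section("Cache tuning", body)` once per section yields chunks that each carry their heading and, where the budget allows, the closing sentences of the previous chunk. Carrying whole sentences rather than a fixed number of characters keeps the overlap itself from starting mid-sentence; if the last sentence of a chunk exceeds the overlap budget, this sketch simply carries nothing.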
Related
- Solved by: contextual-chunking, semantic-chunking patterns.
- Often co-occurs with: ignoring document structure, single-retriever pipelines.