Your retriever is returning half-sentences and you wonder why

Your RAG pipeline chunks documents every 500 characters with a fixed-size splitter, not a recursive character splitter that respects document structure. You ask a question, the retriever returns 5 "relevant" chunks, and every one of them starts mid-sentence. "...tion middleware checks the token against Redis..." is not useful to the model: it cannot tell what the subject is, so its answer is vague and probably wrong.

This is the cost of chunking by character count alone. You cut across sentence boundaries, paragraph boundaries, code block boundaries, and semantic boundaries without caring. The retriever returns technically correct neighbors that are semantically gibberish.

RecursiveCharacterTextSplitter is the fix. It is the most underrated piece of the LangChain toolkit and the first chunker I reach for on any new RAG project. This post is how it works, why the separator order matters, the overlap math that keeps context flowing, and the tuning rules that beat naive chunking on real corpora.

Why does fixed-size chunking break RAG retrieval quality?

Because natural text has hierarchy and fixed-size splitting ignores it. A markdown doc has headings, paragraphs, sentences, words, and characters as nested levels of meaning. Splitting at character 500 will land inside a word half the time, inside a sentence almost always, and at a paragraph boundary almost never.

3 concrete ways this hurts retrieval:

  1. Embedding quality drops. An embedding model was trained on coherent text. A chunk that starts mid-sentence produces a weaker embedding because the model never saw "...tion middleware checks..." in training.
  2. Reranker scoring drops. Rerankers read the chunk as text and decide relevance. A half-sentence has less signal and ranks lower than a full paragraph, even when the underlying content is the same.
  3. Generator context is noisy. The LLM reads the chunk and has to guess what the missing start and end were. Sometimes it guesses wrong and answers off the data.

Fix by chunking at boundaries that preserve meaning. Recursive splitting is the algorithm.

graph TD
    Doc[Long document] --> Try1[Try split by double newline]
    Try1 -->|chunk too big| Try2[Try split by single newline]
    Try2 -->|still too big| Try3[Try split by period + space]
    Try3 -->|still too big| Try4[Try split by space]
    Try4 -->|last resort| Char[Split by character]
    Try1 -->|chunk fits| Done[Chunk ready]
    Try2 -->|chunk fits| Done
    Try3 -->|chunk fits| Done

    style Done fill:#dcfce7,stroke:#15803d

The splitter walks down a hierarchy of separators and uses the first one that produces chunks that fit the size budget. Higher-level separators are tried first so chunks land on paragraph breaks when possible and only fall back to word or character splits when necessary.

How does RecursiveCharacterTextSplitter work?

It takes an ordered list of separators (default: ["\n\n", "\n", " ", ""]) and a target chunk size. It recursively splits the text by each separator in order. If a resulting piece is still larger than the target, it tries the next separator on that piece. If it reaches the end of the list without success, it hard-splits by character.

The recursion is the key word. Each level only splits pieces that are too big. Pieces that already fit are kept intact at the highest-level boundary they already respect.
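Before reaching for the library, the core loop is worth seeing in isolation. Here is a minimal, dependency-free sketch of the idea. It is deliberately simplified: it rejoins pieces with the current separator and ignores overlap, both of which the real implementation handles more carefully.

```python
def split_recursive(text, chunk_size, separators):
    """Split by the first separator, recurse on oversized pieces,
    then greedily merge small pieces back up toward the budget."""
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard split by character.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            pieces.append(piece)
        else:
            # Still too big: fall through to the next separator.
            pieces.extend(split_recursive(piece, chunk_size, rest))
    # Greedy merge so chunks approach, but never exceed, the budget.
    chunks, buf = [], ""
    for piece in pieces:
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= chunk_size:
            buf = candidate
        else:
            chunks.append(buf)
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks

doc = "Para one.\n\nPara two is here.\n\nA very long paragraph that will not fit in one chunk at all."
chunks = split_recursive(doc, 30, ["\n\n", "\n", ". ", " ", ""])
```

The two short paragraphs merge into one chunk at the paragraph boundary; the long paragraph falls through to word-level splitting. Every chunk respects the 30-character budget.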

# filename: splitter_basic.py
# description: The minimum working example of recursive character splitting.
# LangChain ships this as RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter


splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', ''],
)

with open('docs/architecture.md') as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f'{len(chunks)} chunks, avg {sum(len(c) for c in chunks) // len(chunks)} chars')

A few lines to split a document into semantically coherent chunks. The separator order is what makes it work: paragraphs first, lines second, sentences third, words fourth, characters last. The splitter respects the highest-level boundary that fits the size budget.

Why is the separator order the whole game?

Because the order encodes the hierarchy of meaning in your documents. Pick the wrong order and chunks split in the wrong places.

3 common corpora and their right separator orders:

  1. Prose documentation: ['\n\n', '\n', '. ', ' ', '']. Paragraphs, lines, sentences, words, characters. The default.

  2. Markdown with code blocks: ['\n## ', '\n### ', '\n\n', '\n', '. ', ' ', '']. Headings first, then paragraphs. Keeps sections intact.

  3. Source code: ['\nclass ', '\ndef ', '\n\n', '\n', ' ', '']. Classes and functions are the meaningful boundaries. Never split in the middle of a function.

The LangChain RecursiveCharacterTextSplitter.from_language() helper has good defaults for Python, JavaScript, and other common languages. For custom shapes, write your own separator list from the top of the document hierarchy down to characters.

What is chunk_overlap and how do you set it?

Overlap is the number of characters from the end of one chunk that get copied to the start of the next. It exists because semantically coherent answers sometimes span chunk boundaries, and a small overlap means the retriever can return either neighbor and still have enough context.

The rule of thumb: set overlap to 10 to 20 percent of chunk size. With chunk_size=500, overlap of 50 to 100 characters is typical. Below 50, you lose boundary-spanning context. Above 200, you pay for duplicated tokens on every retrieval.

The math: if a sentence "...the auth flow uses Redis for session storage..." lands exactly at chunk boundary 500, a zero-overlap split leaves only a fragment of it in each chunk, so neither chunk carries the full meaning. An overlap of 100 keeps at least 100 characters of context on both sides so the retriever has a chance to find the full sentence in one chunk or the other.

Too much overlap is wasteful. Too little creates gaps. 10 to 20 percent is the safe range.
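The effect is easy to demonstrate with plain fixed-size chunking. This is a deliberately simplified sketch; the sentence and the sizes are illustrative.

```python
def fixed_chunks(text, size, overlap):
    """Fixed-size chunking with overlap: each chunk restarts
    `overlap` characters before the previous chunk ended."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "the auth flow uses Redis for session storage across services"

a = fixed_chunks(text, size=40, overlap=0)
b = fixed_chunks(text, size=40, overlap=10)
# With overlap=0, "storage" is split across a[0] and a[1];
# with overlap=10, it survives whole in b[1].
```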

How do you pick chunk_size for your embedding model?

From the model's training distribution, not from guessing. Most sentence embedding models were trained on pieces of 256 to 512 tokens. Chunks above that size get truncated during embedding, which silently loses the tail of the chunk. Chunks well below that size underuse the model.

Rule: chunk_size in characters should land around 1000 to 2000, because 1 token is roughly 4 characters. That puts a 1500-character chunk at roughly 375 tokens, which fits comfortably inside the 512-token window of most embedding models.

For long-context embedding models (BGE-M3, Voyage AI large, OpenAI text-embedding-3 large), chunk sizes up to 8000 characters are fine. For older models (all-MiniLM-L6-v2, all-mpnet-base-v2), stay at 1000 to 1500 to avoid truncation.

Check the model's max sequence length before tuning. A chunk size that exceeds it is silently broken.
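A quick guard against silent truncation, using the rough 4-characters-per-token heuristic from above. This is a sketch; swap in the model's real tokenizer for exact counts.

```python
def find_oversized(chunks, max_tokens=512, chars_per_token=4):
    """Return indices of chunks likely to exceed the embedding
    model's window, using the ~4 chars/token approximation."""
    limit = max_tokens * chars_per_token
    return [i for i, c in enumerate(chunks) if len(c) > limit]

# A 1500-char chunk fits a 512-token window (~2048 chars); a 3000-char one does not.
oversized = find_oversized(["a" * 1500, "b" * 3000])
```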

For the broader picture of how chunk size interacts with embedding quality, see the Choosing an Embedding Model for RAG post. For the rest of the RAG stack around the chunker, the free RAG Fundamentals primer is the right starting point.

Why does the splitter matter more than the embedding model?

Because a bad chunking strategy wastes any embedding model, no matter how good. An embedding that is computed over a half-sentence chunk is weaker than one over a full paragraph, regardless of the model. And rerankers, grounders, and generators all consume the same chunks downstream.

3 places where chunk quality propagates:

  1. Retrieval. Better chunks produce better cosine similarities because the embedding captures more meaning.
  2. Reranking. Cross-encoder rerankers read the chunk as text and score against the query. Coherent chunks score higher.
  3. Generation. The LLM reads the chunk and cites it. Coherent chunks lead to clean citations; half-sentences lead to hallucinated context.

The corollary: if you are getting bad retrieval, fix your chunking before you swap embedding models. A splitter change is free; an embedding migration is expensive.

For the full production RAG stack that uses recursive splitting alongside reranking and grounding, the Agentic RAG Masterclass covers it module by module.

What are the 3 failure modes to watch for?

First, chunks that are too small (under 200 characters). They lack context and the embedding is noisy. Usually caused by chunk_size set too aggressively low. Raise it to 500 minimum.

Second, chunks that are too big (over the embedding model's token limit). Silent truncation loses information. Run tokenizer(chunk) to check the actual token count against the model's max sequence length and fix the chunk size accordingly.

Third, chunks that span document boundaries. If you concatenate 2 documents before splitting, the splitter cannot tell them apart and may produce a chunk with the end of doc A and the start of doc B. Always split per document and keep the resulting chunks tagged with the source ID so the retriever does not mix them later.
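A sketch of per-document splitting with source tagging. The record shape here is an assumption; use whatever metadata schema your vector store expects.

```python
def chunk_corpus(docs, split_fn):
    """Split each document on its own and tag every chunk with its
    source ID, so no chunk can straddle two documents."""
    records = []
    for doc_id, text in docs.items():
        for i, chunk in enumerate(split_fn(text)):
            records.append({"source": doc_id, "chunk_index": i, "text": chunk})
    return records

docs = {
    "doc_a": "End of document A.",
    "doc_b": "Start of document B.",
}
# A trivial paragraph split stands in for the real splitter here.
records = chunk_corpus(docs, lambda t: t.split("\n\n"))
```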

What to do Monday morning

  1. Count your current chunks. Run len(chunk) on a sample of 100. If the histogram shows most under 200 or most over 3000, your chunk_size is wrong.
  2. Switch to RecursiveCharacterTextSplitter if you are using a fixed-size splitter. Use the default separators first, tune only if retrieval quality is obviously wrong.
  3. Set chunk_overlap to 10 to 20 percent of chunk_size. Nothing more, nothing less, unless you have a specific reason.
  4. For markdown documents, add heading separators to the top of the separator list. Section-preserving chunks improve retrieval on documentation corpora noticeably.
  5. Re-run your RAG eval. Expect a 5 to 15 point accuracy lift on questions where previous chunks were mid-sentence.
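Step 1 of the checklist in code. A sketch; the thresholds mirror the ones above.

```python
def chunk_length_report(chunks):
    """Summarize chunk lengths to spot a mis-set chunk_size."""
    lengths = sorted(len(c) for c in chunks)
    return {
        "min": lengths[0],
        "median": lengths[len(lengths) // 2],
        "max": lengths[-1],
        "under_200": sum(1 for n in lengths if n < 200),
        "over_3000": sum(1 for n in lengths if n > 3000),
    }

report = chunk_length_report(["x" * 100, "y" * 800, "z" * 4000])
```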

The headline: recursive character splitting is the default chunker for a reason. It respects natural document boundaries, the separator order teaches it your corpus's hierarchy, and the 10 to 20 percent overlap rule handles boundary-spanning context. Change it once and move on.

Frequently asked questions

What is RecursiveCharacterTextSplitter?

It is a chunking algorithm that splits text by trying an ordered list of separators (paragraphs, lines, sentences, words, characters) and using the highest-level one that produces chunks fitting the size budget. Unlike fixed-size splitters, it preserves semantic boundaries whenever possible, which improves retrieval quality on realistic corpora.

Why is RecursiveCharacterTextSplitter better than fixed-size chunking?

Because fixed-size splitters cut across sentence and paragraph boundaries, producing half-sentences that have weaker embeddings, rank lower in reranking, and confuse generators downstream. Recursive splitting uses the same character budget but respects natural boundaries first, which produces chunks that are semantically coherent without being any larger.

How do I choose chunk_size for my embedding model?

Check the embedding model's max sequence length and stay below it. For models trained on 512-token windows (most sentence transformers), target chunk sizes of 1000 to 1500 characters. For long-context embedding models like BGE-M3 or OpenAI text-embedding-3, chunks up to 8000 characters are safe. Exceeding the model's token limit causes silent truncation and lost information.

What should chunk_overlap be set to?

Between 10 and 20 percent of chunk_size. For a 500-character chunk size, 50 to 100 characters of overlap is typical. Too little creates gaps where boundary-spanning content is lost in both neighbors; too much wastes tokens by duplicating content. The 10 to 20 percent range is the safe default for almost every corpus.

How does the separator order affect chunking quality?

It encodes your document hierarchy. Prose documents want '\n\n' before '\n' before '. ' before ' '. Markdown wants heading markers first. Source code wants class and def first. The splitter tries separators in order and uses the first one that produces chunks within budget, so the order dictates which boundaries are preserved.

Key takeaways

  1. Fixed-size chunking breaks retrieval by cutting across sentence and paragraph boundaries. Recursive splitting respects natural hierarchy at the same character budget.
  2. The separator order is the whole game. Default ['\n\n', '\n', '. ', ' ', ''] works for prose; markdown and code need custom orders that start with heading markers or class/function boundaries.
  3. Set chunk_size from your embedding model's max sequence length. 1000 to 1500 characters for standard models, up to 8000 for long-context embeddings.
  4. Set chunk_overlap to 10 to 20 percent of chunk_size. Lower loses boundary context; higher wastes tokens.
  5. Fix chunking before swapping embedding models. A splitter change is free and often improves retrieval more than a model migration would.
  6. To see recursive splitting wired into a full production RAG stack with reranking, grounding, and self-correction, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.

For the full LangChain documentation on text splitters, language-specific variants, and advanced splitting patterns, see the LangChain text splitters guide. The from_language() helpers there are worth knowing for code-heavy corpora.
