Splitting techniques for RAG: the art of the right chunk

Sunil Samson Suresh
7 min read

In our last post, we built a RAG pipeline. One step we glossed over was how we split our documents into chunks.

That step is the most critical part of the entire RAG system.

Why do bad chunks hurt RAG systems?

Think about it: The Retriever's only job is to find the best chunks. If our chunks are bad, the Retriever will fail, and the LLM will give a bad answer.

Imagine a textbook where every paragraph is cut in half and randomly stitched to another half-paragraph. Finding a coherent answer would be impossible, no matter how smart the student is.

The quality of your chunks determines the quality of your retrieval, which determines the quality of your final answer. A bad splitting strategy will poison your entire pipeline.

What is fixed-size splitting?

The simplest way to chunk a document is to just count a fixed number of characters (e.g., 150) and then split. We can also add an "overlap" to repeat a few characters, hoping to keep some context.

Let's see why this is often a bad idea.

# filename: example.py
# description: Fixed-size splitting with LangChain's CharacterTextSplitter.
# This splitter just counts characters.
# It doesn't understand words or sentences.

from langchain_text_splitters import CharacterTextSplitter

fixed_splitter = CharacterTextSplitter(
    separator=" ",    # Just split on spaces
    chunk_size=150,   # Count to 150 characters
    chunk_overlap=20  # Repeat 20 characters in the next chunk
)

text = "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back."

chunks = fixed_splitter.split_text(text)

Resulting Chunks:

Chunk 1: One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like ba

Chunk 2: mour-like back.

Notice the disaster? It cut the word "back" in half: Chunk 1 ends with a dangling "ba", and Chunk 2 is just the overlap fragment "mour-like back." The semantic meaning is completely broken, and an LLM can't make sense of either piece.

Technique 2: a smarter default (recursive splitting)

A much better approach is to try splitting intelligently. We give the splitter a list of "separators" to try in order of priority.

The default list is often:

  1. Try to split on paragraphs (\n\n)
  2. If a chunk is still too big, try to split on sentences (.)
  3. If still too big, try to split on lines (\n)
  4. As a last resort, split on words ( )
# This splitter is "structure-aware"
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,    # Still a 150-char LIMIT
    chunk_overlap=20
)

text = "One morning... armour-like back.\n\nHis room... travelling salesman."

chunks = recursive_splitter.split_text(text)

Resulting Chunks:

Chunk 1: One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back.

Chunk 2: His room, a proper human room although a little too small, lay peacefully between its four familiar walls. A collection of textile samples lay spread out on the table - Samsa was a travelling salesman.

This is much better! The splitter saw the \n\n (paragraph break) and respected it. It created two perfect, semantically complete chunks. This is the best "default" strategy.
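The separator cascade described above can be sketched in plain Python. This is a simplified toy, not LangChain's actual implementation: it splits on the highest-priority separator, recurses into any piece that is still too big, then greedily merges neighbouring pieces back up to the size limit.

```python
# Toy recursive splitter. Priority order follows the list above:
# paragraphs, then sentences, then lines, then words.
SEPARATORS = ["\n\n", ".", "\n", " "]

def recursive_split(text, chunk_size=150, separators=SEPARATORS):
    """Split on the highest-priority separator, recursing into
    oversized pieces, then merge neighbours up to chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text.strip()] if text.strip() else []
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            pieces.append(part)
        else:
            # Still too big: fall through to the next separator.
            pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedy merge: pack neighbouring pieces into chunks <= chunk_size.
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current.strip():
                chunks.append(current.strip())
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks

# A paragraph break is respected, just like in the example above:
print(recursive_split("First paragraph.\n\nSecond paragraph.", chunk_size=20))
```

Note that this sketch drops some separators when rejoining and omits overlap entirely; the real splitter handles both, but the cascade-and-merge logic is the same.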

Technique 3: the advanced approach (semantic chunking)

What if we could split the document based on changes in topic? This is the idea behind semantic chunking. We create chunks by grouping sentences that are semantically similar.

Here's the logic:

graph TD
    A[Split text into sentences] --> B(Embed each sentence)
    B --> C{"Calculate similarity <br/> between Sentence 1 and 2"}
    C --> D{"Is similarity low?"}
    D -- Yes --> E[Start a NEW chunk]
    D -- No --> F["Keep sentences in the <br/> SAME chunk"]
    F --> G{"Calculate similarity <br/> between Sentence 2 and 3"}
    G --> D
    E --> G

Let's try this on a text with a clear topic shift:

Text:

"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris... It is named after the engineer Gustave Eiffel...

At night, the Eiffel Tower is illuminated by a dazzling light show. Every evening, the tower sparkles for five minutes every hour..."

A semantic chunker would analyze the sentences and find that the similarity between "Gustave Eiffel" and "At night... light show" is very low. It correctly identifies this as a topic break.

Resulting Chunks:

Chunk 1: (All about the tower's history and construction) "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris... It is named after the engineer Gustave Eiffel..."

Chunk 2: (All about the tower's lighting system) "At night, the Eiffel Tower is illuminated by a dazzling light show. Every evening, the tower sparkles for five minutes every hour..."

This is an incredibly powerful technique for long documents, as it creates chunks that are perfectly coherent and focused on a single topic.
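The loop in the diagram above is simple to sketch. In this toy version, word-overlap (Jaccard) similarity stands in for real embedding similarity so the example runs without a model; the `similarity` helper and the 0.1 threshold are illustrative choices, not from any library.

```python
# Toy semantic chunker: real implementations embed each sentence;
# here, word overlap between sentences stands in for embedding similarity.

def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(sentences, threshold=0.1):
    """Start a new chunk whenever consecutive sentences fall
    below the similarity threshold (a topic break)."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if similarity(prev, cur) < threshold:
            chunks.append([cur])      # topic break: start a new chunk
        else:
            chunks[-1].append(cur)    # same topic: keep sentences together
    return [" ".join(c) for c in chunks]

sentences = [
    "The Eiffel Tower is a lattice tower in Paris.",
    "The tower is named after the engineer Gustave Eiffel.",
    "At night a dazzling light show illuminates the city skyline.",
]
# Splits into two chunks at the topic break before "At night".
print(semantic_chunks(sentences))
```

Swap `similarity` for cosine similarity over sentence embeddings and this becomes the real algorithm; the threshold then becomes the main tuning knob.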

What is content-aware splitting?

Finally, the best splitters understand the type of content they are reading. If you're chunking computer code, you should split by functions or classes, not by paragraphs.

If you're chunking Markdown (like this blog post), you should split by headings (#) to keep sections together.

Text:

# Understanding LLMs

Large Language Models (LLMs) are a type of AI.

## Key Features

  • They are trained on vast amounts of text data.
  • They can generate human-like text.

Result with a Recursive Splitter (Bad):

Chunk 1: # Understanding LLMs

Large Language Models (LLMs) are a type of AI.

## Key

Chunk 2: Features

  • They are trained on vast...

Result with a Markdown-Aware Splitter (Good):

Chunk 1: # Understanding LLMs

Large Language Models (LLMs) are a type of AI.

Chunk 2: ## Key Features

  • They are trained on vast amounts of text data.
  • They can generate human-like text.

The Markdown-aware splitter knows that headings define new sections and intelligently keeps the heading and its content together in the same chunk.
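The heading-aware behaviour is easy to sketch. This is a hypothetical helper, not LangChain's MarkdownHeaderTextSplitter: it starts a new chunk at every heading line, so each heading stays with the body below it.

```python
# Toy Markdown-aware splitter: a new chunk begins at each '#' heading,
# keeping the heading and its content together.

def split_markdown_by_headings(text: str):
    """Break Markdown into chunks, one per heading section."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.lstrip().startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = (
    "# Understanding LLMs\n\n"
    "Large Language Models (LLMs) are a type of AI.\n\n"
    "## Key Features\n\n"
    "- They are trained on vast amounts of text data."
)
# Produces two chunks, one per heading section.
print(split_markdown_by_headings(doc))
```

A production version would also cap each section at a chunk-size limit (recursing with a recursive splitter inside oversized sections) and attach the heading path as metadata, but the core idea is just this: structure first, size second.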

Frequently asked questions

What are splitting techniques for RAG, and why do they matter?

Splitting (chunking) determines what your retriever can find. Fixed-size, recursive, semantic, and content-aware splitting each trade simplicity against retrieval quality; better chunks mean better retrieval and fewer hallucinations.

How do you implement splitting in production?

Start with the recursive splitter as a baseline, measure retrieval quality on real documents, and only add semantic or content-aware splitting where the baseline fails. The examples above show the code for each technique.

When should you use advanced splitting instead of the simple default?

Reach for semantic or content-aware splitting when the recursive default is failing on real workloads: when retrieved chunks cut across topics, when answers are wrong, or when the model is hallucinating. Otherwise the added complexity rarely earns its keep.

For the full API reference, see the LangChain text splitters documentation.

Key takeaways

  • Chunking is foundational: Your RAG system is only as good as its chunks. "Garbage In, Garbage Out."
  • Fixed-size is risky: Simple character-based splitting will break words and sentences. Avoid it.
  • Recursive is the best default: The RecursiveCharacterTextSplitter is a reliable and smart choice for most plain text.
  • Content-aware is best: For maximum quality, use a splitter that understands your content's structure (like Markdown, Code, or Semantic topics).

For more on building production AI systems, check out our AI Engineering Bootcamp.

