Chunking in AI: A Clear Practical Guide
Chunking is an AI preprocessing method that splits large texts or datasets into smaller, manageable parts for easier NLP and LLM processing.

Chunking is the process of splitting long documents into smaller units so language models can index, retrieve, and reason over them more effectively. In retrieval systems, chunking matters because the retriever does not search entire books or reports at once; it searches the chunks that represent them, so chunk quality directly affects relevance, recall, and answer faithfulness.
A good chunk is small enough to fit the model and embedding limits, but large enough to preserve a coherent idea. This is why modern libraries expose several chunking strategies rather than one universal default: different document structures, domains, and retrieval goals require different boundaries.
Why chunking matters
Large documents exceed the practical input limits of embedding models and retrieval pipelines, so they must be broken into smaller retrievable units. Length-based approaches are simple and consistent, while structure-aware and semantic approaches try to preserve meaning, organization, or both.
In practice, poor chunking creates two recurring problems. If chunks are too small, meaning gets fragmented and retrieval returns isolated facts without enough context; if chunks are too large, irrelevant text dilutes similarity search and wastes context window budget.
Main chunking families
1. Fixed-size chunking
Fixed-size chunking splits text by a constant number of characters or tokens. Its main strength is predictability: every chunk follows the same limit, which makes indexing simple and efficient, especially for baselines and large pipelines.
Its weakness is semantic blindness. A chunk can begin in the middle of an argument and end in the middle of a sentence, which often harms retrieval quality for concept-heavy material such as legal, scientific, or technical documents.
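
    from langchain_text_splitters import CharacterTextSplitter

    # Sizes are measured in cl100k_base tokens. The splitter still cuts only at its
    # separator ("\n\n" by default), so an oversized paragraph can exceed chunk_size.
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=300,
        chunk_overlap=40,
    )
    chunks = text_splitter.split_text(document)

When to use it: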
- Fast baseline experiments.
- Very large corpora where simplicity matters.
- Pipelines where downstream reranking will compensate for rough boundaries.
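The semantic blindness described above is easy to see with a plain character window. The following is a minimal sketch, not taken from any library; the function name `fixed_size_chunks` and the sample sentence are illustrative assumptions.

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Slice text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once a window has reached the end, so no redundant tail is emitted.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, stop, step)]


sample = "Chunking splits documents. Retrieval searches the chunks."
for chunk in fixed_size_chunks(sample, size=20):
    print(repr(chunk))
```

With a 20-character window the first chunk ends at "docu", in the middle of a word, which is exactly the kind of boundary that fixed-size chunking cannot avoid on its own.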
2. Recursive chunking
Recursive chunking is a structure-aware strategy that tries to keep larger language units intact first, such as paragraphs, and only falls back to smaller units like sentences or words when the chunk is still too large. LangChain's RecursiveCharacterTextSplitter is built around this idea.
This is often a strong default because it balances size control with natural text flow. It usually produces more readable and more retrievable chunks than fixed slicing without requiring embeddings during splitting.

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=80,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = text_splitter.split_text(document)

When to use it:
- General-purpose RAG systems.
- Mixed unstructured text.
- Documents where paragraph boundaries roughly match ideas.
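To make the fallback idea concrete, here is a simplified pure-Python sketch of recursive splitting. It is an illustration of the general technique under stated assumptions, not LangChain's implementation: the name `recursive_split` is hypothetical, there is no overlap, and separators that fall on a chunk boundary are dropped.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = current + head + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too large: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_len, rest))
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph is short.\n\nSecond paragraph is much longer. It has two sentences in it."
for chunk in recursive_split(text, max_len=40):
    print(repr(chunk))
```

In this example the short paragraph survives intact, while the long one is broken at its sentence boundary instead of at an arbitrary character position.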
3. Semantic chunking
Semantic chunking groups sentences that are close in meaning rather than only close in position. LlamaIndex describes its SemanticSplitterNodeParser as a parser that creates nodes from groups of semantically related sentences, using an embedding model and a similarity breakpoint rule.
This method usually improves conceptual coherence, especially when topics shift inside long paragraphs. Its cost is higher complexity, because the splitter itself needs embeddings and threshold tuning, and chunk lengths can become irregular if no hard limits are added.

    from llama_index.core.node_parser import SemanticSplitterNodeParser
    from llama_index.embeddings.openai import OpenAIEmbedding

    splitter = SemanticSplitterNodeParser.from_defaults(
        embed_model=OpenAIEmbedding(),
        buffer_size=1,
        breakpoint_percentile_threshold=95,
    )
    nodes = splitter.build_semantic_nodes_from_documents(documents)

When to use it:
- Knowledge bases with dense conceptual transitions.
- Scientific, legal, or product documentation.
- High-precision retrieval where semantic consistency matters more than uniform size.
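The similarity breakpoint rule can be illustrated without any embedding model. The sketch below assumes the distances between consecutive sentence groups have already been computed (the values here are made up) and shows how a percentile threshold selects the split points. It illustrates a percentile rule in general and is not a copy of LlamaIndex's internals.

```python
import math


def percentile(values, pct):
    """Linear-interpolation percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def breakpoints(distances, pct=95):
    """Indices i where the gap between group i and group i + 1 exceeds the threshold."""
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical cosine distances between consecutive sentence groups.
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.13, 0.60, 0.10, 0.12]
print(breakpoints(distances, pct=80))  # -> [3, 7]
print(breakpoints(distances, pct=95))  # -> [7]
```

Lowering the percentile produces more breakpoints and therefore smaller chunks, which is why this parameter usually needs tuning per corpus.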
4. Document-structure chunking
Some documents already contain meaningful boundaries: Markdown headings, HTML sections, JSON objects, or code functions. LangChain recommends splitting such material according to its inherent structure because this preserves logical organization and often keeps semantically related text together.
This is one of the most underused strategies in production systems. If the source already defines sections, the best boundary is often the one the author wrote rather than one inferred later by a generic splitter.

    from langchain_text_splitters import MarkdownHeaderTextSplitter

    # Each tuple maps a header marker to the metadata key stored on the chunk.
    headers_to_split_on = [
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    chunks = splitter.split_text(markdown_text)

Other useful structure-driven patterns include:
- HTML by section or tag.
- JSON by object or array element.
- Code by class, function, or file block.
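As one illustration of the code pattern in the list above, Python's standard `ast` module can split a source file at top-level function and class boundaries. This is a minimal sketch; the helper name `split_python_source` is hypothetical, and module-level statements such as imports are simply skipped here.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in a Python module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text of the definition (decorators are not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks


source = '''
import os

def load(path):
    return open(path).read()

class Index:
    def add(self, chunk):
        pass
'''
for chunk in split_python_source(source):
    print(chunk["kind"], chunk["name"])
```

Keeping the name and kind alongside the text gives the retriever useful metadata for filtering, in the same spirit as header metadata on Markdown chunks.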
5. Element-based and table-aware chunking
Element-based chunking splits a document according to detected elements such as titles, paragraphs, lists, tables, figures, or code blocks, rather than treating the whole file as plain running text. This makes it different from sentence or paragraph chunking: sentence and paragraph methods preserve linguistic boundaries, while element-based methods preserve document objects and layout semantics.
Table-aware chunking is a specific form of element-based chunking for structured content. Its purpose is to keep rows, headers, and cell relationships intact, because a table loses meaning when values are separated from their column names or when rows are cut across arbitrary text windows.
This distinction matters in PDFs, annual reports, invoices, scientific papers, and product catalogs. In those sources, a table is not just another paragraph; it is a compact relational structure that often needs its own parsing and serialization strategy before embedding.

    def serialize_table(headers, rows):
        chunks = []
        for row in rows:
            item = ", ".join(f"{h}: {v}" for h, v in zip(headers, row))
            chunks.append(item)
        return chunks

    headers = ["Product", "Price", "Stock"]
    rows = [
        ["Widget A", "$29", "150"],
        ["Widget B", "$35", "80"],
    ]
    table_chunks = serialize_table(headers, rows)

A practical element-based pipeline often looks like this:
- Detect document elements with a parser or OCR-layout tool.
- Keep titles, paragraphs, and lists as separate units.
- Serialize tables into row-wise or table-wise text with headers attached.
- Apply semantic or recursive splitting only inside oversized elements.
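The pipeline steps above can be sketched as a small dispatcher. Everything about the input shape is an assumption made for illustration: `elements` is a hypothetical list of dicts with a `type` key, as a layout parser might produce, and the hard character cut stands in for whichever recursive or semantic splitter is applied to oversized elements.

```python
def chunk_elements(elements, max_chars=800):
    """Turn parsed document elements into chunks, one policy per element type."""
    chunks = []
    for el in elements:
        if el["type"] == "table":
            # Attach headers to every row so each chunk is self-describing.
            for row in el["rows"]:
                chunks.append(", ".join(f"{h}: {v}" for h, v in zip(el["headers"], row)))
        elif len(el["text"]) <= max_chars:
            # Titles, paragraphs, and lists that fit stay as separate units.
            chunks.append(el["text"])
        else:
            # Oversized element: placeholder for a recursive or semantic splitter.
            text = el["text"]
            chunks.extend(text[i:i + max_chars] for i in range(0, len(text), max_chars))
    return chunks


elements = [
    {"type": "title", "text": "Q3 Report"},
    {"type": "paragraph", "text": "Revenue grew."},
    {"type": "table", "headers": ["Product", "Price"],
     "rows": [["Widget A", "$29"], ["Widget B", "$35"]]},
]
for chunk in chunk_elements(elements):
    print(chunk)
```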
6. Sentence or paragraph chunking
Sentence-based chunking keeps sentence boundaries intact, while paragraph-based chunking keeps author-defined paragraph units. These methods are simple, interpretable, and often strong in editorial or scientific content where sentences and paragraphs already carry coherent units of thought.
They are not the same as element-based chunking. A sentence splitter operates on prose boundaries, while an element-aware splitter first asks what kind of object it is processing; for example, a paragraph, a bullet list, or a table may each require a different policy.
They are less adaptive than recursive or semantic methods, but they are easier to debug. In many systems, paragraph chunking with light overlap is a practical midpoint between raw fixed windows and more expensive semantic splitting.
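
    MAX_CHARS = 800  # soft upper bound per chunk, in characters

    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for p in paragraphs:
        # Try to append the next paragraph to the chunk being built.
        candidate = (current + "\n\n" + p).strip() if current else p
        if len(candidate) <= MAX_CHARS:
            current = candidate
        else:
            # Flush the current chunk; a single oversized paragraph is kept whole.
            if current:
                chunks.append(current)
            current = p
    if current:
        chunks.append(current)

7. Token-aware chunking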
Length constraints in modern LLM systems are ultimately token constraints, not character constraints. LangChain exposes token-based splitting because token counts align more directly with model context windows and embedding limits than raw character counts do.
This matters when working across languages, code, or markup, where the same number of characters can map to very different token counts. Token-aware chunking is therefore safer than character counting when budget accuracy is important.

    from langchain_text_splitters import CharacterTextSplitter

    splitter = CharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=256,
        chunk_overlap=32,
    )
    chunks = splitter.split_text(document)

Overlap and boundary control
Chunk overlap repeats a small portion of neighboring text so information near boundaries is less likely to be lost. Overlap can improve retrieval continuity, but too much overlap increases index size, redundancy, and the chance of retrieving near-duplicate passages.
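A quick calculation shows how overlap inflates the index. The sketch below assumes an idealized sliding window over a hypothetical 100,000-token corpus; real splitters respect separators, so actual counts will differ somewhat.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover `total_tokens`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens contributed by each extra window
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


baseline = chunk_count(100_000, 600, 0)
for overlap in (0, 60, 300):
    n = chunk_count(100_000, 600, overlap)
    print(f"overlap={overlap}: {n} chunks ({n / baseline:.2f}x)")
```

Under these assumptions, 10 percent overlap costs roughly 11 percent more chunks (186 versus 167), while 50 percent overlap roughly doubles the index (333 chunks).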
A simple practical rule is to use overlap when splitting by length or recursion, and to use less overlap when chunks are already aligned to strong structures such as headings or semantic breaks. For example, a recursive splitter with 10 to 20 percent overlap is often a sensible starting point for technical prose, then adjusted after observing retrieval results.
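
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 90 / 600 = 15 percent overlap, inside the suggested 10 to 20 percent range.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=600,
        chunk_overlap=90,
    )

Framework examples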
LangChain
LangChain provides several splitter families, including recursive, token-based, and structure-based splitters for formats such as Markdown, HTML, JSON, and code. This makes it a practical choice when the chunking strategy must be swapped quickly during experimentation.

    from langchain_text_splitters import (
        RecursiveCharacterTextSplitter,
        MarkdownHeaderTextSplitter,
    )

    recursive = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    plain_chunks = recursive.split_text(document)

    md_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2")]
    )
    md_chunks = md_splitter.split_text(markdown_text)

LlamaIndex
LlamaIndex offers semantic splitting through SemanticSplitterNodeParser, where breakpoints are driven by semantic dissimilarity between sentence groups and can be tuned through parameters such as buffer_size and breakpoint_percentile_threshold. This is particularly useful when chunk boundaries should follow topic transitions rather than formatting symbols.

    from llama_index.core import Document
    from llama_index.core.node_parser import SemanticSplitterNodeParser
    from llama_index.embeddings.openai import OpenAIEmbedding

    parser = SemanticSplitterNodeParser.from_defaults(
        embed_model=OpenAIEmbedding(),
        buffer_size=1,
        breakpoint_percentile_threshold=95,
    )
    nodes = parser.build_semantic_nodes_from_documents([
        Document(text=document)
    ])

Minimal custom semantic chunker
A custom semantic chunker is useful when full framework adoption is unnecessary. The basic idea is to split into sentences, embed each sentence, compute similarity across neighbors, and open a new chunk when semantic distance crosses a threshold.

    import re

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    SIMILARITY_THRESHOLD = 0.65  # starting value; tune on your own data

    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    chunks = []
    if sentences:  # guard against an empty document
        emb = model.encode(sentences)
        current = [sentences[0]]
        for i in range(1, len(sentences)):
            # Compare each sentence with its immediate predecessor.
            sim = cosine_similarity([emb[i - 1]], [emb[i]])[0][0]
            if sim < SIMILARITY_THRESHOLD:
                chunks.append(" ".join(current))
                current = [sentences[i]]
            else:
                current.append(sentences[i])
        chunks.append(" ".join(current))

Choosing the right strategy
No chunking method is best in all settings. The right method depends on the source format, the retrieval objective, and the acceptable trade-off between simplicity, cost, and semantic fidelity.
| Scenario | Recommended strategy | Why |
|---|---|---|
| Raw plain text corpus | Recursive chunking | Preserves natural boundaries while keeping size under control. |
| Markdown or HTML docs | Structure-based chunking | Headings and sections already encode meaning. |
| Scientific or legal text | Semantic chunking | Topic continuity matters more than equal chunk length. |
| PDF reports with tables | Element-based or table-aware chunking | Preserves layout objects and keeps tabular relationships intact. |
| Fast baseline system | Fixed-size or token-based chunking | Easy to implement and benchmark. |
| Code repository | Structure-based code splitting | Functions and classes are more meaningful than arbitrary windows. |
A useful engineering workflow is simple:
- Start with recursive or structure-based splitting.
- Measure retrieval quality on real queries.
- Move to semantic chunking when failures come from topic mixing or poor conceptual boundaries.
- Tune chunk size and overlap only after inspecting retrieved passages.
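The "measure retrieval quality" step in the workflow above can start as a simple hit rate over a handful of real queries. The sketch below is an assumption-laden illustration: `hit_rate_at_k` and `keyword_retriever` are hypothetical helpers, and the keyword ranker only stands in for whatever vector store or retriever the system actually uses.

```python
def hit_rate_at_k(queries, retrieve, k=5):
    """Fraction of queries whose expected snippet appears in the top-k chunks."""
    hits = 0
    for question, expected_snippet in queries:
        top_chunks = retrieve(question, k)
        if any(expected_snippet in chunk for chunk in top_chunks):
            hits += 1
    return hits / len(queries) if queries else 0.0


def keyword_retriever(chunks):
    """Toy stand-in for a real retriever: rank chunks by shared lowercase words."""
    def retrieve(question, k):
        words = set(question.lower().split())
        ranked = sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))
        return ranked[:k]
    return retrieve


chunks = [
    "Refunds are issued within 14 days.",
    "Shipping takes 3 to 5 days.",
    "Support is open on weekdays.",
]
queries = [
    ("how long do refunds take", "14 days"),
    ("when is support open", "weekdays"),
]
print(hit_rate_at_k(queries, keyword_retriever(chunks), k=1))
```

Running the same query set against two chunking configurations gives a direct comparison before any LLM answer is generated.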
Common mistakes
Several chunking failures are avoidable:
- Using one global chunk size for every document type.
- Ignoring document structure even when headers or sections exist.
- Using semantic chunking without hard limits, creating unstable chunk lengths.
- Adding excessive overlap, which inflates storage and duplicates results.
- Evaluating the LLM answer quality without first inspecting the retrieved chunks.
These mistakes matter because retrieval systems fail upstream before they fail at generation. In many so-called model errors, the real issue is that the retriever received weak chunks, not that the language model reasoned badly.
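One of the mistakes listed above, semantic chunking without hard limits, can be guarded against with a small post-processing pass. This is a minimal sketch under the assumption that chunks are plain strings measured in characters; the helper name `enforce_max_length` is hypothetical, and a token-based limit would follow the same pattern.

```python
def enforce_max_length(chunks, max_chars):
    """Split any chunk longer than max_chars, preferring to cut at a space."""
    bounded = []
    for chunk in chunks:
        chunk = chunk.strip()
        while len(chunk) > max_chars:
            # Last space that keeps the piece within the limit.
            cut = chunk.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars  # no usable space: hard cut
            bounded.append(chunk[:cut].rstrip())
            chunk = chunk[cut:].lstrip()
        if chunk:
            bounded.append(chunk)
    return bounded


print(enforce_max_length(["alpha beta gamma delta", "short"], max_chars=11))
```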
Further reading
The following references are useful for readers who want to go deeper into implementation details, APIs, and production trade-offs:
- LangChain Text Splitter integrations for recursive, token-aware, and structure-based splitters.
- LangChain TextSplitter reference for the base interface and splitter behavior.
- LlamaIndex SemanticSplitterNodeParser for semantic chunking with embedding-based breakpoints.
- Pinecone: Chunking Strategies for LLM Applications for a practical overview of chunking trade-offs in retrieval systems.
Final perspective
Chunking is not a preprocessing detail; it is a modeling decision embedded inside the retrieval pipeline. Fixed-size, recursive, structure-based, element-based, sentence-level, token-aware, and semantic methods each encode a different assumption about where meaning lives in text.
A strong practical default is recursive chunking for plain text, structure-based chunking for sources with explicit headings or markup, element-based chunking for mixed-layout documents, and semantic chunking for collections where conceptual continuity is the main retrieval challenge. The best systems treat chunking as an empirical design variable to test, not a constant to assume.
