How to Choose Chunk Size in Langchain for Optimal Performance
Choose a chunk size based on your document length and the model's token limit so that important context isn't cut off. A chunk size of 500-1000 characters is a common starting point; go smaller for dense, detailed texts and larger for simpler content to balance performance and accuracy.

Syntax
When using Langchain's text splitting utilities, the chunk_size parameter defines how many characters or tokens each chunk will contain. The chunk_overlap parameter controls how much overlap exists between chunks to preserve context.
Example parameters:
- `chunk_size=1000`: each chunk contains up to 1000 characters.
- `chunk_overlap=200`: each chunk overlaps the previous one by 200 characters.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = text_splitter.split_text(long_text)
```
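To build intuition for how the two parameters interact, here is a simplified pure-Python sketch of fixed-size chunking with overlap. This is not LangChain's actual algorithm (RecursiveCharacterTextSplitter also splits on separators such as paragraphs and sentences); it only illustrates the arithmetic of the sliding window.

```python
# Simplified sketch of fixed-size chunking with overlap. Not LangChain's
# real splitter: it ignores separators and just slides a window.
def naive_chunks(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap  # window advances by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "x" * 3000
chunks = naive_chunks(text, chunk_size=1000, chunk_overlap=200)
# step = 800, so 3000 characters yield 4 chunks of at most 1000 characters,
# and the last 200 characters of each chunk repeat at the start of the next.
```

Note that a larger overlap means more duplicated text across chunks, which increases the total number of chunks and the storage cost of your vector index.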
Example
This example splits a long text into chunks of 500 characters with a 100-character overlap, demonstrating how chunk size affects the number of chunks while the overlap preserves context between them.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Repeat a sentence to simulate a long document
long_text = "Langchain helps you build applications with language models. " * 100

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
)
chunks = text_splitter.split_text(long_text)

print(f"Number of chunks: {len(chunks)}")
print(f"First chunk preview: {chunks[0][:100]}...")
```
Common Pitfalls
Choosing a chunk size that is too large can cause the model to exceed its token limit, leading to errors or truncated input. Chunks that are too small may lose context and reduce retrieval quality. Skipping overlap entirely can split important information awkwardly across chunk boundaries.
Always balance chunk size with your model's max token limit and the nature of your text.
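One way to catch the token-limit pitfall early is a quick sanity check before indexing. The sketch below uses the rough heuristic of about 4 characters per token for English text; the `estimated_tokens` and `fits_model` helpers are illustrative, not part of LangChain, and a real tokenizer (e.g. tiktoken) should be used for exact counts.

```python
# Rough sanity check that chunks will fit a model's context window.
# The 4-chars-per-token ratio is a common heuristic for English text,
# not an exact count; use a real tokenizer in production.
def estimated_tokens(text, chars_per_token=4):
    return len(text) // chars_per_token

def fits_model(chunks, max_tokens):
    return all(estimated_tokens(c) <= max_tokens for c in chunks)

chunks = ["a" * 1000, "b" * 1200]
print(fits_model(chunks, max_tokens=300))  # 1200 chars ~ 300 tokens -> True
print(fits_model(chunks, max_tokens=200))  # 1000 chars ~ 250 tokens -> False
```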
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Wrong: chunk size likely exceeds the model's context window
text_splitter_wrong = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=0)

# Right: chunk size fits model limits, with overlap to preserve context
text_splitter_right = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```
Quick Reference
| Tip | Description |
|---|---|
| Chunk Size | 500-1000 characters is typical for balance |
| Chunk Overlap | 100-200 characters to keep context between chunks |
| Model Token Limit | Keep chunk size (in tokens) below the model's max tokens |
| Text Type | Use smaller chunks for detailed text, larger for simple text |
| Test & Adjust | Experiment with chunk sizes for best retrieval results |
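The "Test & Adjust" row can be sketched as a small sweep over candidate chunk sizes. The sliding-window splitter below is a simplified stand-in for LangChain's splitter (it ignores separators), and the 20% overlap ratio is an illustrative choice, not a LangChain default; in practice you would split with the real splitter and evaluate retrieval quality for each setting.

```python
# Sweep several chunk sizes and report how many chunks each produces,
# as a cheap starting point for the "test and adjust" step.
def naive_chunks(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap  # window advances by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "Langchain helps you build applications with language models. " * 100

counts = []
for size in (250, 500, 1000):
    overlap = size // 5  # illustrative choice: overlap at ~20% of chunk size
    n = len(naive_chunks(doc, size, overlap))
    counts.append(n)
    print(f"chunk_size={size} overlap={overlap} -> {n} chunks")
```

Larger chunk sizes always produce fewer chunks, but fewer chunks is not automatically better: the right size is the one that retrieves the most relevant context for your queries.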