How to Use RecursiveCharacterTextSplitter in LangChain
Use RecursiveCharacterTextSplitter in LangChain by creating an instance with the desired chunk size and overlap, then call split_text or split_documents to break large text or documents into smaller pieces. The splitter recursively tries a list of separators, so chunks end at paragraph, line, or word boundaries where possible instead of being cut at arbitrary positions.
Syntax
The RecursiveCharacterTextSplitter class is initialized with parameters like chunk_size and chunk_overlap. You then use its split_text(text) method to split a long string or split_documents(documents) for document objects.
Key parts:
- chunk_size: Maximum size of each chunk, measured in characters by default.
- chunk_overlap: Number of characters to overlap between chunks for context.
- separators: List of characters or strings to try splitting on, in order, from coarsest (paragraph breaks) to finest (spaces).
python
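from langchain.text_splitter import RecursiveCharacterTextSplitter

# Placeholder input; any long string works here.
long_text = "First paragraph.\n\nSecond paragraph."

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum characters per chunk
    chunk_overlap=100,    # characters shared between neighbouring chunks
    separators=["\n\n", "\n", " "],  # tried in this order
)
chunks = splitter.split_text(long_text)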
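To make the "recursive" part concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not LangChain's implementation: recursive_split is a hypothetical helper, it ignores chunk_overlap entirely, and it assumes a hard character cut is acceptable once every separator has been tried.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into chunks of at most chunk_size characters (no overlap).

    Separators are tried in order; a finer separator is only used for
    pieces that are still too long after splitting on a coarser one.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to try: fall back to a hard cut every chunk_size chars.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: recurse with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit allows."
for chunk in recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", " "]):
    print(len(chunk), repr(chunk))
# 18 'Para one is short.'
# 29 'Para two is a bit longer than'
# 17 'the limit allows.'
```

The first paragraph fits in one chunk, so it is kept whole. Only the second paragraph, which is too long, is split again at spaces.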
Example
This example shows how to split a long text into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk and its length.
python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "Langchain helps you build applications with language models. "
    "Sometimes texts are too long to process at once, so splitting them is useful. "
    "RecursiveCharacterTextSplitter splits text by trying different separators like paragraphs, lines, and spaces. "
    "It keeps chunks under a set size and overlaps for context."
)

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

# Print each chunk with its 1-based index and character length.
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (length {len(chunk)}): {chunk}")
Output
The script prints one line per chunk, in the form Chunk N (length L): text. Exact boundaries can vary with the installed LangChain version, but with these settings every chunk is at most 50 characters long. Most chunks are shorter than 50 because the splitter breaks at spaces rather than mid-word, and a chunk may begin with the last word or two of the previous chunk (up to 10 characters of overlap).
Common Pitfalls
Common mistakes when using RecursiveCharacterTextSplitter include:
- Setting chunk_size too small, causing too many tiny chunks.
- Setting chunk_overlap larger than chunk_size, which raises a ValueError.
- Not providing appropriate separators, leading to awkward splits.
- Using split_text on document objects instead of split_documents.
Always check your chunk sizes and overlaps to ensure they make sense.
python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Wrong: overlap larger than chunk size
try:
    splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=60)
except ValueError as e:
    print(f"Error: {e}")

# Right: overlap smaller than chunk size
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
Output
Error: Got a larger chunk overlap (60) than chunk size (50), should be smaller.
Quick Reference
Tips for using RecursiveCharacterTextSplitter effectively:
- Choose chunk_size based on your model's max input length.
- Use chunk_overlap to keep context between chunks, typically 10-20% of chunk size.
- Customize separators to match your text structure (paragraphs, lines, spaces).
- Use split_text for plain strings and split_documents for LangChain Document objects.
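Because chunk_size counts characters by default while model limits are usually stated in tokens, a rough conversion can help when applying the first two tips. The sketch below is illustrative arithmetic only: suggest_chunk_settings is a hypothetical helper, and the figure of 4 characters per token is a common rule of thumb for English text, not a property of any particular model.

```python
from typing import Tuple


def suggest_chunk_settings(
    token_budget: int,
    chars_per_token: float = 4.0,   # assumption: rough average for English text
    overlap_ratio: float = 0.15,    # middle of the 10-20% range from the tips
) -> Tuple[int, int]:
    """Return (chunk_size, chunk_overlap) in characters for a per-chunk token budget."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    chunk_size = round(token_budget * chars_per_token)
    chunk_overlap = round(chunk_size * overlap_ratio)
    return chunk_size, chunk_overlap


chunk_size, chunk_overlap = suggest_chunk_settings(token_budget=500)
print(chunk_size, chunk_overlap)  # 2000 300
```

Treat the result as a starting point and adjust it after inspecting real chunks, since token counts vary by tokenizer and language.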
Key Takeaways
Use RecursiveCharacterTextSplitter to split long texts into manageable chunks with overlap for context.
Set chunk_size larger than chunk_overlap to avoid errors and too many small chunks.
Customize separators to split text naturally by paragraphs, lines, or spaces.
Use split_text for strings and split_documents for document objects.
Test chunk outputs to ensure splits make sense for your application.
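One way to act on the last takeaway is a small check that runs on whatever list of strings split_text returns. The sketch below uses only the standard library. shared_overlap and check_chunks are hypothetical helper names, and the sample chunks are hand-written for illustration rather than real splitter output.

```python
from typing import List, Tuple


def shared_overlap(prev: str, nxt: str) -> str:
    """Return the longest suffix of prev that is also a prefix of nxt."""
    for size in range(min(len(prev), len(nxt)), 0, -1):
        if prev.endswith(nxt[:size]):
            return nxt[:size]
    return ""


def check_chunks(chunks: List[str], chunk_size: int) -> Tuple[List[str], List[str]]:
    """Return (chunks longer than chunk_size, overlap text between neighbours)."""
    too_long = [c for c in chunks if len(c) > chunk_size]
    overlaps = [shared_overlap(a, b) for a, b in zip(chunks, chunks[1:])]
    return too_long, overlaps


# Hand-written sample chunks (hypothetical, not produced by the splitter).
sample_chunks = [
    "build applications with",
    "with language models.",
    "Sometimes texts are long.",
]
too_long, overlaps = check_chunks(sample_chunks, chunk_size=50)
print(too_long)  # []
print(overlaps)  # ['with', '']
```

An empty too_long list confirms the size limit held. The overlaps list shows how much context each chunk actually carries over from the one before it.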