
How to Use RecursiveCharacterTextSplitter in Langchain

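To use RecursiveCharacterTextSplitter in Langchain, create an instance with the desired chunk size and overlap, then call split_text or split_documents to break a long string or a list of documents into smaller pieces. The splitter works through a list of separators in order (by default paragraph breaks, then line breaks, then spaces) and only falls back to a finer separator when a piece is still too large, which keeps paragraphs and sentences together where the chunk size allows.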

📐

Syntax

The RecursiveCharacterTextSplitter class is initialized with parameters like chunk_size and chunk_overlap. You then use its split_text(text) method to split a long string or split_documents(documents) for document objects.

Key parts:

chunk_size: Maximum size of each chunk, measured in characters by default.
chunk_overlap: Maximum number of characters shared between consecutive chunks so that context carries over.
separators: List of strings to try splitting on, in order, from coarsest (paragraph breaks) to finest (spaces).

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", " "]
)

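long_text = "First paragraph.\n\nSecond paragraph."  # placeholder; use your own text
chunks = splitter.split_text(long_text)  # returns a list of strings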
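To make the separator fallback described above more concrete, here is a minimal pure-Python sketch of the idea. It is not Langchain's implementation: the function name recursive_split is hypothetical, overlap is left out entirely, and separators are dropped rather than kept in the chunks. It only illustrates how a piece that is still too large gets retried with the next, finer separator.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Simplified sketch: split on the coarsest separator, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer.\nIt has two lines."
chunks = recursive_split(sample, chunk_size=32)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this sketch the first paragraph fits as a whole, while the longer second paragraph is split first on the line break and then, for the line that is still too long, on spaces.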

💻

Example

This example shows how to split a long text into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk and its length.

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = ("Langchain helps you build applications with language models. "
        "Sometimes texts are too long to process at once, so splitting them is useful. "
        "RecursiveCharacterTextSplitter splits text by trying different separators like paragraphs, lines, and spaces. "
        "It keeps chunks under a set size and overlaps for context.")

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (length {len(chunk)}): {chunk}")

Output

Each printed line has the form "Chunk N (length M): text". The sample text contains no line breaks, so the splitter falls back to splitting on spaces: every chunk ends on a word boundary, M never exceeds 50, and consecutive chunks can repeat up to 10 characters of trailing words. Exact boundaries can vary between Langchain versions.

⚠️

Common Pitfalls

Common mistakes when using RecursiveCharacterTextSplitter include:

Setting chunk_size too small, causing too many tiny chunks.
Setting chunk_overlap larger than chunk_size, which raises a ValueError.
Not providing appropriate separators, leading to awkward splits.
Using split_text on document objects instead of split_documents.

Always check your chunk sizes and overlaps to ensure they make sense.

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Wrong: overlap larger than chunk size
try:
    splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=60)
except ValueError as e:
    print(f"Error: {e}")

# Right: overlap smaller than chunk size
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)

Output

Error: Got a larger chunk overlap (60) than chunk size (50), should be smaller.

📊

Quick Reference

Tips for using RecursiveCharacterTextSplitter effectively:

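Choose chunk_size based on your model's maximum input length, keeping in mind that chunk_size counts characters by default while model limits are usually given in tokens.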
Use chunk_overlap to keep context between chunks, typically 10-20% of chunk size.
Customize separators to match your text structure (paragraphs, lines, spaces).
Use split_text for plain strings and split_documents for Langchain Document objects.
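As a small illustration of the 10-20% guideline above, the helper below derives an overlap from a chunk size. The name pick_overlap and the default fraction of 0.15 are assumptions made for this sketch; they are not part of Langchain.

```python
def pick_overlap(chunk_size, fraction=0.15):
    """Return a chunk_overlap that is `fraction` of chunk_size, rounded to a whole number."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1) so the overlap stays below chunk_size")
    return round(chunk_size * fraction)


for size in (200, 500, 1000):
    # Print the chunk size next to its 10% and 20% overlap values.
    print(size, pick_overlap(size, 0.10), pick_overlap(size, 0.20))
```

The result can be passed directly as chunk_overlap when creating the splitter.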

✅

Key Takeaways

Use RecursiveCharacterTextSplitter to split long texts into manageable chunks with overlap for context.

Keep chunk_overlap smaller than chunk_size to avoid a ValueError, and avoid very small chunk_size values that produce many tiny chunks.

Customize separators to split text naturally by paragraphs, lines, or spaces.

Use split_text for strings and split_documents for document objects.

Test chunk outputs to ensure splits make sense for your application.
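One way to test chunk outputs is a small report over the list of strings returned by split_text. The sketch below is only an illustration: check_chunks is a hypothetical helper, and the chunks list is hand-written here so that the example runs without Langchain installed.

```python
def check_chunks(chunks, chunk_size):
    """Return a small report about a list of string chunks."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(chunks),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        # Chunks longer than the limit.
        "oversized": [c for c in chunks if len(c) > chunk_size],
        # Chunks that are empty or contain only whitespace.
        "empty": sum(1 for c in chunks if not c.strip()),
    }


# Hand-written stand-in for the result of splitter.split_text(text).
chunks = ["Langchain helps you build", "build applications with", "language models."]
report = check_chunks(chunks, chunk_size=25)
print(report)
```

Oversized chunks are worth checking for when you pass a custom separators list, since a long run of text that contains none of the listed separators cannot be broken up further.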