How to Use RecursiveCharacterTextSplitter in LangChain

To use RecursiveCharacterTextSplitter in LangChain, create an instance with the desired chunk size and overlap, then call split_text or split_documents to break long text or documents into smaller pieces. The splitter tries a list of separators in order, falling back to the next one only when a piece is still too long, so chunks tend to end at paragraph, line, or word boundaries rather than mid-word.

Syntax

The RecursiveCharacterTextSplitter class is initialized with parameters like chunk_size and chunk_overlap. You then use its split_text(text) method to split a long string or split_documents(documents) for document objects.

Key parts:

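  • chunk_size: Maximum size of each chunk, measured in characters by default.
  • chunk_overlap: Number of characters shared between consecutive chunks to preserve context.
  • separators: List of strings to try splitting on, in order, from coarsest (paragraph breaks) to finest (spaces). If omitted, the default is ["\n\n", "\n", " ", ""].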
python
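from langchain.text_splitter import RecursiveCharacterTextSplitter
# Newer releases also ship this class in the langchain_text_splitters package.

# Placeholder input; replace with your own text.
long_text = "First paragraph of a long document.\n\nSecond paragraph. " * 50

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,                 # maximum characters per chunk
    chunk_overlap=100,               # characters shared between neighboring chunks
    separators=["\n\n", "\n", " "],  # tried in this order
)

chunks = splitter.split_text(long_text)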
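
To make the role of separators more concrete, here is a simplified pure-Python sketch of the recursive fallback idea. It is not LangChain's implementation: the function name recursive_split is hypothetical, and the sketch leaves out chunk_overlap and the step where the real splitter merges small pieces back together up to chunk_size.

```python
def recursive_split(text, chunk_size, separators):
    """Split on the first separator; retry oversized pieces with the remaining ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    pieces = []
    # Note: this sketch does not support the empty-string separator "".
    for part in text.split(first):
        pieces.extend(recursive_split(part, chunk_size, rest))
    return pieces


sample = "Intro paragraph.\n\nA second, much longer paragraph here."
pieces = recursive_split(sample, chunk_size=20, separators=["\n\n", "\n", " "])
print(pieces)
```

The short first paragraph fits and stays whole, while the longer second paragraph falls through to the space separator and comes back as single words. That is why the real splitter adds a merge step on top of this idea.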

Example

This example shows how to split a long text into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk and its length.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = ("Langchain helps you build applications with language models. "
        "Sometimes texts are too long to process at once, so splitting them is useful. "
        "RecursiveCharacterTextSplitter splits text by trying different separators like paragraphs, lines, and spaces. "
        "It keeps chunks under a set size and overlaps for context.")

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (length {len(chunk)}): {chunk}")
Output
Exact chunk boundaries depend on the installed LangChain version. Each printed chunk is at most 50 characters long, because the text splits on spaces and no single word exceeds 50 characters. Consecutive chunks usually repeat a word or two, since chunk_overlap=10 carries up to 10 characters forward.

Common Pitfalls

Common mistakes when using RecursiveCharacterTextSplitter include:

  • Setting chunk_size too small, causing too many tiny chunks.
  • Setting chunk_overlap larger than chunk_size, which raises a ValueError when the splitter is created.
  • Not providing appropriate separators, leading to awkward splits.
  • Using split_text on document objects instead of split_documents.

Always check your chunk sizes and overlaps to ensure they make sense.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Wrong: overlap larger than chunk size
try:
    splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=60)
except ValueError as e:
    print(f"Error: {e}")

# Right: overlap smaller than chunk size
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
Output
Error: Got a larger chunk overlap (60) than chunk size (50), should be smaller.

Quick Reference

Tips for using RecursiveCharacterTextSplitter effectively:

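  • Choose chunk_size based on your model's maximum input length, keeping in mind that chunk_size counts characters by default while model limits are usually stated in tokens.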
  • Use chunk_overlap to keep context between chunks, typically 10-20% of chunk size.
  • Customize separators to match your text structure (paragraphs, lines, spaces).
  • Use split_text for plain strings and split_documents for LangChain Document objects.
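
As a small worked example of the 10-20% guideline above, the sketch below derives chunk_overlap from chunk_size using integer arithmetic. The helper name pick_overlap and its percent parameter are hypothetical and not part of LangChain.

```python
def pick_overlap(chunk_size, percent=15):
    """Return a chunk_overlap equal to the given percentage of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100) so overlap stays below chunk_size")
    return chunk_size * percent // 100


for size in (200, 500, 1000):
    # Print the overlap at the low and high end of the 10-20% range.
    print(size, pick_overlap(size, 10), pick_overlap(size, 20))
```

For chunk_size=1000 this gives an overlap between 100 and 200 characters, which matches the chunk_overlap=100 used in the Syntax section.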

Key Takeaways

  • Use RecursiveCharacterTextSplitter to split long texts into manageable chunks with overlap for context.
  • Keep chunk_overlap smaller than chunk_size; a larger overlap raises a ValueError.
  • Avoid a very small chunk_size, which produces many tiny chunks.
  • Customize separators to split text naturally by paragraphs, lines, or spaces.
  • Use split_text for strings and split_documents for document objects.
  • Test chunk outputs to ensure splits make sense for your application.
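
One way to test chunk outputs is a small report over the strings returned by split_text. The helper below is a hypothetical sketch (check_chunks is not a LangChain function) and works on any list of strings. Oversized chunks can occur, for example, when a single word is longer than chunk_size and the separators list has no empty-string fallback.

```python
def check_chunks(chunks, chunk_size):
    """Summarize a list of string chunks and flag any that look wrong."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(chunks),
        "max_length": max(lengths, default=0),
        # Indexes of chunks that exceed the configured size.
        "oversized": [i for i, n in enumerate(lengths) if n > chunk_size],
        # Indexes of chunks that are empty strings.
        "empty": [i for i, n in enumerate(lengths) if n == 0],
    }


report = check_chunks(["alpha beta", "gamma", ""], chunk_size=8)
print(report)
```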