RecursiveCharacterTextSplitter in LangChain

Introduction

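The RecursiveCharacterTextSplitter breaks long text into smaller chunks. It works through a list of separators in order: it splits on the first one, and only pieces that are still too long are split again with the next, finer separator. This keeps related text (paragraphs, then sentences, then words) together for as long as possible.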

Use it in situations like these:
When you have a long document and want to split it into smaller parts for easier processing.
When you want to keep sentences or paragraphs intact where possible.
When you need to prepare text for language models that have input size limits.
When you want to split text by paragraphs first, then by sentences, and finally by words if needed.
When you want to avoid cutting text in the middle of words or sentences.
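The step-by-step splitting idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, separators are dropped where a chunk ends rather than kept, and no overlap is added.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch: split on the coarsest separator, merge neighboring
    pieces while they fit, and recurse with finer separators only for pieces
    that are still longer than chunk_size. No overlap is produced."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing coarser left to try: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces left by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph here.\n\nSecond one is a bit longer than the first."
chunks = recursive_split(text, ["\n\n", " "], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits within the limit and stays whole; only the second, longer paragraph is split again, this time on spaces.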
Syntax
LangChain
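# Recent LangChain releases ship the splitters in the langchain-text-splitters package.
# On older releases, import from langchain.text_splitter instead.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum chunk length (characters by default)
    chunk_overlap=200,  # characters that may repeat between neighboring chunks
    # Tried in order, coarsest first; "" falls back to single characters.
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
)

# long_text is any Python string you want to split.
chunks = text_splitter.split_text(long_text)  # returns a list of strings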

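The chunk_size sets the maximum length of each chunk. Length is measured in characters by default, because the splitter's length_function defaults to len.

The chunk_overlap sets how many characters from the end of one chunk may be repeated at the start of the next, so context carries across chunk boundaries. It should be smaller than chunk_size.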
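To see what overlap means in isolation, here is a minimal sketch using fixed-width character windows. It is an illustration only: `sliding_chunks` is a hypothetical helper, and the real splitter cuts at separators rather than at fixed positions, so its overlap is usually somewhat less than the configured maximum.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows where each window repeats the last
    chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the one before it, which is the role chunk_overlap plays in the splitter.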

Examples
Splits text into chunks of up to 500 characters with up to 50 characters of overlap, using the default separators (["\n\n", "\n", " ", ""]).
LangChain
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(long_text)
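Splits on newlines first, then periods, then spaces, with no overlap. Because the list does not end with "", a single word longer than 300 characters would be kept whole rather than cut.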
LangChain
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
    separators=["\n", ".", " "]
)
chunks = text_splitter.split_text(long_text)
Splits on blank lines (paragraphs) first, then on periods, with up to 100 characters of overlap.
LangChain
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, separators=["\n\n", "."])
chunks = text_splitter.split_text(long_text)
Sample Program

This program splits a three-paragraph text into chunks of at most 50 characters, with up to 10 characters of overlap between neighboring chunks to keep context. It tries paragraph breaks first, then sentence punctuation, then commas and spaces, and finally single characters.

LangChain
# On older LangChain releases, import from langchain.text_splitter instead.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_text = (
    "This is the first paragraph. It has two sentences.\n\n"
    "Here is the second paragraph! It also has sentences? Yes, it does.\n\n"
    "Finally, the third paragraph is here."
)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    separators=["\n\n", ".", "!", "?", ",", " ", ""]
)

chunks = text_splitter.split_text(long_text)

print("Number of chunks:", len(chunks))
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:", repr(chunk))
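To sanity-check the chunks a splitter returns, a small helper can report how many there are and whether any exceed the limit. This is a sketch in plain Python; `summarize_chunks` and the example strings are hypothetical and not part of LangChain.

```python
def summarize_chunks(chunks, chunk_size):
    """Report the chunk count, shortest and longest lengths, and the indexes
    of any chunks longer than chunk_size."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(chunks),
        "min_len": min(lengths, default=0),
        "max_len": max(lengths, default=0),
        "oversized": [i for i, n in enumerate(lengths) if n > chunk_size],
    }


# Stand-in data; in practice pass the list returned by split_text.
example_chunks = ["This is the first paragraph", "It has two sentences", "x" * 60]
summary = summarize_chunks(example_chunks, chunk_size=50)
print(summary)
```

A non-empty "oversized" list usually means the separator list ran out before a piece became small enough, for example when the list does not end with "".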
Important Notes

Separators are tried in the order given, so list them from coarsest (paragraph breaks) to finest (spaces or single characters).
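Why order matters can be shown with a small sketch of the selection step: at each level, the first separator in the list that actually occurs in the text is used, and only the separators after it remain for finer splitting. The helper name `pick_separator` is hypothetical and the logic is a simplification, not LangChain's code.

```python
def pick_separator(text, separators):
    """Return the first separator that occurs in text, plus the finer
    separators that remain for any pieces that are still too long."""
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            return sep, separators[i + 1:]
    return "", []  # simplification: nothing matched, fall back to characters


text = "one. two\nthree"

# Coarse-to-fine order: the newline is chosen before the period.
first_choice = pick_separator(text, ["\n\n", "\n", ".", " "])
print(first_choice)   # ('\n', ['.', ' '])

# Period listed first: sentences are cut before lines are considered.
second_choice = pick_separator(text, [".", "\n", " "])
print(second_choice)  # ('.', ['\n', ' '])
```

With the same text, reordering the list changes which boundary is cut first, and therefore which pieces stay together.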

Running time grows with the length of the text and the number of separators tried; for ordinary documents, splitting is rarely the bottleneck.

Common mistake: setting chunk_size too small produces many tiny chunks, each carrying too little context to be useful on its own.
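A rough way to see how chunk_size drives the number of chunks is to estimate the count for fixed-width windows. This is back-of-the-envelope arithmetic under that assumption, with a hypothetical helper name; the real splitter cuts at separators, so its chunks are often shorter and the actual count is usually higher.

```python
import math


def estimate_chunk_count(text_length, chunk_size, chunk_overlap=0):
    """Estimate how many fixed-width windows cover text_length characters
    when each window advances by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    step = chunk_size - chunk_overlap
    return math.ceil((text_length - chunk_size) / step) + 1


large = estimate_chunk_count(10_000, chunk_size=1000, chunk_overlap=200)
small = estimate_chunk_count(10_000, chunk_size=50, chunk_overlap=10)
print(large, small)  # 13 250
```

For the same 10,000-character text, shrinking chunk_size from 1000 to 50 raises the estimate from 13 chunks to 250.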

Summary

RecursiveCharacterTextSplitter breaks text into manageable chunks using multiple separators.

It keeps context by overlapping parts of chunks.

Useful for preparing text for language models or any tool with input size limits.