How to Use a Text Splitter in LangChain: A Simple Guide

In LangChain, text splitter classes such as RecursiveCharacterTextSplitter divide large text into smaller chunks. Initialize the splitter with parameters like chunk_size and chunk_overlap, then call split_text() to get the chunks back as a list of strings.

📐

Syntax

The basic syntax involves importing a text splitter class like RecursiveCharacterTextSplitter, creating an instance with settings, and calling split_text() on your input text.

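chunk_size: Maximum length of each chunk, measured in characters by default.
chunk_overlap: Maximum number of characters shared between neighboring chunks, so context carries across the boundary.
split_text(text): Method that splits the input string and returns a list of string chunks.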

python

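# Newer LangChain releases ship this class in the langchain_text_splitters package:
# from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter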

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text("Your long text here...")
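To build intuition for what chunk_size and chunk_overlap control, here is a minimal pure-Python sketch of a fixed-width sliding window. It is not LangChain's algorithm: RecursiveCharacterTextSplitter prefers to break on natural boundaries such as paragraphs, lines, and spaces, while this hypothetical naive_chunks helper cuts at exact character positions. It only illustrates how the two numbers interact.

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters.

    Illustration only; this is NOT how LangChain's splitters choose boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(naive_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each window starts chunk_size - chunk_overlap characters after the previous one, which is why a larger overlap produces more chunks for the same text.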

💻

Example

This example shows how to split a long paragraph into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk separately.

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = ("Langchain helps you build applications with language models. "
        "Sometimes texts are too long to process at once, so splitting them helps. "
        "This splitter breaks text into chunks with overlap for better context.")

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")

Output

Chunk 1: Langchain helps you build applications with
Chunk 2: with language models. Sometimes texts are too
Chunk 3: are too long to process at once, so splitting
Chunk 4: splitting them helps. This splitter breaks text
Chunk 5: text into chunks with overlap for better context.

Each chunk stays within the 50-character limit, and neighboring chunks share whole words totaling at most 10 characters. Exact boundaries can differ between LangChain versions.
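To see how much text neighboring chunks actually share, a small helper can compare each chunk with the one before it. The sketch below is plain Python; shared_overlap is a hypothetical name, and sample_chunks is made-up data standing in for the list returned by split_text().

```python
def shared_overlap(previous: str, current: str) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `current`."""
    limit = min(len(previous), len(current))
    # Try the longest candidate first so the first match is the longest one.
    for size in range(limit, 0, -1):
        if previous.endswith(current[:size]):
            return current[:size]
    return ""


# Made-up chunks for illustration; pass your own split_text() result instead.
sample_chunks = ["the quick brown", "brown fox jumps", "jumps over the dog"]

for left, right in zip(sample_chunks, sample_chunks[1:]):
    print(repr(shared_overlap(left, right)))
# 'brown'
# 'jumps'
```

If the shared text is consistently empty, the configured chunk_overlap is probably too small to fit a whole word at the chunk boundaries.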

⚠️

Common Pitfalls

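Common mistakes include setting chunk_size too small, which fragments sentences into many tiny chunks, or too large, which yields a few oversized chunks that may not fit your model's input limit. Setting chunk_overlap to 0 drops the shared context between neighboring chunks, and a chunk_overlap larger than chunk_size makes the constructor raise a ValueError. Also, using the wrong splitter class for your text type may reduce effectiveness.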

Always test your splitter settings with sample text to find the best balance.

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = "This is a sample text to split."

# Wrong: chunk_size too small and no overlap, so the text is cut into tiny fragments
splitter_wrong = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=0)
chunks_wrong = splitter_wrong.split_text(sample)
print(len(chunks_wrong), chunks_wrong)

# Better for this short sample: larger chunks with a small overlap
splitter_right = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)
chunks_right = splitter_right.split_text(sample)
print(len(chunks_right), chunks_right)
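When testing settings on sample text, a quick summary of the resulting chunks makes the comparison easier than reading them one by one. This is a minimal sketch in plain Python; describe_chunks is a hypothetical helper, not part of LangChain, and it accepts any list of strings such as the result of split_text().

```python
from statistics import mean


def describe_chunks(chunks: list[str], chunk_size: int) -> dict:
    """Summarize chunk count and lengths, and count chunks longer than chunk_size."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(chunks),
        "longest": max(lengths, default=0),
        "average": round(mean(lengths), 1) if lengths else 0.0,
        "over_limit": sum(1 for n in lengths if n > chunk_size),
    }


# Made-up chunks for illustration; pass a real split_text() result instead.
print(describe_chunks(["This is a", "sample text", "to split."], chunk_size=12))
# {'count': 3, 'longest': 11, 'average': 9.7, 'over_limit': 0}
```

Calling it on both chunks_wrong and chunks_right shows at a glance which setting produces fewer, fuller chunks.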

📊

Quick Reference

Name	Description	Example
chunk_size	Max characters per chunk	1000
chunk_overlap	Characters to repeat between chunks	200
split_text(text)	Method to split input text	splitter.split_text(text)
RecursiveCharacterTextSplitter	Common splitter class for general text	Used in examples

✅

Key Takeaways

Use RecursiveCharacterTextSplitter with chunk_size and chunk_overlap to split text effectively.

Set chunk_overlap to keep context between chunks and improve downstream processing.

Test splitter settings on sample text to avoid too many or too few chunks.

Choose the right splitter class based on your text type and use case.