How to Use a Text Splitter in LangChain: A Simple Guide
In LangChain, use RecursiveCharacterTextSplitter or another text splitter class to divide large text into smaller chunks. Initialize the splitter with parameters such as chunk_size and chunk_overlap, then call split_text() to get the chunks as a list of strings.
Syntax
The basic syntax involves importing a text splitter class like RecursiveCharacterTextSplitter, creating an instance with settings, and calling split_text() on your input text.
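- chunk_size: Maximum size of each text chunk, measured in characters by default.
- chunk_overlap: Maximum number of characters shared between consecutive chunks, so context carries across chunk boundaries. It must not be larger than chunk_size.
- split_text(text): Method that splits the input string and returns the chunks as a list of strings.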
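```python
# Depending on the installed version, this class may instead be imported with:
# from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sizes are counted in characters by default
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# split_text() returns a list of strings
chunks = splitter.split_text("Your long text here...")
```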
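To make the two parameters concrete, here is a deliberately simplified sketch in plain Python. The helper `sliding_window_chunks` is hypothetical and is not part of LangChain: it cuts fixed-size character windows, whereas LangChain's splitters cut on separators such as newlines and spaces, so their overlap is an upper bound rather than an exact count.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows of chunk_size that share chunk_overlap characters.

    Simplified illustration only; this is not how LangChain splits text.
    """
    # This sketch needs a positive step, so the overlap must be strictly smaller.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window starts chunk_size - chunk_overlap characters after the previous one, which is why the last three characters of one chunk reappear at the start of the next.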
Example
This example shows how to split a long paragraph into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk separately.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "Langchain helps you build applications with language models. "
    "Sometimes texts are too long to process at once, so splitting them helps. "
    "This splitter breaks text into chunks with overlap for better context."
)

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
```
Output
Chunk 1: Langchain helps you build applications with
Chunk 2: with language models. Sometimes texts are too
Chunk 3: are too long to process at once, so splitting
Chunk 4: splitting them helps. This splitter breaks text
Chunk 5: text into chunks with overlap for better context.
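Printed chunks are easy to misread, so it can help to check them programmatically. The sketch below is one possible way to do that in plain Python; `shared_overlap` and `inspect_chunks` are hypothetical helper names, and the sample list is made up for illustration. In practice you would pass in the list returned by split_text().

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


def inspect_chunks(chunks: list[str], chunk_size: int) -> list[dict]:
    """Report each chunk's length, whether it fits chunk_size, and its overlap with the previous chunk."""
    report = []
    previous = None
    for chunk in chunks:
        overlap = shared_overlap(previous, chunk) if previous is not None else ""
        report.append(
            {
                "length": len(chunk),
                "within_limit": len(chunk) <= chunk_size,
                "overlap": overlap,
            }
        )
        previous = chunk
    return report


# Made-up chunks for illustration; replace with splitter.split_text(text).
sample_chunks = ["the quick brown fox", "brown fox jumps over", "over the lazy dog"]
report = inspect_chunks(sample_chunks, chunk_size=20)
for row in report:
    print(row)
```

A chunk with `within_limit` set to False, or an overlap much longer than chunk_overlap, is a sign that the settings or the expected output deserve a second look.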
Common Pitfalls
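Common mistakes include setting chunk_size too small, which produces many tiny fragments with little context, or too large, which produces a few chunks that may still be too long for the downstream model. Setting chunk_overlap to 0 can lose context between chunks, and setting it larger than chunk_size raises an error. Also, using the wrong splitter class for your text type may reduce effectiveness.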
Always test your splitter settings with sample text to find the best balance.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = "This is a sample text to split."

# Wrong: chunk_size too small, no overlap
splitter_wrong = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=0)
chunks_wrong = splitter_wrong.split_text(sample)

# Right: reasonable chunk_size and overlap
splitter_right = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)
chunks_right = splitter_right.split_text(sample)
```
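One way to test settings on sample text is to compare simple statistics across candidate chunk sizes. The sketch below assumes nothing about LangChain: `chunk_stats` and `compare_settings` are hypothetical helpers, and `textwrap.wrap` from the standard library stands in for a real splitter so the example runs on its own.

```python
import textwrap
from statistics import mean
from typing import Callable


def chunk_stats(chunks: list[str]) -> dict:
    """Summarize how many chunks there are and how long they are."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
    }


def compare_settings(
    text: str,
    chunk_sizes: list[int],
    split_fn: Callable[[str, int], list[str]],
) -> dict:
    """Run split_fn once per candidate chunk size and collect the statistics."""
    return {size: chunk_stats(split_fn(text, size)) for size in chunk_sizes}


sample = " ".join(["word"] * 100)

# Stand-in splitter: textwrap.wrap breaks on spaces and keeps lines <= width.
results = compare_settings(
    sample,
    chunk_sizes=[20, 50],
    split_fn=lambda text, size: textwrap.wrap(text, width=size),
)
for size, stats in results.items():
    print(size, stats)

# With LangChain installed, split_fn could instead be something like:
# lambda text, size: RecursiveCharacterTextSplitter(
#     chunk_size=size, chunk_overlap=size // 5
# ).split_text(text)
```

A very high count with a low mean length suggests chunk_size is too small for the text; a count of one or two suggests it is too large.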
Quick Reference
| Name | Description | Example |
|---|---|---|
| chunk_size | Max characters per chunk | 1000 |
| chunk_overlap | Max characters repeated between consecutive chunks | 200 |
| split_text(text) | Method that splits input text into a list of strings | splitter.split_text(text) |
| RecursiveCharacterTextSplitter | Common splitter class for general text | RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200) |
Key Takeaways
- Use RecursiveCharacterTextSplitter with chunk_size and chunk_overlap to split text effectively.
- Set chunk_overlap to keep context between chunks and improve downstream processing.
- Test splitter settings on sample text to avoid too many or too few chunks.
- Choose the right splitter class based on your text type and use case.
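Why the text type matters becomes clearer with a rough picture of what a recursive splitter does: it tries coarse separators first and falls back to finer ones only for pieces that are still too long. The following is a simplified sketch of that idea in plain Python, not LangChain's actual implementation; the function name `recursive_split` and the default separator list are assumptions for illustration, and overlap handling is left out.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split on the coarsest separator first, recursing into oversized pieces.

    Simplified illustration only: no overlap, and separators are dropped
    at chunk boundaries.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Greedily merge neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "para one is short.\n\nthis second paragraph is a good deal longer than the limit"
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

The short paragraph survives intact because the paragraph separator was enough, while the long one is broken at spaces. Text whose natural boundaries are something else, such as code or Markdown headings, is the case where a different splitter class or separator list is worth considering.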