
How to Use a Text Splitter in LangChain: A Simple Guide

In LangChain, use RecursiveCharacterTextSplitter or another text splitter class to divide large text into smaller chunks. Initialize the splitter with parameters such as chunk_size and chunk_overlap, then call split_text() to get the chunks as a list of strings.

Syntax

The basic syntax involves importing a text splitter class like RecursiveCharacterTextSplitter, creating an instance with settings, and calling split_text() on your input text.

  • chunk_size: Maximum size of each chunk, measured in characters by default.
  • chunk_overlap: Maximum number of characters repeated between consecutive chunks to preserve context.
  • split_text(text): Method that splits the input text and returns a list of string chunks.
python
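# Newer LangChain releases ship this class in the separate
# langchain_text_splitters package; adjust the import if this one fails.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text("Your long text here...")  # returns a list of strings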
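
To build intuition for how chunk_size and chunk_overlap interact, here is a minimal pure-Python sketch of fixed-size windows with overlap. The helper name sliding_window_chunks is hypothetical, and this is deliberately not how RecursiveCharacterTextSplitter works internally (that class prefers to break on separators such as paragraphs, lines, and spaces); it only illustrates the two parameters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical helper: cut text into fixed-size character windows.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbours share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk after the first begins with the last three characters of the previous one, which is the role chunk_overlap plays in the real splitter as well.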

Example

This example shows how to split a long paragraph into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk separately.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = ("Langchain helps you build applications with language models. "
        "Sometimes texts are too long to process at once, so splitting them helps. "
        "This splitter breaks text into chunks with overlap for better context.")

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
Output
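Chunk 1: Langchain helps you build applications with language
Chunk 2: applications with language models. Sometimes texts are too
Chunk 3: texts are too long to process at once, so splitting them
Chunk 4: splitting them helps. This splitter breaks text into chunks
Chunk 5: chunks with overlap for better context.

Note: treat this printed output as approximate. With chunk_size=50 and chunk_overlap=10, real chunks should not exceed 50 characters, so expect shorter chunks and smaller overlaps than shown here. Exact boundaries depend on your LangChain version, so run the code to see them.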
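
To see how much text consecutive chunks actually share, a small helper can measure it. The sketch below is pure Python and does not call LangChain; the function name shared_overlap and the sample chunks are hypothetical and hand-written, not real splitter output.

```python
def shared_overlap(previous: str, current: str) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `current`."""
    max_len = min(len(previous), len(current))
    # Try the longest candidate first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if previous.endswith(current[:size]):
            return current[:size]
    return ""


# Hand-written sample chunks for illustration only.
sample_chunks = [
    "the quick brown fox",
    "brown fox jumps over",
    "over the lazy dog",
]

overlaps = [shared_overlap(a, b) for a, b in zip(sample_chunks, sample_chunks[1:])]
print(overlaps)
# ['brown fox', 'over']
```

Passing the list returned by split_text() in place of sample_chunks should show whether the overlap you configured is really being carried between chunks.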

Common Pitfalls

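Common mistakes include setting chunk_size too small, which produces many tiny chunks with little context in each, or too large, which produces only a few chunks that may be too big for later processing. Setting chunk_overlap to 0 can lose context that spans a chunk boundary, and chunk_overlap must not be larger than chunk_size. Using a splitter class that does not match your text type may also reduce effectiveness.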

Always test your splitter settings with sample text to find the best balance.
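
As a rough guide to how the settings trade off against chunk count, the arithmetic for fixed-size windows can be sketched as below. The function estimate_chunk_count is a hypothetical helper, and the result is only an estimate: a separator-aware splitter such as RecursiveCharacterTextSplitter usually produces chunks shorter than chunk_size, so the real count tends to be somewhat higher.

```python
def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Hypothetical helper: number of fixed-size windows needed to cover the text."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1

    step = chunk_size - chunk_overlap
    # Integer ceiling of (text_length - chunk_overlap) / step.
    return (text_length - chunk_overlap + step - 1) // step


# Compare a few settings for a 10,000-character text.
estimates = {
    (size, overlap): estimate_chunk_count(10_000, size, overlap)
    for size, overlap in [(100, 0), (1000, 200), (5000, 200)]
}
for (size, overlap), count in estimates.items():
    print(f"chunk_size={size}, chunk_overlap={overlap}: about {count} chunks")
```

A larger overlap shrinks the step between windows, so it raises the chunk count even when chunk_size stays the same.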

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Wrong: chunk_size too small, no overlap
splitter_wrong = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=0)
chunks_wrong = splitter_wrong.split_text("This is a sample text to split.")

# Right: reasonable chunk_size and overlap
splitter_right = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)
chunks_right = splitter_right.split_text("This is a sample text to split.")
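
One way to compare two configurations like these is to summarize the chunks each one produces. The sketch below is pure Python and works on any list of strings; summarize_chunks is a hypothetical helper and the sample list is hand-written, not real splitter output.

```python
def summarize_chunks(chunks: list[str]) -> dict:
    """Hypothetical helper: basic size statistics for a list of text chunks."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min_len": 0, "max_len": 0, "avg_len": 0.0}
    return {
        "count": len(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
    }


# Hand-written sample chunks for illustration only.
sample = ["This is a", "sample text", "to split."]
summary = summarize_chunks(sample)
print(summary)
```

Calling summarize_chunks(chunks_wrong) and summarize_chunks(chunks_right) side by side should make it easier to spot a setting that yields many very short chunks.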

Quick Reference

| Name | Description | Example Value |
| --- | --- | --- |
| chunk_size | Max characters per chunk | 1000 |
| chunk_overlap | Characters to repeat between chunks | 200 |
| split_text(text) | Method to split input text | splitter.split_text(text) |
| RecursiveCharacterTextSplitter | Common splitter class for general text | Used in examples |

Key Takeaways

  • Use RecursiveCharacterTextSplitter with chunk_size and chunk_overlap to split text effectively.
  • Set chunk_overlap to keep context between chunks and improve downstream processing.
  • Test splitter settings on sample text to avoid too many or too few chunks.
  • Choose the right splitter class based on your text type and use case.