How to Split Documents in LangChain: Simple Guide
In LangChain, you split documents using TextSplitter classes such as RecursiveCharacterTextSplitter, which break text into smaller chunks. You create a splitter instance, then call split_text or split_documents to get manageable pieces for processing.
Syntax
LangChain provides several TextSplitter classes to divide large documents into smaller chunks. The common pattern is:
- Create a splitter instance, e.g., RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).
- Use split_text(text) to split a plain string, or split_documents(documents) to split a list of Document objects.
This helps manage large texts for tasks like embeddings or question answering.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# To split plain text
chunks = splitter.split_text("Your long text here...")

# To split LangChain Document objects
# chunks = splitter.split_documents(list_of_documents)
```
Example
This example shows how to split a long text into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk so you can see the split parts.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "LangChain helps you build applications with language models. "
    "Sometimes documents are too long to process at once, so splitting them into chunks is useful. "
    "This example splits the text into smaller pieces with some overlap."
)

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:", chunk)
```
Output
```
Chunk 1: LangChain helps you build applications with language
Chunk 2: applications with language models. Sometimes documents are
Chunk 3: Sometimes documents are too long to process at once, so
Chunk 4: at once, so splitting them into chunks is useful. This
Chunk 5: This example splits the text into smaller pieces with some
Chunk 6: with some overlap.
```
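The overlap visible in the output above can be sketched with a plain-Python sliding window. This is a simplified illustration of how chunk_size and chunk_overlap interact, not LangChain's actual algorithm (RecursiveCharacterTextSplitter additionally prefers to break on separators such as paragraphs and spaces):

```python
def naive_split(text, chunk_size, chunk_overlap):
    # Simplified illustration only: slide a fixed-size window across the text,
    # stepping forward by (chunk_size - chunk_overlap) so consecutive chunks
    # share chunk_overlap characters. LangChain's real splitter also respects
    # separator boundaries instead of cutting mid-word.
    assert chunk_overlap < chunk_size
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = naive_split("abcdefghij" * 5, chunk_size=20, chunk_overlap=5)
print(len(chunks))                      # -> 3
print(chunks[0][-5:] == chunks[1][:5])  # -> True: adjacent chunks share 5 chars
```

Because each chunk repeats the tail of the previous one, a sentence cut at a chunk boundary still appears intact in at least one chunk, which is why overlap helps preserve context.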
Common Pitfalls
Common mistakes when splitting documents in LangChain include:
- Setting chunk_size too small or too large, which can produce too many or too few chunks.
- Not using chunk_overlap, which can lose context between chunks.
- Trying to split non-text data without converting it to a string first.
- Using split_text on Document objects instead of split_documents.
Always choose chunk sizes based on your model's token limits and task needs.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Wrong: using split_text on Document objects
# chunks = splitter.split_text(list_of_documents)  # This will error

# Right: use split_documents for Document objects
# chunks = splitter.split_documents(list_of_documents)
```
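The advice above about choosing chunk sizes from your model's token limit can be sketched numerically. The roughly-4-characters-per-token ratio is a common approximation for English text, and the helper below is hypothetical, not part of LangChain:

```python
# Rough rule of thumb for sizing character chunks against a model token budget.
# Assumes ~4 characters per English token (an approximation only), and
# chunk_size_for_token_budget is an illustrative helper, not a LangChain API.
def chunk_size_for_token_budget(token_budget, chars_per_token=4, safety=0.9):
    # safety < 1.0 leaves headroom so chunks stay under the budget
    # even when the real tokenizer is less efficient than assumed.
    return int(token_budget * chars_per_token * safety)

print(chunk_size_for_token_budget(500))  # -> 1800
```

For precise limits, measure with your model's actual tokenizer rather than a character estimate; the character-based chunk_size is only a proxy for tokens.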
Quick Reference
| Method | Description |
|---|---|
| RecursiveCharacterTextSplitter(chunk_size, chunk_overlap) | Splitter that recursively breaks text on separators (paragraphs, lines, spaces) down to chunk_size, keeping chunk_overlap characters of overlap for context. |
| split_text(text: str) | Splits a plain string into chunks. |
| split_documents(documents: List[Document]) | Splits a list of Document objects into smaller Document chunks. |
| chunk_size | Maximum size of each chunk in characters. |
| chunk_overlap | Number of characters to overlap between chunks to keep context. |
Key Takeaways
- Use LangChain's TextSplitter classes like RecursiveCharacterTextSplitter to split long texts.
- Set chunk_size and chunk_overlap thoughtfully to balance chunk size and context.
- Use split_text for plain strings and split_documents for Document objects.
- Avoid too-small chunk sizes to prevent too many fragments.
- Splitting helps manage large documents for language model processing.