LangChain · How-To · Beginner · 3 min read

How to Split Documents in LangChain: Simple Guide

In LangChain, you split documents using TextSplitter classes such as RecursiveCharacterTextSplitter, which break text into smaller chunks. You create a splitter instance, then call split_text or split_documents to get manageable pieces for processing.
📝

Syntax

LangChain provides several TextSplitter classes to divide large documents into smaller chunks. The common pattern is:

  • Create a splitter instance, e.g., RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).
  • Use split_text(text) to split a plain string or split_documents(documents) to split a list of document objects.

This helps manage large texts for tasks like embeddings or question answering.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# To split plain text
chunks = splitter.split_text("Your long text here...")

# To split LangChain Document objects
# chunks = splitter.split_documents(list_of_documents)
💻

Example

This example shows how to split a long text into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk so you can see the split parts.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = ("LangChain helps you build applications with language models. "
        "Sometimes documents are too long to process at once, so splitting them into chunks is useful. "
        "This example splits the text into smaller pieces with some overlap.")

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:", chunk)
Output
Chunk 1: LangChain helps you build applications with language
Chunk 2: applications with language models. Sometimes documents are
Chunk 3: Sometimes documents are too long to process at once, so
Chunk 4: at once, so splitting them into chunks is useful. This
Chunk 5: This example splits the text into smaller pieces with some
Chunk 6: with some overlap.
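Notice that the chunk boundaries above fall on whitespace: RecursiveCharacterTextSplitter tries a list of separators (paragraphs, then newlines, then spaces) before resorting to hard character cuts. Here is a simplified pure-Python sketch of that greedy strategy, written for illustration only; it is not LangChain's actual implementation and ignores overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Greedy sketch of recursive character splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        parts = text.split(sep)
        chunks, current = [], ""
        for part in parts:
            candidate = part if not current else current + sep + part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # Recurse in case a single part is still longer than chunk_size
        return [piece for chunk in chunks
                for piece in recursive_split(chunk, chunk_size, separators)]
    # No separator left: fall back to hard character cuts
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

print(recursive_split("aaa bbb ccc ddd", 7))  # ['aaa bbb', 'ccc ddd']
```

The sketch packs parts back together greedily up to chunk_size, which is why real chunks tend to end at word boundaries rather than mid-word.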
⚠️

Common Pitfalls

Common mistakes when splitting documents in LangChain include:

  • Setting chunk_size too small or too large: tiny chunks fragment meaning across many pieces, while oversized chunks may exceed your model's context window.
  • Not using chunk_overlap, which can lose context between chunks.
  • Trying to split non-text data without converting it to string first.
  • Using split_text on Document objects instead of split_documents.

Always choose chunk sizes based on your model's token limits and task needs.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Wrong: Using split_text on Document objects
# chunks = splitter.split_text(list_of_documents)  # Errors: split_text expects a string

# Right: Use split_documents for Document objects
# chunks = splitter.split_documents(list_of_documents)
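Choosing chunk_size against a model's token limit is easier with a rough estimate. A common rule of thumb for English text is about 4 characters per token; this is an approximation, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# If you want chunks of roughly 500 tokens, start with a
# chunk_size of about 500 * 4 = 2000 characters and adjust.
chunk_size_chars = 500 * 4
print(chunk_size_chars)            # 2000
print(estimate_tokens("x" * 2000)) # 500
```

For precise budgeting, measure with your model's actual tokenizer instead of this heuristic.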
📊

Quick Reference

| Method | Description |
| --- | --- |
| RecursiveCharacterTextSplitter(chunk_size, chunk_overlap) | Splits text by characters with overlap for context. |
| split_text(text: str) | Splits a plain string into chunks. |
| split_documents(documents: List[Document]) | Splits a list of Document objects into smaller Document chunks. |
| chunk_size | Maximum size of each chunk in characters. |
| chunk_overlap | Number of characters to overlap between chunks to keep context. |
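The chunk_overlap behavior can be illustrated with a plain sliding window. This is a simplification for intuition only: LangChain's splitter also respects separators, which this sketch ignores:

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size sliding window: each chunk repeats the last chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The repeated characters at each boundary are what preserve context between neighboring chunks.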
✅

Key Takeaways

  • Use LangChain's TextSplitter classes like RecursiveCharacterTextSplitter to split long texts.
  • Set chunk_size and chunk_overlap thoughtfully to balance chunk size and context.
  • Use split_text for plain strings and split_documents for Document objects.
  • Avoid too-small chunk sizes to prevent too many fragments.
  • Splitting helps manage large documents for language model processing.