How to Split Documents in LangChain: Simple Guide
In LangChain, you split documents using TextSplitter classes such as RecursiveCharacterTextSplitter, which break text into smaller chunks. You create a splitter instance, then call split_text or split_documents to get manageable pieces for processing.
Syntax
LangChain provides several TextSplitter classes to divide large documents into smaller chunks. The common pattern is:
- Create a splitter instance, e.g., RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).
- Use split_text(text) to split a plain string, or split_documents(documents) to split a list of Document objects.
This helps manage large texts for tasks like embeddings or question answering.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# To split plain text
chunks = splitter.split_text("Your long text here...")

# To split LangChain Document objects
# chunks = splitter.split_documents(list_of_documents)
```
Example
This example shows how to split a long text into smaller chunks using RecursiveCharacterTextSplitter. It prints each chunk so you can see the split parts.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "LangChain helps you build applications with language models. "
    "Sometimes documents are too long to process at once, so splitting them into chunks is useful. "
    "This example splits the text into smaller pieces with some overlap."
)

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:", chunk)
```
Output
```
Chunk 1: LangChain helps you build applications with language
Chunk 2: applications with language models. Sometimes documents are
Chunk 3: Sometimes documents are too long to process at once, so
Chunk 4: at once, so splitting them into chunks is useful. This
Chunk 5: This example splits the text into smaller pieces with some
Chunk 6: with some overlap.
```
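The overlap visible in the output above can be sketched with a plain-Python sliding window. This is a simplified illustration of how chunk_size and chunk_overlap interact, not LangChain's actual algorithm (RecursiveCharacterTextSplitter additionally prefers to break on separators such as paragraphs and spaces):

```python
def naive_split(text, chunk_size, chunk_overlap):
    # Simplified illustration only: slide a fixed-size window across the text,
    # stepping forward by (chunk_size - chunk_overlap) so consecutive chunks
    # share chunk_overlap characters. LangChain's real splitter also respects
    # separator boundaries instead of cutting mid-word.
    assert chunk_overlap < chunk_size
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = naive_split("abcdefghij" * 5, chunk_size=20, chunk_overlap=5)
print(len(chunks))                      # -> 3
print(chunks[0][-5:] == chunks[1][:5])  # -> True: adjacent chunks share 5 chars
```

Because each chunk repeats the tail of the previous one, a sentence cut at a chunk boundary still appears intact in at least one chunk, which is why overlap helps preserve context.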
Common Pitfalls
Common mistakes when splitting documents in LangChain include:
- Setting chunk_size too small or too large, which can produce too many or too few chunks.
- Not using chunk_overlap, which can lose context between chunks.
- Trying to split non-text data without converting it to a string first.
- Using split_text on Document objects instead of split_documents.
Always choose chunk sizes based on your model's token limits and task needs.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Wrong: using split_text on Document objects
# chunks = splitter.split_text(list_of_documents)  # This will error

# Right: use split_documents for Document objects
# chunks = splitter.split_documents(list_of_documents)
```
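The advice above about choosing chunk sizes from your model's token limit can be sketched numerically. The roughly-4-characters-per-token ratio is a common approximation for English text, and the helper below is hypothetical, not part of LangChain:

```python
# Rough rule of thumb for sizing character chunks against a model token budget.
# Assumes ~4 characters per English token (an approximation only), and
# chunk_size_for_token_budget is an illustrative helper, not a LangChain API.
def chunk_size_for_token_budget(token_budget, chars_per_token=4, safety=0.9):
    # safety < 1.0 leaves headroom so chunks stay under the budget
    # even when the real tokenizer is less efficient than assumed.
    return int(token_budget * chars_per_token * safety)

print(chunk_size_for_token_budget(500))  # -> 1800
```

For precise limits, measure with your model's actual tokenizer rather than a character estimate; the character-based chunk_size is only a proxy for tokens.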
Quick Reference
| Method | Description |
|---|---|
| RecursiveCharacterTextSplitter(chunk_size, chunk_overlap) | Splitter that recursively breaks text on separators (paragraphs, lines, spaces) down to chunk_size, keeping chunk_overlap characters of overlap for context. |
| split_text(text: str) | Splits a plain string into chunks. |
| split_documents(documents: List[Document]) | Splits a list of Document objects into smaller Document chunks. |
| chunk_size | Maximum size of each chunk in characters. |
| chunk_overlap | Number of characters to overlap between chunks to keep context. |
Key Takeaways
- Use LangChain's TextSplitter classes like RecursiveCharacterTextSplitter to split long texts.
- Set chunk_size and chunk_overlap thoughtfully to balance chunk size and context.
- Use split_text for plain strings and split_documents for Document objects.
- Avoid too-small chunk sizes to prevent too many fragments.
- Splitting helps manage large documents for language model processing.