LangChain framework · ~5 mins

Why chunk size affects retrieval quality in LangChain

Introduction

Chunk size controls how much text is grouped into each piece when a document is split for retrieval. It determines how precisely the system can find information and how much surrounding context each result carries.

When to Use

When you want to split a long document into smaller parts for searching.
When you notice search results are too broad or too narrow.
When you want to improve the accuracy of answers from a document.
When working with documents that have sections of different lengths.
When optimizing performance for faster retrieval.
Syntax
LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_text(long_text)

chunk_size sets the maximum number of characters in each chunk; chunks can be shorter when the splitter breaks at a natural separator such as a paragraph or sentence boundary.

chunk_overlap preserves context across chunk boundaries by repeating the end of one chunk at the start of the next.
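How overlap repeats text between chunks can be seen with a simplified character-window splitter. This is a sketch of the idea only, not LangChain's actual implementation, which also tries to break at separators like newlines and spaces:

```python
def naive_split(text, chunk_size, chunk_overlap):
    # Each new chunk starts (chunk_size - chunk_overlap) characters
    # after the previous one, so the last chunk_overlap characters
    # of a chunk reappear at the start of the next.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how "cd" appears at the end of the first chunk and again at the start of the second: that repetition is what keeps a sentence from being cut off with no context on either side.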

Examples
Smaller chunks with some overlap keep context but create more pieces.
LangChain
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(long_text)
Larger chunks with no overlap reduce pieces but may lose some context between chunks.
LangChain
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
chunks = text_splitter.split_text(long_text)
Medium chunk size with more overlap balances context and chunk count.
LangChain
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_text(long_text)
Sample Program

This program shows how different chunk sizes split the same text differently. Smaller chunks create more pieces with less text each. Larger chunks create fewer pieces but each contains more text.

This affects how well a search or retrieval system can find and understand information.

LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = (
    "This is a long text document that we want to split into chunks. "
    "Each chunk will be used to help retrieve information more accurately. "
    "If chunks are too big, the search might miss details. "
    "If chunks are too small, the search might lose context. "
    "Finding the right chunk size is important for good results."
)

# Split with small chunk size
small_chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
small_chunks = small_chunk_splitter.split_text(long_text)

# Split with large chunk size
large_chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=10)
large_chunks = large_chunk_splitter.split_text(long_text)

print("Small chunks (chunk_size=50):")
for i, chunk in enumerate(small_chunks, 1):
    print(f"Chunk {i}: {chunk}")

print("\nLarge chunks (chunk_size=150):")
for i, chunk in enumerate(large_chunks, 1):
    print(f"Chunk {i}: {chunk}")
Important Notes

Smaller chunk sizes give more precise search results but increase the number of chunks to process.

Larger chunk sizes keep more context but may include irrelevant information, reducing accuracy.

Choosing chunk size depends on your document length and the detail level you want in retrieval.
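The precision-versus-context trade-off above can be made concrete with a toy keyword search over chunks of different sizes. This is a plain-Python sketch for illustration; a real LangChain pipeline would embed the chunks and query a vector store instead of matching keywords:

```python
def split(text, chunk_size):
    # Fixed-size splitter with no overlap, for illustration only.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

doc = ("Python was created by Guido van Rossum. "
       "It emphasizes readability. "
       "LangChain helps build LLM applications.")

def retrieve(query_word, chunks):
    # Return every chunk that contains the query word.
    return [c for c in chunks if query_word in c]

small = retrieve("LangChain", split(doc, 40))   # precise: one short slice
large = retrieve("LangChain", split(doc, 200))  # broad: the whole document
print(small)
print(large)
```

With chunk_size=40 the match is a single 40-character slice near the relevant sentence; with chunk_size=200 the entire document comes back as one chunk, including the unrelated sentences about Python's history.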

Summary

Chunk size controls how text is split for searching.

Small chunks improve detail but increase processing.

Large chunks keep context but may reduce precision.