Chunk size controls how much text is grouped together when searching. It affects how well the system finds and understands information.
Why chunk size affects retrieval quality in LangChain
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_text(long_text)
```
chunk_size sets the maximum number of characters in each chunk.
chunk_overlap repeats a number of characters between adjacent chunks, so context that falls on a boundary is not lost.
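To make the two parameters concrete, here is a simplified, dependency-free sketch of fixed-size splitting with overlap. LangChain's real splitter is recursive and separator-aware, so this is only a model of the sliding-window behavior, not the library's implementation:

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Naive character splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so adjacent chunks share text."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how the last two characters of each chunk reappear at the start of the next one; that repetition is what chunk_overlap buys you.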
```python
# Smaller chunks with light overlap: more pieces, each more focused
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(long_text)

# Larger chunks with no overlap: fewer pieces, but boundary context can be lost
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
chunks = text_splitter.split_text(long_text)

# Medium chunks with heavier overlap: more repeated context between chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_text(long_text)
```
The program below shows how different chunk sizes split the same text. Smaller chunks produce more pieces with less text each; larger chunks produce fewer pieces, each containing more text.
This affects how well a search or retrieval system can find and understand information.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = (
    "This is a long text document that we want to split into chunks. "
    "Each chunk will be used to help retrieve information more accurately. "
    "If chunks are too big, the search might miss details. "
    "If chunks are too small, the search might lose context. "
    "Finding the right chunk size is important for good results."
)

# Split with small chunk size
small_chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
small_chunks = small_chunk_splitter.split_text(long_text)

# Split with large chunk size
large_chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=10)
large_chunks = large_chunk_splitter.split_text(long_text)

print("Small chunks (chunk_size=50):")
for i, chunk in enumerate(small_chunks, 1):
    print(f"Chunk {i}: {chunk}")

print("\nLarge chunks (chunk_size=150):")
for i, chunk in enumerate(large_chunks, 1):
    print(f"Chunk {i}: {chunk}")
```
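The retrieval effect can be sketched without a vector store. The example below uses a hypothetical scoring rule (count how many query words appear in each chunk, return the best-scoring chunk) over hand-made chunk lists; real systems use embeddings, but the precision-versus-context contrast is the same:

```python
def best_chunk(chunks, query):
    """Toy retriever: score each chunk by how many query words it contains
    and return the highest-scoring chunk."""
    words = query.lower().split()
    return max(chunks, key=lambda c: sum(w in c.lower() for w in words))

doc = ("The invoice total is 42 dollars. "
       "Payment is due in thirty days. "
       "Late payments incur a small fee.")

# Small chunks: one sentence each
small_chunks = ["The invoice total is 42 dollars.",
                "Payment is due in thirty days.",
                "Late payments incur a small fee."]

# Large chunk: the entire document as one piece
large_chunks = [doc]

query = "invoice total"
print(best_chunk(small_chunks, query))  # just the matching sentence
print(best_chunk(large_chunks, query))  # the whole document, match plus extra text
```

With small chunks the retriever hands back exactly the relevant sentence; with one large chunk it returns the detail buried in surrounding text, which the downstream model then has to sift through.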
Smaller chunk sizes give more precise search results but increase the number of chunks to process.
Larger chunk sizes keep more context but may include irrelevant information, reducing accuracy.
Choosing chunk size depends on your document length and the detail level you want in retrieval.
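One practical input to that choice is how many chunks a given setting will produce, since every chunk must be embedded, stored, and searched. A rough back-of-the-envelope estimate for a sliding window (an approximation: the real splitter respects separators, so actual counts vary):

```python
import math

def estimate_chunk_count(doc_chars, chunk_size, chunk_overlap):
    """Minimum chunks needed to cover doc_chars characters when each
    new chunk advances chunk_size - chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((doc_chars - chunk_overlap) / step))

doc_chars = 100_000  # e.g. a document of ~100k characters
for size, overlap in [(500, 50), (1000, 100), (2000, 0)]:
    n = estimate_chunk_count(doc_chars, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: ~{n} chunks")
```

Halving the chunk size roughly doubles the number of chunks, and overlap adds further duplication on top, so precision gains come with a storage and compute cost.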
Chunk size controls how text is split for searching.
Small chunks improve detail but increase processing.
Large chunks keep context but may reduce precision.