Smaller chunks help the retriever find precise matches because they contain focused, relevant information. Very large chunks may mix unrelated content, reducing retrieval accuracy.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = 'Langchain helps build applications with LLMs. It uses chunking to improve retrieval.'

# Small chunks: many short, focused pieces.
splitter_small = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=0)
chunks_small = splitter_small.split_text(text)

# Large chunks: the whole text may fit into a single chunk.
splitter_large = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
chunks_large = splitter_large.split_text(text)

print(len(chunks_small), len(chunks_large))
Smaller chunk sizes create more, shorter pieces of text. This helps the retriever find exact matches in smaller, focused segments, improving answer precision.
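To see this effect without depending on LangChain itself, here is a minimal sketch: a naive fixed-size character splitter (our own helper, not a library function) applied to the same sentence with a small and a large chunk size.

```python
# Naive fixed-size character splitter (illustrative only; real splitters
# such as RecursiveCharacterTextSplitter also respect separators).
def split_fixed(text: str, chunk_size: int) -> list[str]:
    # Slice the text into consecutive windows of at most chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "Langchain helps build applications with LLMs. It uses chunking to improve retrieval."
small = split_fixed(text, 20)   # fine-grained pieces
large = split_fixed(text, 100)  # whole text fits in one chunk
print(len(small), len(large))   # → 5 1
```

Smaller windows over the same 84-character text yield five focused segments, while the larger window keeps everything in a single chunk.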
Option D correctly uses integers for chunk size and overlap, with chunk size larger than overlap, which is required for meaningful chunks.
An option with overlap larger than chunk size is invalid.
An option that passes a string instead of an integer for chunk size causes a type error.
An option that sets chunk size to zero is invalid, since no text could be captured.
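The constraints above can be sketched as explicit checks. This is a hypothetical validation helper of our own, not LangChain's API; its parameter names merely mirror the splitter's `chunk_size` and `chunk_overlap` arguments.

```python
# Hypothetical sketch of the parameter constraints a splitter must enforce.
def validate_chunk_params(chunk_size, chunk_overlap):
    # Both parameters must be integers (a string chunk size is a type error).
    if not isinstance(chunk_size, int) or not isinstance(chunk_overlap, int):
        raise TypeError("chunk_size and chunk_overlap must be integers")
    # A zero (or negative) chunk size captures no text.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Overlap must be strictly smaller than chunk size for meaningful chunks.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

validate_chunk_params(100, 20)       # valid: integers, size > overlap
# validate_chunk_params(20, 100)     # ValueError: overlap larger than size
# validate_chunk_params("100", 20)   # TypeError: string chunk size
# validate_chunk_params(0, 0)        # ValueError: zero chunk size
```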
When chunks are too large, they mix multiple topics or unrelated information. The retriever matches against the whole chunk, so its representation blends those topics, diluting the similarity signal and leading to less relevant answers.
Smaller chunks mean more pieces to process and store, increasing ingestion time and storage needs. However, they help the retriever find precise matches, improving the quality of answers generated downstream.
Larger chunks reduce ingestion overhead but can reduce retrieval precision and answer quality.
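The ingestion-side tradeoff can be made concrete with a small sketch. Using our own illustrative overlap splitter (not a library function), smaller chunks produce both more pieces to embed and, because of overlap, more duplicated characters to store.

```python
# Illustrative overlapping splitter: each chunk starts (chunk_size - overlap)
# characters after the previous one, so overlapping text is stored twice.
def split_with_overlap(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 1000                           # stand-in for a 1000-character document
small = split_with_overlap(doc, 50, 10)    # fine-grained chunks
large = split_with_overlap(doc, 500, 10)   # coarse chunks
print(len(small), sum(map(len, small)))    # → 25 1240  (more chunks, more stored chars)
print(len(large), sum(map(len, large)))    # → 3 1020
```

The fine-grained setting yields 25 chunks and 1240 stored characters versus 3 chunks and 1020 characters, so ingestion and storage costs rise even though each small chunk is more focused for retrieval.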