Semantic chunking helps break large texts into meaningful parts. This makes it easier for language models to understand and process information.
Semantic chunking strategies in LangChain
Introduction
Semantic chunking is useful in situations such as the following:
When you have a long document and want to split it into smaller, related sections.
When preparing text data for a chatbot to answer questions accurately.
When indexing documents for faster and smarter search results.
When you want to keep related ideas together instead of splitting randomly.
When improving the quality of embeddings for better semantic search.
Syntax
LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_text(long_text)
The chunk_size parameter sets the maximum size of each chunk (measured in characters by default).
The chunk_overlap parameter keeps some shared text between consecutive chunks to preserve context across chunk boundaries.
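To see how chunk_size and chunk_overlap interact, here is a minimal sketch in plain Python. It is not LangChain's actual algorithm (RecursiveCharacterTextSplitter tries the separators first and only falls back to hard cuts), but it shows the sliding-window effect of overlap.

```python
# Illustrative only: a fixed-size splitter showing how chunk_size and
# chunk_overlap interact. The window advances by (chunk_size - chunk_overlap)
# characters each step, so the last chunk_overlap characters of one chunk
# reappear at the start of the next.

def simple_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # how far the window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = simple_split("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk repeats the last two characters of the previous one, which is exactly the context-preserving behavior chunk_overlap provides in LangChain.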
Examples
This splits text into chunks of at most 500 characters with a 50-character overlap.
LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_text(long_text)
This splits text by new lines or spaces without overlap.
LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    separators=["\n", " "]
)
chunks = text_splitter.split_text(long_text)
Sample Program
This example splits a short text into chunks of at most 50 characters with a 10-character overlap. It prints each chunk to show how the text is divided.
LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = (
    "LangChain helps you build applications with language models. "
    "It is useful for chatbots, semantic search, and more. "
    "Semantic chunking breaks text into meaningful parts. "
    "This improves understanding and retrieval."
)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = text_splitter.split_text(long_text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
Important Notes
Overlap helps keep context between chunks but increases total size.
Choosing good separators like paragraphs or sentences keeps chunks meaningful.
Test different chunk sizes to find what works best for your data and model.
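The advice above to test different chunk sizes can be sketched without LangChain at all. The helper below is a hypothetical sentence-based splitter written for illustration, not LangChain's API; it shows how increasing the size limit reduces the number of chunks for the same text.

```python
# Illustrative only: a plain sentence-packing splitter used to compare
# chunk counts at different size limits. Sentences are packed greedily
# into chunks of at most max_chars characters.

def split_sentences(text: str, max_chars: int) -> list[str]:
    chunks, current = [], ""
    for sentence in text.split(". "):
        sentence = sentence.rstrip(".") + "."  # restore the trailing period
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

text = (
    "Semantic chunking breaks text into meaningful parts. "
    "Overlap keeps context between chunks. "
    "Separators keep related ideas together."
)

# Larger limits pack more sentences per chunk, so fewer chunks overall.
for size in (40, 80, 160):
    parts = split_sentences(text, size)
    print(f"max_chars={size}: {len(parts)} chunk(s)")
```

Running the same comparison with RecursiveCharacterTextSplitter on your own data is a quick way to find a chunk size that balances context per chunk against the number of chunks your retrieval pipeline has to handle.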
Summary
Semantic chunking splits text into meaningful parts for better language model use.
Use chunk size and overlap to control chunk length and context.
Good separators keep chunks related and easier to understand.