LangChain framework · ~10 mins

Semantic chunking strategies in LangChain - Step-by-Step Execution

Concept Flow - Semantic chunking strategies
Input Text Document
Split Text into Chunks
Apply Semantic Embeddings
Compare Chunk Similarities
Merge or Adjust Chunks Based on Similarity Threshold
Output Optimized Semantic Chunks
The process starts with input text, splits it into chunks, applies semantic embeddings to understand meaning, compares chunks for similarity, merges or adjusts them, and outputs optimized chunks.
Execution Sample
LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
# (in newer releases: from langchain_text_splitters import RecursiveCharacterTextSplitter)

# `text` holds the full document as a single string
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(text)
# Then embed the chunks and merge semantically similar ones
This code splits text into overlapping chunks, preparing for semantic embedding and merging.
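The "embed and merge" comment can be made concrete without any LangChain dependency. In this sketch, small hand-written vectors stand in for real embedding-model output (the values are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for real vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for model output (hypothetical values)
emb_a = [0.9, 0.1, 0.0]   # chunk about topic X
emb_b = [0.8, 0.2, 0.1]   # another chunk about topic X
emb_c = [0.0, 0.1, 0.9]   # chunk about topic Y

print(cosine_similarity(emb_a, emb_b))  # high score: candidates for merging
print(cosine_similarity(emb_a, emb_c))  # low score: keep separate
```

With real embeddings the vectors are much longer (hundreds of dimensions), but the comparison works the same way: a high cosine score marks two chunks as semantically close.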
Execution Table
Step | Action | Input/State | Output/State | Notes
1 | Receive input text | Full document text | Text ready for splitting | Start with raw text
2 | Split text into chunks | Full text | List of text chunks (each ~100 chars, 20-char overlap) | Chunks created with overlap
3 | Generate embeddings | Text chunks | Vector embeddings for each chunk | Semantic meaning captured numerically
4 | Calculate similarity | Embeddings | Similarity scores between chunks | Measure semantic closeness
5 | Merge similar chunks | Chunks + similarity scores | Optimized chunk list | Combine highly similar chunks
6 | Output chunks | Optimized chunks | Final semantic chunks | Ready for downstream tasks
7 | End | N/A | Process complete | All chunks semantically optimized
💡 All chunks processed and merged based on semantic similarity threshold
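Steps 4–5 of the table can be sketched as a greedy merge of adjacent chunks. The chunk texts and similarity scores below are hypothetical, and this is a simplified illustration rather than LangChain's own merging logic:

```python
def merge_similar_neighbours(chunks, scores, threshold):
    """Greedily merge adjacent chunks whose similarity clears the threshold.

    scores[i] is the similarity between chunks[i] and chunks[i + 1],
    mirroring steps 4-5 of the execution table (a sketch, not LangChain code).
    """
    merged = [chunks[0]]
    for chunk, score in zip(chunks[1:], scores):
        if score >= threshold:
            merged[-1] = merged[-1] + " " + chunk  # semantically close: combine
        else:
            merged.append(chunk)                   # new topic: start a new chunk
    return merged

chunks = ["Cats purr.", "Cats also meow.", "Stocks fell today."]
scores = [0.92, 0.15]  # hypothetical adjacent-pair similarities
print(merge_similar_neighbours(chunks, scores, threshold=0.8))
# -> ['Cats purr. Cats also meow.', 'Stocks fell today.']
```

Only neighbouring chunks are considered here; a full implementation could also cap the merged chunk length so merges never exceed the downstream model's context budget.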
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 5 | Final
text | Full document text | Same text | Same text | Same text | Same text
chunks | N/A | List of text chunks | Same list | Merged chunk list | Final chunk list
embeddings | N/A | N/A | List of vectors | Same list | Same list
similarity_scores | N/A | N/A | N/A | Matrix of scores | Used for merging
Key Moments - 3 Insights
Why do we split text into overlapping chunks instead of non-overlapping?
Overlapping chunks preserve context across chunk boundaries, as shown in Step 2 of the execution table, where consecutive chunks share 20 characters.
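A minimal character-window splitter makes the overlap concrete. This is a simplified sketch, not the recursive, separator-aware logic that RecursiveCharacterTextSplitter actually uses:

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=20):
    # Slide a fixed-size window over the text; each new chunk starts
    # (chunk_size - chunk_overlap) characters after the previous one,
    # so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

text = "x" * 250  # stand-in for a real document
chunks = split_with_overlap(text)
print([len(c) for c in chunks])  # -> [100, 100, 90]
```

The last 20 characters of each chunk reappear at the start of the next, so a sentence cut at one boundary is still seen whole in the neighbouring chunk.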
How do embeddings help in merging chunks?
Embeddings convert text chunks into vectors capturing meaning, allowing similarity calculation (step 4) to merge semantically close chunks (step 5).
What happens if similarity threshold is too low or too high?
If the threshold is too low, unrelated chunks get merged and topical detail is lost; if it is too high, related chunks stay fragmented. The merging step (Step 5) balances the two to produce well-sized chunks.
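The trade-off shows up directly in how many chunks survive merging. The similarity scores below are invented for illustration; each adjacent pair that clears the threshold removes one chunk boundary:

```python
def chunk_count_after_merge(scores, threshold):
    # scores[i] is the similarity between adjacent chunks i and i + 1,
    # so len(scores) + 1 chunks exist before merging; each score that
    # clears the threshold merges one pair and removes one chunk.
    merges = sum(1 for s in scores if s >= threshold)
    return len(scores) + 1 - merges

scores = [0.9, 0.4, 0.85, 0.2]  # hypothetical adjacent-pair similarities
print(chunk_count_after_merge(scores, 0.10))  # too low: everything merges -> 1
print(chunk_count_after_merge(scores, 0.95))  # too high: nothing merges -> 5
print(chunk_count_after_merge(scores, 0.80))  # balanced -> 3
```

Sweeping the threshold like this on a sample document is a cheap way to pick a value before committing to one.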
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the output after Step 2?
A) List of overlapping text chunks
B) Vector embeddings of text
C) Similarity scores matrix
D) Final merged chunks
💡 Hint
See the execution table row for Step 2, which describes the chunk-splitting output.
At which step are semantic embeddings generated?
A) Step 5
B) Step 2
C) Step 3
D) Step 6
💡 Hint
Check the execution table row for Step 3, which covers generating embeddings.
If the chunk overlap were removed, which variable in the variable tracker would change after Step 2?
A) text
B) chunks
C) embeddings
D) similarity_scores
💡 Hint
Look at the variable tracker's 'chunks' row after Step 2 and consider the effect of overlap.
Concept Snapshot
Semantic chunking splits text into overlapping pieces,
then converts chunks into vectors (embeddings) to capture meaning.
Chunks with high semantic similarity are merged to optimize size.
This helps keep context and improves downstream processing.
Overlap preserves context; embeddings guide merging.
Final chunks balance detail and size for best results.
Full Transcript
Semantic chunking strategies in LangChain start by taking a full text document and splitting it into smaller overlapping chunks to preserve context. Each chunk is then converted into a vector embedding that captures its semantic meaning. These embeddings are compared to produce similarity scores between chunks. Based on these scores, chunks that are very similar are merged to create optimized semantic chunks. This process ensures that the chunks maintain meaningful context and are sized well for downstream tasks such as search or summarization. Overlapping chunks keep context intact across boundaries, and embeddings let us measure meaning rather than just surface text. The final output is a list of semantically optimized chunks ready for use.