LangChain framework · ~10 mins

Token-based splitting in LangChain - Step-by-Step Execution

Concept Flow - Token-based splitting
1. Input Text
2. Initialize Tokenizer
3. Split Text into Tokens
4. Group Tokens into Chunks
5. Return List of Text Chunks
The text is first tokenized, then tokens are grouped into chunks, and finally the chunks are returned as split text.
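The flow above can be sketched as a minimal, library-free loop. Whitespace tokenization and the helper name split_by_tokens are illustrative assumptions here, not LangChain's API (the real splitter uses a proper tokenizer such as tiktoken):

```python
def split_by_tokens(text, chunk_size):
    """Group tokens into fixed-size chunks; leftovers form a smaller last chunk."""
    tokens = text.split()  # assumption: whitespace tokenization, for illustration only
    chunks = []
    current = []
    for token in tokens:
        current.append(token)
        if len(current) == chunk_size:       # chunk is full: save it
            chunks.append(" ".join(current))
            current = []                     # clear to start the next chunk
    if current:                              # leftover tokens form a smaller final chunk
        chunks.append(" ".join(current))
    return chunks

print(split_by_tokens("Hello world this is a test of token splitting", 5))
# → ['Hello world this is a', 'test of token splitting']
```

The structure mirrors the five flow steps: tokenize, group until the size limit, save, repeat, return.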
Execution Sample
from langchain.text_splitter import TokenTextSplitter

text = "Hello world! This is a test of token splitting."
splitter = TokenTextSplitter(chunk_size=5, chunk_overlap=0)
chunks = splitter.split_text(text)
print(chunks)
This code splits the input text into chunks of 5 tokens each using LangChain's TokenTextSplitter. Note that chunk_overlap=0 is set explicitly: the splitter's default overlap (200) is larger than chunk_size=5, and a chunk_overlap greater than chunk_size raises a ValueError.
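When chunk_overlap is nonzero, consecutive chunks share their boundary tokens, which is how TokenTextSplitter preserves context across chunk borders. A rough sketch of that sliding-window behavior over a token list (split_with_overlap is a hypothetical helper, not the library's implementation):

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Sliding window over tokens: each chunk starts
    chunk_size - chunk_overlap tokens after the previous one."""
    step = chunk_size - chunk_overlap  # assumes 0 <= chunk_overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):  # last window already covers the tail
            break
    return chunks

tokens = ["Hello", "world", "!", "This", "is", "a", "test"]
print(split_with_overlap(tokens, chunk_size=5, chunk_overlap=2))
# → [['Hello', 'world', '!', 'This', 'is'], ['This', 'is', 'a', 'test']]
```

Note how 'This' and 'is' appear in both chunks: that repetition is the overlap.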
Execution Table

| Step | Action | Tokens Processed | Current Chunk Tokens | Chunks Formed |
|---|---|---|---|---|
| 1 | Initialize tokenizer and input text | 0 | | |
| 2 | Tokenize text | 0 | | |
| 3 | Add token 'Hello' to current chunk | 1 | Hello | |
| 4 | Add token 'world' to current chunk | 2 | Hello, world | |
| 5 | Add token '!' to current chunk | 3 | Hello, world, ! | |
| 6 | Add token 'This' to current chunk | 4 | Hello, world, !, This | |
| 7 | Add token 'is' to current chunk | 5 | Hello, world, !, This, is | |
| 8 | Current chunk reached chunk_size=5, save chunk | 5 | | Hello world! This is |
| 9 | Add token 'a' to new chunk | 6 | a | Hello world! This is |
| 10 | Add token 'test' to current chunk | 7 | a, test | Hello world! This is |
| 11 | Add token 'of' to current chunk | 8 | a, test, of | Hello world! This is |
| 12 | Add token 'token' to current chunk | 9 | a, test, of, token | Hello world! This is |
| 13 | Add token 'splitting' to current chunk | 10 | a, test, of, token, splitting | Hello world! This is |
| 14 | Current chunk reached chunk_size=5, save chunk | 10 | | Hello world! This is; a test of token splitting |
| 15 | Add token '.' to new chunk | 11 | . | Hello world! This is; a test of token splitting |
| 16 | End of tokens, save last chunk | 11 | | Hello world! This is; a test of token splitting; . |
| 17 | Return all chunks | 11 | | Hello world! This is; a test of token splitting; . |
💡 All tokens processed and chunks returned.
Variable Tracker

| Variable | Start | After Step 8 | After Step 14 | After Step 16 | Final |
|---|---|---|---|---|---|
| tokens_processed | 0 | 5 | 10 | 11 | 11 |
| current_chunk_tokens | (empty) | (empty) | (empty) | (empty) | (empty) |
| chunks_formed | (empty) | Hello world! This is | Hello world! This is; a test of token splitting | Hello world! This is; a test of token splitting; . | Hello world! This is; a test of token splitting; . |
Key Moments - 2 Insights
Why does the current chunk reset after reaching chunk_size?
Because once the chunk reaches the specified token limit (chunk_size=5), it is saved to the chunks list and the current chunk is cleared to start collecting the next group of tokens. See execution table rows 7-8 and 13-14.
What happens to leftover tokens that don't fill a full chunk?
Leftover tokens form a smaller chunk at the end. In this example, the last token '.' forms its own chunk, as shown in execution table rows 15-16.
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at step 7. How many tokens are in the current chunk?
A. 4
B. 3
C. 5
D. 6
💡 Hint
Check the 'Current Chunk Tokens' column at step 7 in the execution table.
At which step does the first chunk get saved to the chunks list?
A. Step 8
B. Step 10
C. Step 5
D. Step 14
💡 Hint
Look for when 'Chunks Formed' first contains a chunk in the execution table.
If chunk_size was changed to 3, how would the number of chunks change?
A. Fewer chunks would be created
B. More chunks would be created
C. The number of chunks stays the same
D. No chunks would be created
💡 Hint
Smaller chunk_size means smaller groups, so more chunks overall. Refer to the chunk_size effect in the concept flow.
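For the last question, a quick back-of-the-envelope check: with no overlap, the number of chunks is ceil(num_tokens / chunk_size), so shrinking chunk_size from 5 to 3 raises the count:

```python
import math

num_tokens = 11  # token count from the walkthrough above
for chunk_size in (5, 3):
    # ceil division: leftover tokens still need one (smaller) chunk
    print(f"chunk_size={chunk_size}: {math.ceil(num_tokens / chunk_size)} chunks")
# → chunk_size=5: 3 chunks
# → chunk_size=3: 4 chunks
```

So with chunk_size=3 the same 11 tokens produce 4 chunks instead of 3, confirming answer B.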
Concept Snapshot
Token-based splitting:
- Input text is split into tokens.
- Tokens are grouped into chunks of fixed size.
- Each chunk is joined back to text.
- Leftover tokens form a smaller last chunk.
- Useful for processing text in manageable pieces.
Full Transcript
Token-based splitting takes a long text and breaks it into smaller parts based on tokens. First, the text is split into tokens using a tokenizer. Then tokens are collected into groups called chunks, each with a fixed number of tokens. When a chunk reaches the set size, it is saved and a new chunk starts. If tokens remain at the end that don't fill a full chunk, they form a smaller chunk. This method helps handle large texts by splitting them into smaller, manageable pieces for processing.