LangChain framework · ~5 mins

Token-based splitting in LangChain

Introduction

Token-based splitting breaks text into smaller chunks based on tokens, the word and subword pieces that language models actually consume. Splitting at the token level makes large texts manageable and keeps chunk sizes predictable.

Use token-based splitting:

When you want to process long documents without losing context.
When you need to feed text into language models that have token limits.
When you want to split text more precisely than by sentences or paragraphs.
When you need to track token counts for cost or performance reasons.
Syntax
LangChain
from langchain.text_splitter import TokenTextSplitter
# In newer LangChain releases the splitter lives in a separate package:
# from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(text)

chunk_size sets how many tokens each chunk should have.

chunk_overlap sets how many tokens overlap between chunks to keep context.
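To see what these two parameters do, here is a toy sketch of the sliding-window idea behind token splitting. The "tokenizer" is plain whitespace splitting for illustration only; the real TokenTextSplitter uses a model tokenizer, so its token boundaries and counts will differ.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Collect windows of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

tokens = "one two three four five six seven eight nine ten".split()
for chunk in split_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
# one two three four
# four five six seven
# seven eight nine ten
```

Note how the last token of each chunk reappears as the first token of the next one; that repeated token is the overlap that preserves context across chunk boundaries.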

Examples
This splits the text into chunks of 50 tokens each, with 10 tokens of overlap between consecutive chunks.
LangChain
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text("This is a long text that needs splitting into tokens.")
This splits text into chunks of 200 tokens without any overlap.
LangChain
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = splitter.split_text(text)
Sample Program

This example splits a short text into chunks of 10 tokens each, with 3 tokens overlapping. It prints each chunk so you can see how the text is divided.

LangChain
from langchain.text_splitter import TokenTextSplitter

text = "LangChain helps you build applications with language models. Token-based splitting breaks text into token chunks for better processing."

splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=3)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
Important Notes

Token splitting depends on the tokenizer being used; LangChain's TokenTextSplitter counts tokens with tiktoken, so counts may differ from those of your model's own tokenizer.

Overlap helps keep context but increases total tokens processed.

Adjust chunk size and overlap based on your model's token limits and needs.
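The cost of overlap is easy to estimate. Assuming the splitter advances by chunk_size - chunk_overlap tokens per step (the sliding-window behavior sketched above; actual numbers from TokenTextSplitter may vary slightly), the chunk count and total tokens processed are:

```python
import math

def chunk_stats(n_tokens, chunk_size, chunk_overlap):
    """Estimate chunk count and total tokens processed by a sliding window."""
    step = chunk_size - chunk_overlap
    if n_tokens <= chunk_size:
        return 1, n_tokens
    n_chunks = math.ceil((n_tokens - chunk_overlap) / step)
    # Every chunk is full-size except possibly the last one.
    last_start = (n_chunks - 1) * step
    total = (n_chunks - 1) * chunk_size + (n_tokens - last_start)
    return n_chunks, total

print(chunk_stats(1000, 100, 20))  # → (13, 1240)
```

For a 1000-token document with chunk_size=100 and chunk_overlap=20, you process about 1240 tokens across 13 chunks: roughly 24% more tokens than the document itself. That extra cost buys context continuity across chunk boundaries.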

Summary

Token-based splitting breaks text into token-sized chunks for easier processing.

Use chunk size and overlap to control chunk length and context.

This method is useful when working with language models that have token limits.