Why do we use text splitters when preparing data for language models?
Think about the input size limits of language models.
Text splitters break long documents into smaller pieces so each piece fits within a model's input size limit (its context window).
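A minimal sketch of the idea (the helper name `split_by_chars` is illustrative, not from a specific library): a fixed-size character splitter that breaks a document into pieces small enough for a model's input limit.

```python
def split_by_chars(text, chunk_size):
    # Produce consecutive slices of at most chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

document = "A long document that exceeds the model input limit."
print(split_by_chars(document, 20))
```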
What is the output of this code that splits text into chunks of 5 characters?
text = 'HelloWorld!'
chunks = [text[i:i+5] for i in range(0, len(text), 5)]
print(chunks)
Look at how slicing works with step size 5.
The code slices the string every 5 characters, so it produces ['Hello', 'World', '!'].
You want to split a long document into chunks that keep sentences intact for semantic search. Which splitter is best?
Think about preserving meaning in chunks.
Sentence splitters keep sentences whole, preserving meaning better for semantic search than arbitrary character or word splits.
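A hedged sketch of a sentence splitter. Real libraries (e.g. NLTK's `sent_tokenize`) handle abbreviations and other edge cases; this regex-based version is only illustrative.

```python
import re

def split_sentences(text):
    # Split after '.', '!', or '?' followed by whitespace,
    # so each chunk is a whole sentence.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(split_sentences("First sentence. Second one! A third?"))
# → ['First sentence.', 'Second one!', 'A third?']
```

Because each chunk is a complete sentence, its embedding reflects a coherent unit of meaning, which suits semantic search better than arbitrary character cuts.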
What is the main effect of increasing chunk overlap when splitting text?
Overlap means repeating some text in adjacent chunks.
Increasing overlap repeats some text across adjacent chunks, which preserves context at chunk boundaries at the cost of redundancy (more total text to store and process).
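A small sketch of overlapping chunks (the helper `split_with_overlap` is hypothetical): each chunk repeats the last `overlap` characters of the previous one.

```python
def split_with_overlap(text, chunk_size, overlap):
    # Advance by chunk_size - overlap so adjacent chunks share `overlap` chars.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(split_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note how 'cd', 'ef', and 'gh' each appear in two chunks: that is the redundancy bought in exchange for boundary context.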
What error does this recursive text splitter code raise?
def recursive_split(text, max_len):
    if len(text) <= max_len:
        return [text]
    else:
        split_point = text.rfind('.', 0, max_len)
        if split_point == -1:
            split_point = max_len
        return [text[:split_point]] + recursive_split(text[split_point:], max_len)

chunks = recursive_split('This is a sentence. This is another sentence.', 10)
print(chunks)
Trace the recursion: what happens once a remaining chunk starts with a period, so that rfind returns 0?
The first call finds no period in the first 10 characters and falls back to split_point = max_len, which is fine. But the split at a period leaves the next chunk starting with '.', so rfind('.', 0, max_len) returns 0 on that chunk. With split_point = 0, text[split_point:] is the same string, and the function calls itself with identical arguments forever, raising RecursionError.
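One way to fix the bug (an assumption for illustration, not the original author's fix): only accept a split point that actually shortens the remainder, and include the period in the current chunk so the next chunk never starts with '.'.

```python
def recursive_split(text, max_len):
    # Base case: the text already fits in one chunk.
    if len(text) <= max_len:
        return [text]
    split_point = text.rfind('.', 0, max_len)
    if split_point <= 0:
        # No period found (or one at index 0, which would not shrink the
        # remainder): fall back to a hard cut at max_len.
        return [text[:max_len]] + recursive_split(text[max_len:], max_len)
    # Keep the period with the current chunk so progress is guaranteed.
    return [text[:split_point + 1]] + recursive_split(text[split_point + 1:], max_len)

print(recursive_split('This is a sentence. This is another sentence.', 10))
```

Every recursive call now operates on a strictly shorter string, so the function terminates and all chunks respect max_len.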