Imagine you have a very long document to analyze with an AI model. Why is it useful to split this document into smaller chunks before processing?
Think about how computers handle large amounts of data and model input limits.
Chunking breaks a large document into smaller pieces so that each piece fits within the AI model's input (context window) limit. Processing manageable pieces keeps the analysis accurate and efficient instead of truncating or rejecting input that is too long.
What is the output of this Python code that chunks a text into pieces of 5 words?
text = 'Machine learning helps computers learn from data and improve over time'
words = text.split()
chunks = [' '.join(words[i:i+5]) for i in range(0, len(words), 5)]
print(chunks)
Look at how the range steps by 5 and how words are joined.
The code splits the text into 11 words, then groups every 5 words into one chunk, printing ['Machine learning helps computers learn', 'from data and improve over', 'time']. The last chunk has fewer than 5 words because 11 is not a multiple of 5.
You want to create vector embeddings from a large document for a search system. Which chunk size strategy is best to balance context and model limits?
Consider model input size limits and the need for meaningful context.
Moderate chunk sizes keep enough context for meaningful embeddings while respecting model input size limits. Too small loses context; too large exceeds limits.
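As an illustration, here is a minimal sketch of word-based chunking with a moderate size and some overlap between consecutive chunks, so context at a chunk boundary is not lost. The chunk size and overlap values are illustrative assumptions, not recommendations.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word chunks of chunk_size, sharing `overlap` words with the next chunk."""
    words = text.split()
    step = chunk_size - overlap  # advance fewer words than chunk_size so neighbors overlap
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Stand-in for a long document: 500 placeholder words.
doc = ' '.join(f'word{n}' for n in range(500))
chunks = chunk_words(doc)
print(len(chunks))             # chunks produced for 500 words
print(len(chunks[0].split()))  # a full chunk holds chunk_size words
```

With 500 words, a chunk size of 200, and a step of 150, this yields chunks starting at words 0, 150, 300, and 450; only the last chunk is short.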
You test two chunking strategies for document search: small chunks (50 words) and large chunks (500 words). Which metric would best show if chunking size affects search accuracy?
Think about how to measure search quality and relevance.
Recall@k measures how many relevant documents appear in the top k search results, showing how chunking affects retrieval accuracy.
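Recall@k is simple to compute directly. A small sketch, where the document IDs and rankings are made up purely for illustration:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved results."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Hypothetical ranked results from two chunking strategies for the same query.
relevant = ['doc2', 'doc5', 'doc7']
small_chunk_results = ['doc2', 'doc5', 'doc1', 'doc7', 'doc3']
large_chunk_results = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']

print(recall_at_k(small_chunk_results, relevant, 5))  # all 3 relevant docs in the top 5
print(recall_at_k(large_chunk_results, relevant, 5))  # only 2 of 3 relevant docs found
```

Comparing Recall@k across the two strategies on the same query set shows whether chunk size changes which relevant documents surface in the top results.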
What error does this code raise when trying to create overlapping chunks of size 4 with step 2?
text = 'AI models learn patterns from data to make predictions'
words = text.split()
chunks = [words[i:i+4] for i in range(0, len(words), 2)]
print(chunks[10])
Check how many chunks are created and if index 10 is valid.
The sentence has 9 words, so range(0, 9, 2) yields start indices 0, 2, 4, 6, 8 and only 5 chunks. Index 10 is out of range, so accessing chunks[10] raises an IndexError.
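One way to avoid the error is to compute the chunk count first and guard the index, a sketch of which might look like this:

```python
import math

text = 'AI models learn patterns from data to make predictions'
words = text.split()  # 9 words
step, size = 2, 4
chunks = [words[i:i + size] for i in range(0, len(words), step)]

# The number of chunks equals ceil(len(words) / step): ceil(9 / 2) = 5.
print(len(chunks))

# Guard the index instead of assuming it exists.
index = 10
if index < len(chunks):
    print(chunks[index])
else:
    print(f'Only {len(chunks)} chunks; index {index} is out of range.')
```

Iterating over the chunks (for chunk in chunks: ...) sidesteps the problem entirely, since no explicit index is used.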