Document loading and chunking strategies in Agentic AI - ML Experiment: Train & Evaluate

Experiment - Document loading and chunking strategies

Problem: You want to load large text documents and split them into smaller pieces (chunks) so an AI agent can understand and process them better. Currently, the chunks are either too big or too small, causing slow processing or loss of important context.

Current Metrics: Average chunk size: 2000 characters; Processing time per document: 15 seconds; Context loss rate: 30%

Issue: Chunks are too large, which slows processing, and some chunks lose important context because the splits ignore sentence boundaries.
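To make the issue concrete, here is a minimal sketch of the kind of fixed-size splitting that produces it. The function names `fixed_size_chunks` and `ends_mid_sentence` and the sample text are illustrative assumptions, not part of the original experiment, and the punctuation check is only a rough heuristic.

```python
def fixed_size_chunks(text, size=2000):
    """Cut text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ends_mid_sentence(chunk):
    """Rough heuristic: a chunk not ending in ., ! or ? was probably cut mid-sentence."""
    return not chunk.rstrip().endswith(('.', '!', '?'))

# Hypothetical sample document built from two repeated sentences
text = "The agent reads the report. It then summarizes each section. " * 100
chunks = fixed_size_chunks(text, size=2000)
cut = sum(ends_mid_sentence(c) for c in chunks)
print(f"{len(chunks)} chunks, {cut} end mid-sentence")
```

No content is lost, but any chunk that ends mid-sentence hands the agent a fragment whose meaning depends on text stored in a different chunk.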

Your Task

Improve document chunking to reduce processing time below 10 seconds and context loss rate below 15%, while keeping chunk sizes between 500 and 1000 characters.

Must keep chunk sizes between 500 and 1000 characters

Must preserve sentence boundaries to avoid breaking sentences

Cannot reduce document quality or remove content

Solution

import nltk
from nltk.tokenize import sent_tokenize

# Tokenizer data for sent_tokenize; recent NLTK releases look for 'punkt_tab'
# rather than 'punkt', so both are requested here.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

def chunk_document(text, min_size=500, max_size=1000, overlap=100):
    """Split text into sentence-aligned chunks, targeting min_size..max_size characters.

    A single sentence longer than max_size is kept whole as its own chunk.
    """
    sentences = sent_tokenize(text)

    def span_len(start, end):
        # Length of sentences[start:end] joined by single spaces
        return len(' '.join(sentences[start:end]))

    spans = []  # (start, end) sentence index ranges, end exclusive
    start = 0
    for i in range(len(sentences)):
        # Close the current chunk once sentence i no longer fits
        if i > start and span_len(start, i + 1) > max_size:
            spans.append((start, i))
            # Overlap: step back over whole sentences worth at most `overlap`
            # characters, as long as the new chunk still fits in max_size
            new_start = i
            while (new_start - 1 > start
                   and span_len(new_start - 1, i) <= overlap
                   and span_len(new_start - 1, i + 1) <= max_size):
                new_start -= 1
            start = new_start
    if start < len(sentences):
        spans.append((start, len(sentences)))

    chunks = []
    for start, end in spans:
        # Pad a chunk shorter than min_size with preceding sentences (extra
        # overlap) while it stays within max_size
        while (start > 0 and span_len(start, end) < min_size
               and span_len(start - 1, end) <= max_size):
            start -= 1
        chunks.append(' '.join(sentences[start:end]))
    return chunks

# Example usage
sample_text = """Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to learn from data. It is seen as a subset of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. The algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks."""

chunks = chunk_document(sample_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} (length {len(chunk)}):\n{chunk}\n")

Implemented sentence-based splitting to avoid breaking sentences

Grouped sentences into chunks of at most 1000 characters, targeting at least 500

Added an overlap of up to 100 characters of whole sentences between chunks to preserve context

Padded chunks shorter than the minimum with preceding sentences where the maximum size allows

Results Interpretation

Before: Average chunk size 2000 chars, processing time 15s, context loss 30%.

After: Average chunk size 850 chars, processing time 8s, context loss 12%.

Splitting documents by sentences and adding overlap helps keep important context while reducing chunk size and processing time. This balances speed and understanding for AI agents.
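To check the chunk-size part of these results on your own output, a small report like the following can help. This is a sketch under assumptions: `chunk_report` is a hypothetical helper, the example chunks are synthetic strings, and it covers only chunk sizes. Processing time and context loss rate depend on the agent pipeline and are not measured here.

```python
from statistics import mean

def chunk_report(chunks, min_size=500, max_size=1000):
    """Summarize chunk lengths and count chunks outside the target range."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "avg_size": mean(lengths) if lengths else 0,
        "too_small": sum(n < min_size for n in lengths),
        "too_large": sum(n > max_size for n in lengths),
    }

# Synthetic chunks of 700, 900 and 300 characters for illustration
example = ["a" * 700, "b" * 900, "c" * 300]
report = chunk_report(example)
print(report)
```

Non-zero `too_small` or `too_large` counts show where the 500 to 1000 character constraint is not met, which is common for the final chunk of a document.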

Bonus Experiment

Try using semantic chunking by splitting documents based on topic changes instead of just sentence length.

💡 Hint

Use simple keyword matching or clustering to detect topic shifts and create chunks accordingly.
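One possible starting point for the keyword-matching idea is sketched below. It assumes the document is already split into sentences, uses a tiny illustrative stopword list, and starts a new chunk when a sentence shares too few keywords with the chunk built so far. The names `keywords`, `jaccard`, and `topic_chunks` and the 0.1 threshold are assumptions chosen for illustration, not a tuned method.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in",
             "it", "on", "for", "with", "that", "this"}

def keywords(sentence):
    """Lowercased words of a sentence, minus stopwords."""
    return {w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Share of keywords two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def topic_chunks(sentences, threshold=0.1):
    """Start a new chunk when a sentence shares too few keywords with the current chunk."""
    chunks, current, current_words = [], [], set()
    for sentence in sentences:
        words = keywords(sentence)
        if current and jaccard(words, current_words) < threshold:
            chunks.append(" ".join(current))
            current, current_words = [], set()
        current.append(sentence)
        current_words |= words
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = [
    "Neural networks learn weights from training data.",
    "Training data quality affects how well neural networks generalize.",
    "Sourdough bread needs a long fermentation.",
    "Fermentation gives sourdough bread its flavor.",
]
topics = topic_chunks(sentences)
for chunk in topics:
    print(chunk)
```

Because similarity is measured against all keywords accumulated in the current chunk, scores tend to drop as a chunk grows, so the threshold needs adjusting per corpus. Size limits such as the 500 to 1000 character range would still have to be enforced separately.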

Practice

(1/5)

1. What is the main purpose of chunking in document loading for AI?

easy

A. To translate documents into different languages

B. To combine multiple documents into one large file

C. To break large documents into smaller, manageable pieces

D. To remove all punctuation from the text

Solution

Step 1: Understand chunking concept

Step 2: Identify the main goal

Final Answer: C. To break large documents into smaller, manageable pieces

Quick Check:

Solution

Step 1: Check parameter names

Step 2: Verify values make sense

Final Answer:

Quick Check:

Solution

Step 1: Calculate chunk positions

Step 2: Count chunks covering 250 characters

Final Answer:

Quick Check:

Solution

Step 1: Check parameter relationship

Step 2: Identify error cause

Final Answer:

Quick Check:

Solution

Step 1: Consider model token limit

Step 2: Choose overlap for context

Step 3: Evaluate other options

Final Answer:

Quick Check: