We split big documents into smaller parts to help AI understand and work with them better.
Document loading and chunking strategies in Agentic AI
loader = DocumentLoader('file_path')
documents = loader.load()
chunks = Chunker(documents, chunk_size=500, overlap=50).chunk()
DocumentLoader reads the full document from a file or source.
Chunker splits the document into smaller pieces with optional overlap to keep context.
loader = DocumentLoader('report.txt')
docs = loader.load()
chunker = Chunker(docs, chunk_size=1000, overlap=100)
chunks = chunker.chunk()

loader = DocumentLoader('book.pdf')
docs = loader.load()
chunks = Chunker(docs, chunk_size=300, overlap=0).chunk()
This program loads a text file, splits it into chunks of 100 characters with 20 characters overlapping, then prints how many chunks were made and shows the start of the first chunk.
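class DocumentLoader:
    """Reads a whole text file into one string."""

    def __init__(self, file_path):
        self.file_path = file_path

    def load(self):
        with open(self.file_path, 'r', encoding='utf-8') as f:
            return f.read()


class Chunker:
    """Splits text into fixed-size character windows with optional overlap."""

    def __init__(self, text, chunk_size=500, overlap=50):
        # The window advances by chunk_size - overlap characters, so that step
        # must be positive; otherwise the loop in chunk() would never finish.
        if not 0 <= overlap < chunk_size:
            raise ValueError('overlap must be >= 0 and smaller than chunk_size')
        self.text = text
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self):
        chunks = []
        start = 0
        text_length = len(self.text)
        while start < text_length:
            # The last chunk is cut off at the end of the text.
            end = min(start + self.chunk_size, text_length)
            chunks.append(self.text[start:end])
            # Move forward, leaving `overlap` characters shared with the next chunk.
            start += self.chunk_size - self.overlap
        return chunks


# Sample usage
loader = DocumentLoader('sample.txt')
text = loader.load()
chunker = Chunker(text, chunk_size=100, overlap=20)
chunks = chunker.chunk()
print(f'Total chunks created: {len(chunks)}')
print('First chunk preview:', chunks[0][:50])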
Overlap helps keep context between chunks but increases total data size.
Chunk size depends on the AI model's input limits and on how the document is structured, such as the length of its paragraphs and sections.
Always check the chunk boundaries to avoid cutting sentences awkwardly.
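One way to avoid cutting sentences is to build chunks from whole sentences instead of fixed character windows. The sketch below is only an illustration: the function name chunk_by_sentence and the max_chars parameter are made up for this example, and the sentence splitter is a naive regular expression that will also split after abbreviations such as "e.g.".

```python
import re


def chunk_by_sentence(text, max_chars=100):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks = []
    current = ''
    for sentence in sentences:
        candidate = f'{current} {sentence}' if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = 'Chunking splits text. Overlap keeps context. Boundaries matter a lot.'
for piece in chunk_by_sentence(sample, max_chars=45):
    print(repr(piece))
# 'Chunking splits text. Overlap keeps context.'
# 'Boundaries matter a lot.'
```

Every chunk here ends at a sentence boundary, at the cost of chunks having uneven lengths.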
Chunking breaks big documents into smaller, manageable pieces.
Overlap keeps context between chunks for better AI understanding.
Choosing chunk size and overlap depends on your AI model and task.
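To see the trade-off in numbers, the sketch below counts how many chunks a sliding-window chunker produces and how many characters it stores in total for different overlap values. The helper chunk_stats is hypothetical and assumes the same windowing rule as the Chunker class shown earlier (each chunk starts chunk_size - overlap characters after the previous one).

```python
def chunk_stats(doc_length, chunk_size, overlap):
    """Return (chunk_count, total_chars_stored) for a sliding-window chunker."""
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be >= 0 and smaller than chunk_size')
    step = chunk_size - overlap
    starts = range(0, doc_length, step)
    # Each chunk is cut off at the end of the document.
    total = sum(min(s + chunk_size, doc_length) - s for s in starts)
    return len(starts), total


for overlap in (0, 20, 50):
    print(overlap, chunk_stats(1000, chunk_size=100, overlap=overlap))
# 0 (10, 1000)
# 20 (13, 1240)
# 50 (20, 1950)
```

For a 1000-character document, an overlap of 50 nearly doubles the stored text compared with no overlap.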
Practice
Solution
Step 1: Understand chunking concept
Chunking means splitting big documents into smaller parts so AI can handle them easily.
Step 2: Identify the main goal
The goal is to make documents manageable, not to combine or translate them.
Final Answer:
To break large documents into smaller, manageable pieces -> Option C
Quick Check:
Chunking = breaking big documents [OK]
- Thinking chunking combines documents
- Confusing chunking with translation
- Assuming chunking removes punctuation
Solution
Step 1: Check parameter names
The standard parameters are usually named chunk_size and overlap.
Step 2: Verify values make sense
Chunk size should be larger than overlap, so 500 and 50 is logical.
Final Answer:
loader.load(chunk_size=500, overlap=50) -> Option B
Quick Check:
Correct params = chunk_size and overlap [OK]
- Using wrong parameter names like size or chunk
- Swapping chunk size and overlap values
- Using overlap larger than chunk size
chunks = loader.load(chunk_size=100, overlap=20)
print(len(chunks))
If the original document has 250 characters, what will be the output?
Solution
Step 1: Calculate chunk positions
Chunks start every (chunk_size - overlap) = 80 characters: positions 0, 80, 160, 240.
Step 2: Count chunks covering 250 characters
Chunks at 0, 80, 160, and 240 cover the document. The last chunk starts at 240 and would span 240-340, but it is cut off at the document end, so it holds only characters 240-250.
Final Answer:
4 -> Option A
Quick Check:
Chunks = ceil(document_length / (chunk_size - overlap)) = ceil(250 / 80) = ceil(3.125) = 4 [OK]
- Ignoring overlap when counting chunks
- Assuming chunks equal document length divided by chunk size
- Not counting last partial chunk
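The count in the solution above can be checked with a few lines of code. This sketch assumes the loader uses the same sliding-window rule as the Chunker class shown earlier, where a new chunk begins at every start position that falls inside the document; chunk_starts is a hypothetical helper written for this check.

```python
import math


def chunk_starts(doc_length, chunk_size, overlap):
    """Start offsets when each chunk begins chunk_size - overlap after the last."""
    step = chunk_size - overlap
    return list(range(0, doc_length, step))


starts = chunk_starts(250, chunk_size=100, overlap=20)
print(starts)                       # [0, 80, 160, 240]
print(len(starts))                  # 4
print(math.ceil(250 / (100 - 20)))  # 4
```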
chunks = loader.load(chunk_size=100, overlap=150)
What is the likely cause?
Solution
Step 1: Check parameter relationship
Overlap must be smaller than chunk size. Each chunk starts (chunk_size - overlap) characters after the previous one, so with an overlap equal to or larger than the chunk size the start position never moves forward.
Step 2: Identify error cause
Setting overlap=150 with chunk_size=100 is invalid and causes an error.
Final Answer:
Overlap is larger than chunk size, causing invalid chunking -> Option D
Quick Check:
Overlap < chunk size [OK]
- Setting overlap larger than chunk size
- Assuming chunk size can be zero
- Ignoring parameter constraints
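A chunker can catch these mistakes early by checking its parameters before it starts. The function below is a minimal sketch; validate_chunk_params is a hypothetical name, and the exact error a real library raises may differ.

```python
def validate_chunk_params(chunk_size, overlap):
    """Return the step between chunk starts, or raise if the values are invalid."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if overlap < 0:
        raise ValueError('overlap must not be negative')
    if overlap >= chunk_size:
        # A step of zero or less would mean the chunker never advances.
        raise ValueError('overlap must be smaller than chunk_size')
    return chunk_size - overlap


print(validate_chunk_params(100, 20))  # 80

try:
    validate_chunk_params(100, 150)
except ValueError as err:
    print('Rejected:', err)  # Rejected: overlap must be smaller than chunk_size
```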
Solution
Step 1: Consider model token limit
The model can handle at most 512 tokens, so chunk size must be ≤ 512.
Step 2: Choose overlap for context
Overlap keeps context between chunks; a chunk size of 256 with an overlap of 128 balances size and context.
Step 3: Evaluate other options
Zero overlap loses context; a chunk size above 512 exceeds the limit; very small chunks increase overhead.
Final Answer:
Use chunk size 256 with overlap 128 to keep context between chunks -> Option A
Quick Check:
Chunk size ≤ token limit, with overlap for context [OK]
- Ignoring token limit and using too large chunks
- Using zero overlap losing context
- Choosing too small chunks causing inefficiency
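The choice in this solution can be sketched as token-based chunking. The code below is an illustration under stated assumptions: chunk_tokens is a hypothetical function, the tokens are placeholder strings rather than the output of a real model tokenizer, and unlike the Chunker class shown earlier it stops as soon as a window reaches the end of the input.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=128, max_tokens=512):
    """Split a token list into overlapping windows that fit the model limit."""
    if chunk_size > max_tokens:
        raise ValueError('chunk_size exceeds the model token limit')
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be >= 0 and smaller than chunk_size')
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once this window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Placeholder tokens (assumption): a real pipeline would use the model's tokenizer.
tokens = [f'tok{i}' for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])           # [256, 256, 256, 216]
print(chunks[0][128:] == chunks[1][:128])  # True
```

Each chunk stays under the 512-token limit, and neighbouring chunks share 128 tokens of context.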
