Splitting large documents into smaller parts helps AI models understand and work with them more effectively.
Document loading and chunking strategies in Agentic AI
Introduction
Document loading and chunking are useful in situations such as:
When you have a long report and want the AI to find specific information quickly.
When feeding text into an AI model that can only handle a limited amount of text at once.
When you want to organize a book into chapters for easier searching.
When preparing documents for AI summarization or question answering.
When you want to improve AI speed by processing smaller pieces of text.
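Chunk counts are easy to estimate up front. Assuming the sliding-window scheme used in the Sample Model section below (each chunk advances by chunk_size - overlap characters), the number of chunks is the document length divided by that step, rounded up. The helper name num_chunks is illustrative, not part of any library:

```python
import math

def num_chunks(doc_length, chunk_size, overlap):
    # Each window advances by (chunk_size - overlap) characters,
    # so the chunk count is the document length divided by that
    # step, rounded up.
    step = chunk_size - overlap
    return math.ceil(doc_length / step)

# A 50,000-character report, 1,000-character chunks, 100-character overlap:
print(num_chunks(50_000, 1_000, 100))  # 56
```

Note how overlap increases the count: with no overlap the same report needs only 50 chunks.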
Syntax
loader = DocumentLoader('file_path')
documents = loader.load()
chunks = Chunker(documents, chunk_size=500, overlap=50).chunk()
DocumentLoader reads the full document from a file or source.
Chunker splits the document into smaller pieces with optional overlap to keep context.
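To make the overlap behaviour concrete, here is a minimal sketch of the sliding-window splitting described above. The function chunk_text is a hypothetical stand-in for Chunker.chunk: the last overlap characters of each chunk reappear at the start of the next, which is how context is carried across chunk boundaries.

```python
def chunk_text(text, chunk_size, overlap):
    # Slide a fixed-size window over the text; each step advances
    # by (chunk_size - overlap), so neighbouring chunks share text.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text('abcdefghij', chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk's last two characters ('cd', 'ef', 'gh') open the next chunk, so no boundary loses its surrounding context.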
Examples
This loads a report and splits it into 1,000-character chunks with a 100-character overlap.
loader = DocumentLoader('report.txt')
docs = loader.load()
chunker = Chunker(docs, chunk_size=1000, overlap=100)
chunks = chunker.chunk()
This loads a PDF book and splits it into 300-character chunks with no overlap. Note that a PDF needs a loader that can extract text from the PDF format; the plain-text loader in the Sample Model below will not parse PDF content.
loader = DocumentLoader('book.pdf')
docs = loader.load()
chunks = Chunker(docs, chunk_size=300, overlap=0).chunk()
Sample Model
This program loads a text file, splits it into 100-character chunks with a 20-character overlap, then prints how many chunks were made and shows the start of the first chunk.
class DocumentLoader:
    def __init__(self, file_path):
        self.file_path = file_path

    def load(self):
        # Read the whole file as plain text.
        with open(self.file_path, 'r', encoding='utf-8') as f:
            return f.read()

class Chunker:
    def __init__(self, text, chunk_size=500, overlap=50):
        # Guard against an infinite loop: the window must move forward.
        if overlap >= chunk_size:
            raise ValueError('overlap must be smaller than chunk_size')
        self.text = text
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self):
        # Slide a window of chunk_size characters, stepping forward
        # by (chunk_size - overlap) so consecutive chunks share text.
        chunks = []
        start = 0
        text_length = len(self.text)
        while start < text_length:
            end = min(start + self.chunk_size, text_length)
            chunks.append(self.text[start:end])
            start += self.chunk_size - self.overlap
        return chunks

# Sample usage
loader = DocumentLoader('sample.txt')
text = loader.load()
chunker = Chunker(text, chunk_size=100, overlap=20)
chunks = chunker.chunk()
print(f'Total chunks created: {len(chunks)}')
print('First chunk preview:', chunks[0][:50])
Important Notes
Overlap helps keep context between chunks but increases total data size.
Chunk size depends on the AI model's input limits and the document's nature.
Always check the chunk boundaries to avoid cutting sentences awkwardly.
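One way to avoid cutting sentences awkwardly is to split on sentence boundaries first and then pack whole sentences into chunks. The sketch below is illustrative and not part of the Chunker shown above; the function name chunk_by_sentence and the sentence-splitting regex are assumptions for demonstration:

```python
import re

def chunk_by_sentence(text, max_chars=200):
    # Split on sentence-ending punctuation followed by whitespace,
    # then pack whole sentences into chunks of at most max_chars.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        # Start a new chunk if adding this sentence would overflow.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f'{current} {sentence}'.strip()
    if current:
        chunks.append(current)
    return chunks

text = 'One. Two two. Three three three.'
print(chunk_by_sentence(text, max_chars=12))
# ['One.', 'Two two.', 'Three three three.']
```

A sentence longer than max_chars still becomes its own oversized chunk here; a production splitter would fall back to character-level splitting in that case.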
Summary
Chunking breaks big documents into smaller, manageable pieces.
Overlap keeps context between chunks for better AI understanding.
Choosing chunk size and overlap depends on your AI model and task.