
Document loading and chunking strategies in Agentic AI

Introduction

We split large documents into smaller parts (chunks) so that AI models can process and reason over them more effectively. Chunking is useful in situations like these:

When you have a long report and want the AI to find specific information quickly.
When feeding text into an AI model with a limited context window.
When you want to organize a book into chapters or sections for easier searching.
When preparing documents for AI summarization or question answering.
When you want to reduce processing time by working on smaller pieces of text.
Syntax
loader = DocumentLoader('file_path')
documents = loader.load()
chunks = Chunker(documents, chunk_size=500, overlap=50).chunk()

DocumentLoader reads the full document from a file or source.

Chunker splits the document into smaller pieces with optional overlap to keep context.
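The exact implementation of Chunker varies, but the sliding-window idea behind chunk_size and overlap can be sketched as follows (sliding_chunks is a hypothetical helper written for illustration, not a library function):

```python
def sliding_chunks(text, chunk_size, overlap):
    # Slide a window of chunk_size characters over the text,
    # moving forward by (chunk_size - overlap) each step so that
    # consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, overlap=2))
# Prints ['abcd', 'cdef', 'efgh', 'ghij'] -- each chunk repeats
# the last 2 characters of the previous one.
```

Notice how the repeated characters at each boundary are the "context" that overlap preserves: a sentence cut at one chunk's edge still appears whole at the start of the next.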

Examples
This example loads a report and splits it into chunks of 1,000 characters with a 100-character overlap.
loader = DocumentLoader('report.txt')
docs = loader.load()
chunker = Chunker(docs, chunk_size=1000, overlap=100)
chunks = chunker.chunk()
This loads a PDF book and splits it into 300-character chunks with no overlap. (Note: extracting text from a PDF requires a PDF parser; a plain-text loader cannot decode PDF bytes directly.)
loader = DocumentLoader('book.pdf')
docs = loader.load()
chunks = Chunker(docs, chunk_size=300, overlap=0).chunk()
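A loader that handles both file types might dispatch on the extension. This is a hedged sketch: load_document is a hypothetical helper, and the PDF branch assumes the third-party pypdf package is installed.

```python
from pathlib import Path

def load_document(file_path):
    # Hypothetical loader: plain text is read directly; PDFs are
    # parsed with pypdf (assumption: pypdf is installed).
    path = Path(file_path)
    if path.suffix.lower() == ".pdf":
        from pypdf import PdfReader  # third-party dependency
        reader = PdfReader(path)
        # extract_text() can return None for image-only pages.
        return "".join(page.extract_text() or "" for page in reader.pages)
    return path.read_text(encoding="utf-8")
```

Dispatching inside one loader keeps the rest of the pipeline unchanged: the chunker only ever sees a plain string, regardless of the source format.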
Sample Model

This program loads a text file, splits it into chunks of 100 characters with 20 characters overlapping, then prints how many chunks were made and shows the start of the first chunk.

class DocumentLoader:
    def __init__(self, file_path):
        self.file_path = file_path

    def load(self):
        with open(self.file_path, 'r', encoding='utf-8') as f:
            return f.read()

class Chunker:
    def __init__(self, text, chunk_size=500, overlap=50):
        self.text = text
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self):
        step = self.chunk_size - self.overlap
        if step <= 0:
            raise ValueError('overlap must be smaller than chunk_size')
        chunks = []
        start = 0
        text_length = len(self.text)
        while start < text_length:
            end = min(start + self.chunk_size, text_length)
            chunks.append(self.text[start:end])
            if end == text_length:
                break  # avoid emitting a redundant trailing chunk
            start += step
        return chunks

# Sample usage
loader = DocumentLoader('sample.txt')
text = loader.load()
chunker = Chunker(text, chunk_size=100, overlap=20)
chunks = chunker.chunk()

print(f'Total chunks created: {len(chunks)}')
print('First chunk preview:', chunks[0][:50])
Important Notes

Overlap helps keep context between chunks but increases total data size.

Chunk size depends on the AI model's input limits and the document's nature.

Always check the chunk boundaries to avoid cutting sentences awkwardly.
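One way to respect sentence boundaries is to pack whole sentences into chunks rather than cutting at a fixed character offset. A minimal sketch (sentence_chunks is a hypothetical helper; the regex is a rough sentence splitter, not a robust one):

```python
import re

def sentence_chunks(text, max_chars=200):
    # Split on sentence-ending punctuation followed by whitespace,
    # then greedily pack whole sentences into chunks of at most
    # max_chars (a single sentence longer than max_chars still
    # becomes its own chunk).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks

text = "Chunking splits documents. Overlap keeps context. Size depends on the model."
for c in sentence_chunks(text, max_chars=50):
    print(c)
# Prints two chunks, each ending at a sentence boundary.
```

Compared with fixed-size splitting, this trades exact chunk sizes for cleaner boundaries, which usually helps downstream summarization and question answering.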

Summary

Chunking breaks big documents into smaller, manageable pieces.

Overlap keeps context between chunks for better AI understanding.

Choosing chunk size and overlap depends on your AI model and task.