Introduction
When working with large amounts of text, it can be hard to process or understand everything at once. Breaking text into smaller, manageable pieces helps computers and people handle information more easily and accurately.
Imagine you have a long storybook to share with friends. You can cut it into equal stacks of pages, split it by chapters, group parts by theme, or give two friends a few of the same sentences so that the story stays connected.
                             ┌───────────────┐
                             │   Full Text   │
                             └───────┬───────┘
                                     │
                                     ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│   Fixed-size   │ │ Sentence-based │ │    Semantic    │ │    Overlap     │
│     chunks     │ │     chunks     │ │     chunks     │ │     chunks     │
│ [equal parts]  │ │ [by sentences] │ │  [by meaning]  │ │ [shared text]  │
└────────────────┘ └────────────────┘ └────────────────┘ └────────────────┘

Overlap chunking moves the window forward by chunk_size - overlap at each step, so each chunk starts with the last overlap characters of the previous one.

Given text = 'abcdefghij', chunk_size = 4, and overlap = 2, what is the output of this code?

chunks = [text[i:i+chunk_size] for i in range(0, len(text)-overlap, chunk_size - overlap)]
print(chunks)

It prints ['abcd', 'cdef', 'efgh', 'ghij']. The code uses chunk_size - overlap as the step (here 2), which creates the overlaps correctly.
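The first two strategies in the diagram, fixed-size and sentence-based chunking, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names fixed_size_chunks and sentence_chunks, the max_sentences parameter, and the sample story are all assumptions made for this example, and the sentence splitter is deliberately naive (it breaks after ".", "!" or "?" followed by whitespace, so abbreviations such as "Dr." would be split incorrectly).

```python
import re


def fixed_size_chunks(text, chunk_size):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def sentence_chunks(text, max_sentences=2):
    """Group whole sentences into chunks of up to max_sentences each."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


story = "The fox ran. The dog barked! Did the cat care? It slept."
print(fixed_size_chunks("abcdefghij", 4))  # ['abcd', 'efgh', 'ij']
print(sentence_chunks(story, 2))
```

Fixed-size chunks may cut a word or sentence in half, while sentence-based chunks keep each sentence intact at the cost of uneven chunk lengths.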
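A common mistake is to step by chunk_size + overlap. That step jumps past part of the text, so the chunks have gaps between them instead of overlapping. The step must be chunk_size - overlap:

text = 'abcdefghij'  # example input
chunk_size = 5
overlap = 2
chunks = []
# Step by chunk_size - overlap so each chunk repeats the last
# `overlap` characters of the previous chunk.
for i in range(0, len(text) - overlap, chunk_size - overlap):
    chunks.append(text[i:i+chunk_size])
print(chunks)  # ['abcde', 'defgh', 'ghij']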
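The overlap logic can be wrapped in a small reusable function that also guards against a step of zero or less. The sketch below is one possible way to do this; the name chunk_with_overlap and its validation rules are assumptions for illustration and do not come from any particular library.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into chunks of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would make the step zero or negative
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    # Stop early enough that the last chunk is not just a repeat of the
    # previous chunk's tail, but still emit one chunk for very short text.
    stop = max(len(text) - overlap, 1) if text else 0
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


print(chunk_with_overlap("abcdefghij", 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij']
print(chunk_with_overlap("abcdefghij", 5, 2))  # ['abcde', 'defgh', 'ghij']
```

Rejecting overlap >= chunk_size up front avoids a loop that never advances through the text, and setting overlap to 0 reduces the function to plain fixed-size chunking.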