Text chunking breaks text into meaningful parts such as phrases. The key metrics are Precision, Recall, and F1-score. Precision is the fraction of predicted chunks that are correct. Recall is the fraction of actual chunks that were found. F1-score is the harmonic mean of the two. These matter because a chunker must find the correct parts without missing real chunks or adding wrong ones.
Text chunking strategies in Prompt Engineering / GenAI - Model Metrics & Evaluation
                | Predicted Chunk | Predicted No Chunk
----------------+-----------------+-------------------
Actual Chunk    | TP = 80         | FN = 20
Actual No Chunk | FP = 15         | TN = 85
----------------+-----------------+-------------------
Total samples = 200
Here, TP means correctly found chunks, FP means wrongly found chunks, FN means missed chunks, and TN means correctly ignored parts.
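As a quick check, the three metrics can be computed directly from the counts in the table above. This is a minimal sketch; the variable names are illustrative and accuracy is included only for comparison.

```python
# Counts taken from the confusion matrix above.
tp, fn, fp, tn = 80, 20, 15, 85

precision = tp / (tp + fp)  # share of predicted chunks that are correct
recall = tp / (tp + fn)     # share of actual chunks that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"Precision: {precision:.3f}")  # 0.842
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1-score:  {f1:.3f}")         # 0.821
print(f"Accuracy:  {accuracy:.3f}")   # 0.825
```

With these counts the chunker clears the 0.8 mark on both precision and recall, with an F1-score of about 0.82.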
If you want to avoid wrong chunks (high precision), you may miss some correct chunks (lower recall). For example, in medical text, wrong chunks can confuse diagnosis, so high precision is key.
If you want to find all chunks (high recall), you may include wrong chunks (lower precision). For example, in search engines, finding all possible phrases is important even if some are wrong.
Good: Precision and Recall both above 0.8, F1-score near 0.85 or higher. This means most chunks found are correct and most correct chunks are found.
Bad: Precision or Recall below 0.5 means many wrong chunks or many missed chunks. F1-score below 0.6 shows poor balance and unreliable chunking.
- Accuracy paradox: Accuracy can be high when most of the text is not a chunk, even if chunk detection itself is poor.
- Data leakage: Including test text in the training data falsely inflates the metrics.
- Overfitting: Very high training metrics with low test metrics mean the model has memorized chunks rather than learned to generalize.
Your chunking model has 98% accuracy but 12% recall on chunks. Is it good?
Answer: No. The model misses most chunks (low recall), so it is not useful. The high accuracy comes from the many non-chunk parts, which dominate the data.
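To see how such numbers can arise, here is a small sketch with hypothetical counts. They are not from the original example; they were chosen only so that accuracy comes out at 98% and recall at 12%.

```python
# Hypothetical counts (an assumption for illustration, not source data):
# 10,000 text units, of which only 200 are real chunks.
tp, fn, fp, tn = 24, 176, 24, 9776

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

# A baseline that never predicts a chunk gets every non-chunk unit right.
baseline_accuracy = (fp + tn) / total

print(f"Accuracy:  {accuracy:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Never-predict-chunk baseline accuracy: {baseline_accuracy:.2f}")
```

Under these assumed counts, a model that never predicts a chunk at all reaches the same accuracy, which is why accuracy alone says little about chunk detection on imbalanced data.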
Practice
text chunking in AI models?
Solution
Step 1: Understand the concept of text chunking
Text chunking means breaking a long text into smaller parts so it is easier to handle.
Step 2: Identify the main goal in AI context
This helps AI models process and understand large texts better by working on smaller pieces.
Final Answer:
To split long text into smaller, manageable pieces -> Option B
Quick Check:
Text chunking = splitting text [OK]
- Confusing chunking with translation
- Thinking chunking removes words
- Believing chunking generates new text
Solution
Step 1: Understand overlapping chunk logic
To create overlapping chunks, the step size must be smaller than the chunk size by the overlap amount.
Step 2: Check the range step in options
chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size - overlap)] uses chunk_size - overlap as the step, correctly creating overlaps.
Final Answer:
chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size - overlap)] -> Option C
Quick Check:
Overlap step = chunk_size - overlap [OK]
- Using chunk_size as step (no overlap)
- Using overlap as step (too small steps)
- Starting range at overlap instead of zero
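The overlap rule can be wrapped in a small helper. This is a minimal sketch, not a reference implementation: the function name chunk_text and its validation checks are assumptions added for illustration.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would give a zero or negative step
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` characters later
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Because the range runs to len(text), the final chunk can be shorter than chunk_size. Whether to keep or drop that tail is a design choice that depends on the downstream use.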
text = 'abcdefghij', chunk_size = 4, and overlap = 2, what is the output of this code?
chunks = [text[i:i+chunk_size] for i in range(0, len(text)-overlap, chunk_size - overlap)]
print(chunks)
Solution
Step 1: Calculate step size
Step = chunk_size - overlap = 4 - 2 = 2.
Step 2: Generate chunks using step 2
The range stops before len(text) - overlap = 10 - 2 = 8, so i takes the values 0, 2, 4, 6. The chunks are:
i=0: text[0:4] = 'abcd'
i=2: text[2:6] = 'cdef'
i=4: text[4:8] = 'efgh'
i=6: text[6:10] = 'ghij'
Final Answer:
['abcd', 'cdef', 'efgh', 'ghij'] -> Option A
Quick Check:
Chunks overlap by 2 chars = ['abcd', 'cdef', 'efgh', 'ghij'] [OK]
- Ignoring overlap and stepping by chunk size
- Wrong slicing indices
- Confusing overlap with chunk size
chunk_size = 5
overlap = 2
chunks = []
for i in range(0, len(text), chunk_size + overlap):
    chunks.append(text[i:i+chunk_size])
print(chunks)
What is the error?
Solution
Step 1: Understand step size for overlapping chunks
To create overlap, the step size must be smaller than the chunk size by the overlap amount.
Step 2: Identify incorrect step in code
The code uses chunk_size + overlap as the step. The step is then larger than the chunk size, so consecutive chunks do not overlap and the characters between them are skipped.
Final Answer:
Step size should be chunk_size - overlap, not chunk_size + overlap -> Option A
Quick Check:
Overlap step = chunk_size - overlap [OK]
- Adding overlap instead of subtracting
- Setting overlap to zero incorrectly
- Changing loop start index wrongly
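The effect of the wrong step is easy to see by running both versions side by side. The sample string below is a hypothetical stand-in, since the original value of text is not shown.

```python
text = "abcdefghijklmnop"  # hypothetical sample; the original text is not shown
chunk_size = 5
overlap = 2

# Buggy: step = chunk_size + overlap = 7, larger than the chunk itself.
buggy = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size + overlap)]

# Fixed: step = chunk_size - overlap = 3, so neighbours share 2 characters.
fixed = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]

print(buggy)  # ['abcde', 'hijkl', 'op']
print(fixed)  # ['abcde', 'defgh', 'ghijk', 'jklmn', 'mnop', 'p']
```

In the buggy output the characters 'f', 'g', 'm', and 'n' never appear in any chunk, while in the fixed output each chunk begins with the last two characters of the previous one.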
Solution
Step 1: Define chunk and step sizes for overlap
Chunk size is 100 words, overlap is 20 words, so step size = 100 - 20 = 80.
Step 2: Choose correct step size to maintain overlap
Step size 80 means each chunk starts 80 words after the previous one, overlapping by 20 words.
Final Answer:
Use chunk size 100 and step size 80 (100 - 20) to create overlapping chunks -> Option D
Quick Check:
Step = chunk size - overlap = 80 [OK]
- Using step size larger than chunk size
- Setting overlap to zero accidentally
- Confusing chunk size with step size
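The same rule applies when chunking by words instead of characters. The sketch below assumes a hypothetical 250-word document and an illustrative helper named chunk_words. Real text would first need to be split into words, for example with str.split().

```python
def chunk_words(words: list[str], chunk_size: int = 100, overlap: int = 20) -> list[list[str]]:
    """Group words into chunks of `chunk_size` that share `overlap` words."""
    step = chunk_size - overlap  # 100 - 20 = 80 with the defaults
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


# Hypothetical document: 250 placeholder words w0 ... w249.
words = [f"w{n}" for n in range(250)]
chunks = chunk_words(words)

print([chunk[0] for chunk in chunks])      # ['w0', 'w80', 'w160', 'w240']
print(chunks[0][-20:] == chunks[1][:20])   # True: 20 shared words
```

Each chunk starts 80 words after the previous one, and the last chunk holds only the remaining 10 words.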
