Text chunking splits text into meaningful units such as phrases. The key evaluation metrics are Precision, Recall, and F1-score. Precision measures what fraction of the predicted chunks are actually correct; Recall measures what fraction of the true chunks were found; F1-score is the harmonic mean of the two, balancing them. These matter because a chunker must find the correct parts without missing true chunks or inventing wrong ones.
Text chunking strategies in Prompt Engineering / GenAI - Model Metrics & Evaluation
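The three metric definitions can be sketched as small helper functions (a minimal sketch; the function and parameter names are my own, not from any particular library):

```python
def precision(tp: int, fp: int) -> float:
    # Of everything predicted as a chunk, what fraction was right?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all true chunks, what fraction did we find?
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p: float, r: float) -> float:
    # Harmonic mean: punishes a large gap between precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

The zero-denominator guards matter in practice: a model that predicts no chunks at all would otherwise divide by zero.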
                | Predicted Chunk | Predicted No Chunk
----------------|-----------------|-------------------
Actual Chunk    | TP = 80         | FN = 20
Actual No Chunk | FP = 15         | TN = 85

Total samples = 200
Here, TP (true positive) counts correctly found chunks, FP (false positive) counts wrongly found chunks, FN (false negative) counts missed chunks, and TN (true negative) counts non-chunk parts correctly ignored.
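Plugging the counts from the table above into the standard formulas gives concrete numbers (a quick self-contained sketch):

```python
tp, fn, fp, tn = 80, 20, 15, 85  # counts from the confusion matrix above

prec = tp / (tp + fp)                      # 80 / 95  ≈ 0.842
rec  = tp / (tp + fn)                      # 80 / 100 = 0.800
f1   = 2 * prec * rec / (prec + rec)       # ≈ 0.821
acc  = (tp + tn) / (tp + fn + fp + tn)     # 165 / 200 = 0.825

print(f"precision={prec:.3f} recall={rec:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```

By the thresholds discussed below, this model sits in the "good" range: both precision and recall are at or above 0.8.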
If you want to avoid wrong chunks (high precision), you may miss some correct chunks (lower recall). For example, in medical text, wrong chunks can confuse diagnosis, so high precision is key.
If you want to find all chunks (high recall), you may include wrong chunks (lower precision). For example, in search engines, finding all possible phrases is important even if some are wrong.
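The trade-off in the two paragraphs above can be illustrated by sweeping a decision threshold over chunk-confidence scores. Everything below (the scores, labels, and threshold values) is made up purely for illustration:

```python
# Hypothetical (confidence score, is_true_chunk) pairs from a chunker.
candidates = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
              (0.60, False), (0.50, True), (0.40, False), (0.30, True)]

total_true = sum(is_chunk for _, is_chunk in candidates)

for threshold in (0.85, 0.45):
    # Keep only candidates whose score clears the threshold.
    accepted = [is_chunk for score, is_chunk in candidates if score >= threshold]
    tp = sum(accepted)            # accepted candidates that are real chunks
    fp = len(accepted) - tp       # accepted candidates that are not
    fn = total_true - tp          # real chunks we rejected
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, the strict threshold (0.85) yields perfect precision but low recall, while the loose threshold (0.45) raises recall at the cost of precision, mirroring the medical-text versus search-engine examples above.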
Good: Precision and Recall both above 0.8, with an F1-score around 0.85 or higher. This means most predicted chunks are correct and most true chunks are found.
Bad: Precision or Recall below 0.5 indicates many wrong chunks or many missed chunks; an F1-score below 0.6 signals poor balance and unreliable chunking.
- Accuracy paradox: Accuracy can be high simply because most of the text is non-chunk, even when chunk detection is poor.
- Data leakage: Letting test text appear in the training data falsely inflates metrics.
- Overfitting: Very high training metrics with low test metrics means the model memorizes chunks rather than generalizing.
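The accuracy paradox in the first bullet can be demonstrated with a trivial baseline that never predicts a chunk. The labels below are synthetic, chosen only to show the imbalance effect:

```python
# 1000 positions, but only 2% are actual chunks (heavily imbalanced).
actual = [True] * 20 + [False] * 980

# A useless "model" that always predicts "no chunk".
predicted = [False] * 1000

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
tp = sum(a and p for a, p in zip(actual, predicted))
recall = tp / sum(actual)

print(f"accuracy={accuracy:.0%} recall={recall:.0%}")  # 98% accuracy, 0% recall
```

Despite 98% accuracy, the baseline finds zero chunks, which is why precision, recall, and F1 are reported instead of accuracy alone on imbalanced data.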
Your chunking model has 98% accuracy but 12% recall on chunks. Is it good?
Answer: No. The model misses most chunks (12% recall), so it is not useful; the high accuracy comes mostly from the abundance of no-chunk parts.
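The numbers in the question can be reproduced with one plausible imbalanced split. The counts below are an assumption consistent with 98% accuracy and 12% recall, not figures given in the question:

```python
# Assumed split over 10,000 samples: 100 actual chunks, 9,900 non-chunks.
tp, fn = 12, 88       # recall = 12 / 100 = 12% on actual chunks
tn, fp = 9788, 112    # the non-chunk majority, mostly classified correctly

accuracy = (tp + tn) / (tp + fn + tn + fp)   # 9800 / 10000 = 0.98
recall   = tp / (tp + fn)                    # 0.12

print(f"accuracy={accuracy:.0%} recall={recall:.0%}")
```

The 9,788 true negatives dominate the accuracy figure while the model finds only 12 of 100 real chunks, which is exactly the accuracy paradox described in the pitfalls above.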