One-hot encoding turns words into numeric vectors so that models can process text. It does not build a model by itself; it only prepares the data. When a model is trained on one-hot encoded text, standard metrics such as accuracy, precision, and recall are used to check how well it learns from that data. The right metric depends on the task: accuracy works well for balanced classes, while precision and recall matter more when classes are imbalanced.
## One-hot encoding for text in NLP - Model Metrics & Evaluation
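Before looking at metrics, here is a minimal sketch of the encoding itself: each word in a (toy, assumed) vocabulary gets a vector with a single 1 at its own index and 0s everywhere else.

```python
# Minimal one-hot encoding sketch. The vocabulary below is a made-up example.
vocab = ["spam", "free", "hello", "meeting"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the vocabulary."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("free"))  # [0, 1, 0, 0]
```

Each vector has exactly one nonzero entry, which is why the representation carries no notion of word similarity; it only gives every word a distinct numeric identity.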
Imagine a text classification model using one-hot encoded words to detect spam emails. Here is a confusion matrix from testing:
|                 | Predicted Spam | Predicted Not Spam |
|-----------------|----------------|--------------------|
| Actual Spam     | 40 (TP)        | 10 (FN)            |
| Actual Not Spam | 5 (FP)         | 45 (TN)            |
Total samples = 40 + 10 + 5 + 45 = 100
Precision tells us what fraction of emails marked as spam really are spam. Recall tells us what fraction of actual spam emails we found.
For spam detection, high precision means fewer good emails wrongly marked as spam (important to avoid losing important messages). High recall means catching most spam emails (important to keep the inbox clean).
Sometimes improving precision lowers recall and vice versa. Choosing which to focus on depends on what is worse: missing spam or wrongly blocking good emails.
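These definitions can be applied directly to the confusion matrix above (TP = 40 spam caught, FN = 10 spam missed, FP = 5 good emails flagged, TN = 45 good emails passed):

```python
# Metrics computed from the confusion matrix in the example above.
TP, FN, FP, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + FN + FP + TN)          # 85 / 100
precision = TP / (TP + FP)                          # 40 / 45
recall = TP / (TP + FN)                             # 40 / 50
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.85 precision=0.89 recall=0.80 f1=0.84
```

Note how accuracy (0.85), precision (0.89), and recall (0.80) disagree even on this fairly balanced example; on imbalanced data the gap can be far larger.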
Good values:
- Accuracy above 85% on balanced data
- Precision and recall both above 80% for important classes
- F1 score (harmonic mean of precision and recall) above 0.8
Bad values:
- Accuracy near 50% on balanced data (like guessing)
- Precision very low (many false alarms)
- Recall very low (many missed cases)
- F1 score below 0.5, indicating a poor precision-recall balance
Common pitfalls:
- Accuracy paradox: high accuracy can be misleading if classes are imbalanced (e.g., 95% accuracy by always predicting the majority class).
- Data leakage: If test data leaks into training, metrics look better but model fails in real use.
- Overfitting: Very high training accuracy but low test accuracy means the model memorizes the training data instead of generalizing.
- Ignoring class imbalance: Metrics like accuracy hide poor performance on rare classes; use precision, recall, or F1 instead.
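The accuracy paradox from the list above is easy to demonstrate. Assuming a toy split of 95 non-spam and 5 spam samples, a "model" that always predicts the majority class scores 95% accuracy while catching zero spam:

```python
# Accuracy-paradox sketch with an assumed 95/5 class split.
y_true = ["spam"] * 5 + ["not_spam"] * 95
y_pred = ["not_spam"] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the spam class: spam caught / all actual spam.
tp = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))
fn = sum(t == "spam" and p != "spam" for t, p in zip(y_true, y_pred))
recall_spam = tp / (tp + fn)

print(accuracy, recall_spam)  # 0.95 0.0
```

This is exactly why per-class precision, recall, or F1 must be checked alongside accuracy whenever one class is rare.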
Question: Your text classification model using one-hot encoding has 98% accuracy but only 12% recall on the spam class. Is it good for production? Why or why not?
Answer: No, it is not. The model misses most spam emails (low recall), even though overall accuracy is high. This likely happens because spam is rare, so the model mostly predicts non-spam and is rewarded by accuracy for doing so. For spam detection, missing spam is the costly error, so recall must improve before deployment.