How to Evaluate a Text Classifier in NLP: Metrics and Examples
To evaluate a text classifier in NLP, use metrics such as accuracy, precision, recall, and F1 score to measure how well the model predicts labels. These metrics compare the model's predictions to the true labels, and each one highlights a different aspect of performance.
Syntax
Use functions from a library such as `sklearn.metrics` to calculate evaluation metrics. The main functions are:
- `accuracy_score(y_true, y_pred)`: fraction of predictions that are correct.
- `precision_score(y_true, y_pred)`: fraction of predicted positives that are actually positive.
- `recall_score(y_true, y_pred)`: fraction of actual positives that the model finds.
- `f1_score(y_true, y_pred)`: harmonic mean of precision and recall.
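```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: true labels, y_pred: predicted labels
# (small illustrative values so the snippet runs on its own)
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 2, 2, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# average='weighted' averages per-class scores, weighted by class frequency
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')
```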
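To make the definitions above concrete, the following is a minimal sketch that computes the same four metrics by hand from raw counts for a binary problem. The helper name `binary_metrics` and the sample labels are illustrative assumptions, not part of sklearn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from raw counts.

    Hypothetical helper for illustration only; not an sklearn function.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels, not taken from a real dataset.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here the four metrics differ from one another, which is a reminder that each one answers a different question about the same predictions.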
Example
This example shows how to train a simple text classifier using sklearn and evaluate it with common metrics.
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load dataset
categories = ['alt.atheism', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories,
                          remove=('headers', 'footers', 'quotes'))

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

# Convert text to numbers
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Predict
y_pred = clf.predict(X_test_tfidf)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")
```
Output
Accuracy: 0.930
Precision: 0.930
Recall: 0.930
F1 Score: 0.930
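The example above passes average='weighted' without showing what that choice does. As a rough sketch, the block below computes per-class precision by hand and then combines it in the three ways that correspond to sklearn's 'macro', 'weighted', and 'micro' options. The helper `per_class_precision` and the three-class labels are illustrative assumptions.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Return {label: precision} computed from raw counts (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    result = {}
    for label in labels:
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        result[label] = correct / predicted if predicted else 0.0
    return result


# Illustrative three-class labels with an uneven class distribution.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 2]

precisions = per_class_precision(y_true, y_pred)
support = Counter(y_true)  # number of true samples per class

# Macro: plain mean over classes, so rare classes count as much as common ones.
macro = sum(precisions.values()) / len(precisions)
# Weighted: mean over classes, weighted by how often each class truly occurs.
weighted = sum(precisions[c] * support[c] for c in precisions) / len(y_true)
# Micro: pool all decisions; for single-label data this equals accuracy.
micro = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(f"Macro:    {macro:.3f}")
print(f"Weighted: {weighted:.3f}")
print(f"Micro:    {micro:.3f}")
```

The three averages disagree on the same predictions, so it is worth stating which one a reported score uses. Macro averaging is often preferred when performance on rare classes matters as much as on common ones.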
Common Pitfalls
Common mistakes when evaluating text classifiers include:
- Using accuracy alone on imbalanced data. It can be misleading because it ignores class distribution: a model that always predicts the majority class still scores high.
- Not specifying the `average` parameter in precision, recall, and F1 for multi-class problems. The default, `average='binary'`, raises a `ValueError` when the labels have more than two classes.
- Evaluating on training data instead of separate test data, which leads to overly optimistic scores.
- Ignoring the confusion matrix, which shows what types of errors the model makes.
```python
from sklearn.metrics import precision_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1]

# These labels are binary, so average='binary' (the default) is valid
# and scores the positive class (label 1).
# With more than two classes, leaving average at its default raises a
# ValueError; pass average='macro', 'micro', or 'weighted' instead.
precision = precision_score(y_true, y_pred, average='binary')
print(f"Precision: {precision:.2f}")
```
Output
Precision: 1.00
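The first pitfall in the list, relying on accuracy with imbalanced data, can be illustrated with a small sketch. The labels below are made up for illustration: 95 negatives and 5 positives, scored against a degenerate "model" that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Illustrative imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# zero_division=0 suppresses the undefined-metric warning that some
# versions emit when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

Accuracy looks strong here even though the model never finds a single positive example, which is why recall and F1 are worth reporting alongside it.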
Quick Reference
Summary of key evaluation metrics for text classifiers:
| Metric | What it Measures | Range | Best Value |
|---|---|---|---|
| Accuracy | Fraction of all predictions that are correct | 0 to 1 | 1 (100%) |
| Precision | Fraction of predicted positives that are truly positive | 0 to 1 | 1 (100%) |
| Recall | Fraction of actual positives that the model finds | 0 to 1 | 1 (100%) |
| F1 Score | Harmonic mean of precision and recall | 0 to 1 | 1 (100%) |
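If you want precision, recall, and F1 for every class in one call, sklearn's `classification_report` is one option. The sketch below uses small illustrative labels, and the class names passed to `target_names` are hypothetical.

```python
from sklearn.metrics import classification_report

# Illustrative binary labels, not taken from a real dataset.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1]

# target_names are hypothetical display names for labels 0 and 1.
report = classification_report(
    y_true, y_pred,
    target_names=["negative", "positive"],
    zero_division=0,
)
print(report)
```

The report also lists the support (number of true samples) per class, which helps spot imbalance at a glance.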
Key Takeaways
Use accuracy, precision, recall, and F1 score together to get a full picture of classifier performance.
Always evaluate on separate test data to avoid biased results.
Specify the correct averaging method for multi-class classification metrics.
Accuracy alone can be misleading on imbalanced datasets.
Check the confusion matrix to understand which kinds of errors the model makes, beyond what the summary metrics show.
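As a minimal sketch of the last takeaway, sklearn's `confusion_matrix` breaks predictions down by true and predicted class. The labels below are illustrative values for a binary problem, not results from the example above.

```python
from sklearn.metrics import confusion_matrix

# Illustrative binary labels, not taken from a real dataset.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Reading the off-diagonal cells shows whether the model mostly produces false positives or false negatives, which a single score such as F1 cannot tell you.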
