
How to Evaluate a Text Classifier in NLP: Metrics and Examples

To evaluate a text classifier in NLP, compare its predictions against the true labels using metrics such as accuracy, precision, recall, and F1 score. Each metric captures a different aspect of performance, so together they show both how often the model is right and what kinds of mistakes it makes.

📐

Syntax

Use the functions in sklearn.metrics to calculate evaluation metrics. For multi-class labels, precision, recall, and F1 also need an average argument (for example average='weighted'). The main functions are:

accuracy_score(y_true, y_pred): fraction of all predictions that are correct.
precision_score(y_true, y_pred): fraction of predicted positives that are actually positive.
recall_score(y_true, y_pred): fraction of actual positives that the model finds.
f1_score(y_true, y_pred): harmonic mean of precision and recall.

python

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

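# y_true: true labels, y_pred: predicted labels (small made-up example)
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# average='weighted' averages per-class scores, weighted by class frequency
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')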
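To make the definitions above concrete, here is a minimal sketch that computes precision, recall, and F1 by hand from true-positive, false-positive, and false-negative counts, then checks the results against sklearn. The labels are made up for illustration, and class 1 is assumed to be the positive class.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels (hypothetical data; 1 is the positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Count outcomes for the positive class
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"TP={tp} FP={fp} FN={fn}")
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")

# The hand-computed values should match sklearn's
assert abs(precision - precision_score(y_true, y_pred)) < 1e-9
assert abs(recall - recall_score(y_true, y_pred)) < 1e-9
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 0.75.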

💻

Example

This example shows how to train a simple text classifier using sklearn and evaluate it with common metrics.

python

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load dataset
categories = ['alt.atheism', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split data
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Convert text to numbers
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Predict
y_pred = clf.predict(X_test_tfidf)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")

Output

Accuracy: 0.930
Precision: 0.930
Recall: 0.930
F1 Score: 0.930

⚠️

Common Pitfalls

Common mistakes when evaluating text classifiers include:

Relying on accuracy alone with imbalanced data: a model that always predicts the majority class can score high accuracy while missing every minority-class example.
Omitting the average parameter in precision, recall, and F1 for multi-class problems: the default average='binary' raises a ValueError when there are more than two classes.
Evaluating on training data instead of separate test data, which gives overly optimistic scores.
Skipping the confusion matrix, which shows which classes are being confused with each other.
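The first and last pitfalls can be seen together in a small sketch. It assumes a made-up imbalanced test set and a degenerate "classifier" that always predicts the majority class; neither comes from the example above.

```python
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Hypothetical imbalanced test set: 18 negatives (0) and 2 positives (1)
y_true = [0] * 18 + [1] * 2
# A degenerate model that always predicts the majority class
y_pred = [0] * 20

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(f"Accuracy: {accuracy:.2f}")                # 0.90
print(f"Recall (class 1): {minority_recall:.2f}")  # 0.00
print(cm)  # rows = true class, columns = predicted class
```

Accuracy is 0.90 even though the model never finds a positive example. The confusion matrix makes this visible: the second row shows both positives landing in the "predicted 0" column.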

python

from sklearn.metrics import precision_score

# Three classes, so this is a multi-class problem
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 2, 0, 1, 1]

# Wrong: the default average='binary' raises ValueError for multi-class labels
# precision = precision_score(y_true, y_pred)

# Right: specify how per-class scores are averaged
precision = precision_score(y_true, y_pred, average='macro')
print(f"Precision: {precision:.2f}")

Output

Precision: 0.67
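The choice of averaging method matters most when classes are imbalanced. As a rough illustration, the sketch below compares 'macro', 'micro', and 'weighted' precision on a made-up three-class label set in which class 0 dominates.

```python
from sklearn.metrics import precision_score

# Hypothetical imbalanced labels: six of class 0, three of class 1, one of class 2
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2]

# average=None returns one precision value per class (classes 0, 1, 2)
per_class = precision_score(y_true, y_pred, average=None)

# macro: unweighted mean of the per-class scores
macro = precision_score(y_true, y_pred, average='macro')
# micro: computed from global counts of true and false positives
micro = precision_score(y_true, y_pred, average='micro')
# weighted: mean of per-class scores, weighted by each class's true count
weighted = precision_score(y_true, y_pred, average='weighted')

print("Per class:", per_class)
print(f"macro: {macro:.3f}")        # macro: 0.867
print(f"micro: {micro:.3f}")        # micro: 0.800
print(f"weighted: {weighted:.3f}")  # weighted: 0.880
```

Macro averaging treats every class equally, so it is the most sensitive to poor performance on rare classes. Weighted averaging is pulled toward the score of the most frequent class.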

📊

Quick Reference

Summary of key evaluation metrics for text classifiers:

Metric	What It Measures	Range	Best Value
Accuracy	Fraction of all predictions that are correct	0 to 1	1 (100%)
Precision	Fraction of predicted positives that are correct	0 to 1	1 (100%)
Recall	Fraction of actual positives that are found	0 to 1	1 (100%)
F1 Score	Harmonic mean of precision and recall	0 to 1	1 (100%)
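The metrics in the table can also be obtained per class in a single call with sklearn's classification_report. The sketch below uses made-up "spam"/"ham" labels purely for illustration.

```python
from sklearn.metrics import classification_report

# Hypothetical labels for a two-class spam filter
y_true = ["spam", "ham", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham"]

# Human-readable table: precision, recall, F1, and support for each class
print(classification_report(y_true, y_pred))

# The same numbers as a nested dict, convenient for logging or assertions
report = classification_report(y_true, y_pred, output_dict=True)
print(f"Spam recall: {report['spam']['recall']:.2f}")
print(f"Ham recall: {report['ham']['recall']:.2f}")
print(f"Accuracy: {report['accuracy']:.2f}")
```

The printed report also includes macro and weighted averages, which makes it a quick way to spot a class that scores much worse than the rest.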

✅

Key Takeaways

Use accuracy, precision, recall, and F1 score together to get a full picture of classifier performance.

Always evaluate on separate test data to avoid biased results.

Specify the correct averaging method for multi-class classification metrics.

Accuracy alone can be misleading on imbalanced datasets.

Check the confusion matrix to see which classes are confused with each other, which summary metrics alone do not show.
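The point about held-out test data can be illustrated with a short sketch. It makes two assumptions that are not part of the example earlier in this article: synthetic numeric features from make_classification stand in for vectorized text, and an unpruned DecisionTreeClassifier replaces Naive Bayes because it memorizes its training set, which makes the gap easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for vectorized text; flip_y adds label noise
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unpruned decision tree can fit its training data perfectly
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Score the same model on data it has seen and on data it has not
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The training score reflects memorization rather than generalization, so only the test score is a fair estimate of how the model will behave on new text.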