In multi-class text classification, the goal is to assign each text to exactly one category out of several. The key metrics are accuracy, plus precision, recall, and F1-score computed per class and then averaged across classes. Accuracy is the fraction of all predictions that are correct. Precision for a class is the fraction of texts predicted as that class that truly belong to it. Recall for a class is the fraction of texts truly in that class that the model found. F1-score is the harmonic mean of precision and recall. Together, these metrics show whether the model both finds and correctly labels each class.
Multi-class text classification in NLP - Model Metrics & Evaluation
Here is an example confusion matrix for 3 classes: Sports, Politics, and Tech.
                Predicted
              S     P     T
True   S     50     2     3
       P      4    45     1
       T      2     3    48
Explanation:
- 50 texts truly Sports were predicted as Sports (True Positives for Sports)
- 2 Sports texts were wrongly predicted as Politics (False Negatives for Sports)
- 3 Sports texts were wrongly predicted as Tech (False Negatives for Sports)
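- Reading down the Sports column, 4 Politics texts and 2 Tech texts were wrongly predicted as Sports (False Positives for Sports)
- The same reading applies to the Politics and Tech classes.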
From this matrix, we calculate precision, recall, and F1 for each class.
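As a rough illustration of that calculation, the sketch below derives per-class precision, recall, and F1 from the matrix above in plain Python. The function name per_class_metrics is illustrative only, and the macro average assumes that "averaged" means an unweighted mean over classes.

```python
# Confusion matrix from the example: rows are true classes, columns are predictions.
labels = ["Sports", "Politics", "Tech"]
matrix = [
    [50, 2, 3],   # true Sports
    [4, 45, 1],   # true Politics
    [2, 3, 48],   # true Tech
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} for a square confusion matrix."""
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # rest of the row
        fp = sum(row[i] for row in matrix) - tp   # rest of the column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


metrics = per_class_metrics(matrix, labels)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / sum(map(sum, matrix))

# Macro F1: unweighted mean of the per-class F1 scores (an assumed averaging choice).
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(labels)

for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

For Sports this gives precision 50/56 and recall 50/55, because the Sports column sums to 56 and the Sports row sums to 55.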
Imagine a model classifying news articles into categories. If the model has high precision for Sports, it means when it says an article is Sports, it is usually right. But if recall is low, it misses many Sports articles.
For example, if you want to recommend Sports articles only when very sure, prioritize precision. If you want to catch all Sports articles even if some mistakes happen, prioritize recall.
Balancing precision and recall with F1-score helps when both false positives and false negatives matter.
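To make the balancing effect concrete, here is a small sketch that compares F1 with a simple average. The precision and recall values are made up for illustration, and f1_score is a hypothetical helper, not a library function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not measured results.
balanced = f1_score(0.80, 0.80)        # both metrics equal
lopsided = f1_score(0.95, 0.30)        # high precision, low recall
arithmetic_mean = (0.95 + 0.30) / 2    # what a plain average would report

print(f"balanced F1  = {balanced:.3f}")
print(f"lopsided F1  = {lopsided:.3f}")
print(f"plain average = {arithmetic_mean:.3f}")
```

The lopsided case scores well below the plain average, because the harmonic mean is pulled toward the weaker of the two metrics.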
Good metrics:
- Accuracy above 80% on balanced data
- Precision and recall above 75% for each class
- F1-score close to precision and recall, showing balance
Bad metrics:
- Accuracy near random guess (e.g., 33% for 3 classes)
- Very low recall for some classes (missing many texts)
- High precision but very low recall or vice versa (unbalanced)
Common pitfalls:
- Accuracy paradox: High accuracy can be misleading if classes are imbalanced. For example, if one class makes up 90% of the data, always predicting it gives 90% accuracy but zero recall on every other class.
- Ignoring per-class metrics: Overall accuracy hides poor results on minority classes.
- Data leakage: If test data leaks into training, metrics look unrealistically high.
- Overfitting: Very high training accuracy but low test accuracy means model memorizes training data, not generalizing.
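The accuracy paradox from the list above can be reproduced with a few lines of plain Python. The dataset here is invented for illustration (100 texts, 90 of them Sports), and the "model" simply predicts the majority class every time.

```python
# Hypothetical imbalanced dataset: 90 Sports, 5 Politics, 5 Tech.
true_labels = ["Sports"] * 90 + ["Politics"] * 5 + ["Tech"] * 5

# A degenerate "model" that always predicts the majority class.
predictions = ["Sports"] * len(true_labels)

accuracy = sum(t == p for t, p in zip(true_labels, predictions)) / len(true_labels)


def recall_for(label):
    """Fraction of texts truly in `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(true_labels, predictions) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


recalls = {label: recall_for(label) for label in ["Sports", "Politics", "Tech"]}
macro_recall = sum(recalls.values()) / len(recalls)

print(f"accuracy     = {accuracy:.2f}")
print(f"recalls      = {recalls}")
print(f"macro recall = {macro_recall:.2f}")
```

Accuracy looks strong, while per-class recall shows that Politics and Tech are never found, which is why the per-class view matters on imbalanced data.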
Your multi-class text classifier has 98% accuracy but recall for one important class is only 12%. Is this model good for production? Why or why not?
Answer: No, it is not good. The high accuracy likely comes from correctly predicting the majority classes. But the very low recall for the important class means the model misses most texts of that class. This can cause serious problems if that class is critical. You should improve recall for that class before using the model.
Practice
Solution
Step 1: Understand the task of multi-class text classification
This task involves assigning each text sample to one category out of many possible categories.
Step 2: Compare options with the task definition
Only "To sort text into multiple categories based on content" matches this definition.
Final Answer:
To sort text into multiple categories based on content -> Option A
Quick Check:
Multi-class classification = sorting into many categories [OK]
- Confusing classification with translation
- Thinking it counts words instead of categorizing
- Mixing generation with classification
Solution
Step 1: Identify how models process text
Models cannot learn from raw text strings directly; they need numeric features to learn patterns.
Step 2: Check which option converts text to numbers
Only "Converting text into numerical vectors like TF-IDF or embeddings" turns text into numbers, so it is the correct option.
Final Answer:
Converting text into numerical vectors like TF-IDF or embeddings -> Option A
Quick Check:
Text must be converted to numbers for models [OK]
- Feeding raw text directly to models
- Thinking sorting text helps classification
- Ignoring the need for numerical representation
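As a minimal sketch of what "converting text into numerical vectors" means, the following plain-Python bag-of-words counter maps each text to a list of word counts. It is a simplified stand-in, not how TF-IDF or embeddings are computed, and the helper names (tokenize, build_vocabulary, vectorize) are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Return a count vector; tokens missing from the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


texts = ["I love cats", "Dogs are great", "I hate rain"]
vocabulary = build_vocabulary(texts)
vector = vectorize("I love dogs and dogs love me", vocabulary)

print(vocabulary)
print(vector)
```

Words the vocabulary has never seen ("and", "me") contribute nothing to the vector, which is also why unknown words do not cause errors at prediction time.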
What is the output of print(predicted_class) after running the code below?
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny labeled training set
    texts = ["I love cats", "Dogs are great", "I hate rain"]
    labels = ["positive", "positive", "negative"]

    # Turn the training texts into word-count vectors
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)

    # Train a Naive Bayes classifier on the count vectors
    model = MultinomialNB()
    model.fit(X, labels)

    # Vectorize the new text with the same fitted vectorizer, then predict
    new_text = ["I love dogs"]
    X_new = vectorizer.transform(new_text)
    predicted_class = model.predict(X_new)[0]
Solution
Step 1: Understand training data and labels
The model is trained on texts labeled "positive" or "negative": "I love cats" and "Dogs are great" are positive, and "I hate rain" is negative.
Step 2: Predict class for new text "I love dogs"
After vectorization, the new text contributes the tokens "love" and "dogs", which appear only in positive examples. ("I" is dropped because CountVectorizer's default token pattern ignores single-character tokens, and text is lowercased, so "dogs" matches "Dogs".) The model predicts "positive" as the class.
Final Answer:
"positive" -> Option D
Quick Check:
New text matches positive words, so prediction is positive [OK]
- Assuming unknown words cause errors
- Choosing negative because of 'dogs' only
- Picking neutral which is not a trained label
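The prediction can also be checked by hand. The sketch below recomputes the multinomial Naive Bayes scores for "I love dogs" in plain Python, assuming add-one (Laplace) smoothing, priors taken from class frequencies, and a tokenizer that lowercases and keeps tokens of two or more characters. These mirror the scikit-learn defaults as far as this example needs, but the code is an approximation for illustration, not the library's implementation.

```python
import re
from collections import Counter
from fractions import Fraction

texts = ["I love cats", "Dogs are great", "I hate rain"]
labels = ["positive", "positive", "negative"]


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


vocabulary = {tok for text in texts for tok in tokenize(text)}


def class_score(new_text, target):
    """Unnormalized score: class prior times smoothed word likelihoods."""
    docs = [t for t, y in zip(texts, labels) if y == target]
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total = sum(counts.values())
    score = Fraction(len(docs), len(texts))  # prior from class frequency
    for tok in tokenize(new_text):
        if tok in vocabulary:  # words never seen in training are ignored
            # Add-one smoothing: (count + 1) / (total + vocabulary size)
            score *= Fraction(counts[tok] + 1, total + len(vocabulary))
    return score


scores = {label: class_score("I love dogs", label) for label in sorted(set(labels))}
predicted = max(scores, key=scores.get)

print(scores)
print(predicted)
```

Under these assumptions the positive score is 1/54 and the negative score is 1/243, so "positive" wins by a clear margin.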
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["happy day", "sad night", "joyful morning"]
    labels = ["positive", "negative", "positive"]

    # Vectorize the training texts with TF-IDF
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)

    # Train the classifier
    model = LogisticRegression()
    model.fit(texts, labels)
Solution
Step 1: Check input to model.fit()
The model.fit() method expects numerical features, but raw texts are passed instead of vectorized data.
Step 2: Identify correct input
The vectorized data X should be passed to model.fit, not the original texts.
Final Answer:
Passing raw texts instead of vectorized data to model.fit -> Option B
Quick Check:
Model needs numbers, not raw text, for training [OK]
- Passing raw text directly to model.fit
- Thinking label type causes error here
- Believing vectorizer choice causes this error
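For comparison, a corrected version of the snippet might look like the sketch below, assuming scikit-learn is installed. The only change to training is passing X to fit; the new_text sample is an invented example added to show that prediction must reuse the same fitted vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["happy day", "sad night", "joyful morning"]
labels = ["positive", "negative", "positive"]

# Convert the raw texts into a TF-IDF feature matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Fix: train on the numeric matrix X, not on the raw strings
model = LogisticRegression()
model.fit(X, labels)

# Hypothetical new sample: transform (not fit_transform) with the fitted vectorizer
new_text = ["happy morning"]
X_new = vectorizer.transform(new_text)
prediction = model.predict(X_new)[0]
print(prediction)
```

Using transform rather than fit_transform on new text keeps the feature columns aligned with the ones the model was trained on.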
Solution
Step 1: Understand class imbalance impact
Imbalanced classes cause models to favor majority classes, reducing recall on minority classes.
Step 2: Identify best practice to handle imbalance
Class weighting makes errors on minority classes count more during training, and oversampling adds more minority examples; both help the model learn all classes better.
Final Answer:
Use class weighting or oversampling to balance training data -> Option C
Quick Check:
Balance data to improve multi-class model accuracy [OK]
- Ignoring imbalance and expecting good results
- Removing minority classes loses valuable data
- Predicting only the majority class ignores others
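One common way to set class weights is inversely proportional to class frequency. The sketch below computes such weights in plain Python for an invented label distribution. The formula n_samples / (n_classes * count) is the heuristic scikit-learn documents for class_weight="balanced", but the code here is a standalone illustration rather than a call into the library.

```python
from collections import Counter

# Hypothetical imbalanced training labels: 80 Sports, 15 Politics, 5 Tech.
labels = ["Sports"] * 80 + ["Politics"] * 15 + ["Tech"] * 5

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Inverse-frequency weights: rarer classes get larger weights.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

# Total weight per class after weighting: every class contributes equally.
weighted_totals = {label: class_weights[label] * counts[label] for label in counts}

for label in counts:
    print(f"{label}: count={counts[label]} weight={class_weights[label]:.3f}")
print(weighted_totals)
```

With these weights each class carries the same total weight in training, so mistakes on the five Tech texts matter as much overall as mistakes on the eighty Sports texts.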
