Multi-class text classification assigns each text to one of several categories. This makes large collections of text easier to organize and understand.
Multi-class text classification in NLP
Introduction
Syntax
model = SomeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
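X_train holds the training texts, already converted to numerical features (a pipeline that includes a vectorizer can take raw text instead).
y_train holds the category label for each training text.
X_test holds the new texts to classify, converted the same way as X_train.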
Examples
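from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

texts = ['I love cats', 'The sky is blue', 'Python is great']
labels = ['pets', 'nature', 'programming']

# Convert text to word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# One-vs-rest: one binary classifier per category.
# OneVsRestClassifier replaces multi_class='ovr', which is deprecated in recent scikit-learn releases.
model = OneVsRestClassifier(LogisticRegression())
model.fit(X, labels)

# Reuse the fitted vectorizer for new text (transform, not fit_transform).
# With only three training texts, none of these words is in the vocabulary,
# so treat the prediction as a demonstration of the API, not a meaningful result.
new_text = ['I like dogs']
X_new = vectorizer.transform(new_text)
pred = model.predict(X_new)
print(pred)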
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

texts = ['apple is tasty', 'football is fun', 'coding is creative']
labels = ['food', 'sports', 'tech']

# The pipeline vectorizes the text and then classifies it, so raw strings can be passed in
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(['I enjoy basketball']))
Sample Model
This program trains a model to classify text into three categories from the 20 Newsgroups dataset. It prints the test accuracy and the predicted category for the first test text.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

# Load a small subset of data for speed
categories = ['alt.atheism', 'comp.graphics', 'sci.space']
data_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Convert text to numbers
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

# Train model (one-vs-rest: one binary classifier per category).
# OneVsRestClassifier replaces multi_class='ovr', which is deprecated in recent scikit-learn releases.
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train, data_train.target)

# Predict on test data
predictions = model.predict(X_test)

# Calculate accuracy
acc = accuracy_score(data_test.target, predictions)
print(f"Accuracy: {acc:.3f}")
print(f"Sample prediction for first test text: {data_test.target_names[predictions[0]]}")
Important Notes
Text must be converted to numbers before training a model.
Multi-class means more than two categories to choose from.
Accuracy is the share of predictions that match the true category.
Summary
Multi-class text classification sorts text into many groups.
We turn text into numbers, then train a model to learn patterns.
Models predict categories and we check accuracy to see how well they work.
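As a small illustration of checking accuracy, the sketch below compares a hand-written list of true labels with a list of predictions. Both lists are made up for this example and do not come from a trained model. A confusion matrix is included as one possible way to see which categories get mixed up.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, made up for illustration
y_true = ["pets", "nature", "programming", "pets", "nature"]
y_pred = ["pets", "nature", "pets", "pets", "programming"]

# Fraction of positions where the prediction matches the true label (3 of 5 here)
acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.2f}")

# Rows are true categories, columns are predicted categories, in this order
label_order = ["nature", "pets", "programming"]
cm = confusion_matrix(y_true, y_pred, labels=label_order)
print(cm)
```

Off-diagonal counts in the matrix show which true category was predicted as which other category, something a single accuracy number hides.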
Practice
1. What is the main goal of multi-class text classification?
easy
Solution
Step 1: Understand the task of multi-class text classification
This task involves assigning each text sample to one category out of many possible categories.
Step 2: Compare options with the task definition
Only "To sort text into multiple categories based on content" matches this definition.
Final Answer:
To sort text into multiple categories based on content -> Option A
Quick Check:
Multi-class classification = sorting into many categories [OK]
Hint: Multi-class means sorting text into many groups [OK]
Common Mistakes:
- Confusing classification with translation
- Thinking it counts words instead of categorizing
- Mixing generation with classification
2. Which of the following is the correct way to represent text data for multi-class classification?
easy
Solution
Step 1: Identify how models process text
Models cannot learn from raw text strings; they need numbers to learn patterns.
Step 2: Check which option converts text to numbers
The option "Converting text into numerical vectors like TF-IDF or embeddings" turns text into numbers, so it is correct.
Final Answer:
Converting text into numerical vectors like TF-IDF or embeddings -> Option A
Quick Check:
Text must be numbers for models [OK]
Hint: Models need numbers, not raw text, to learn [OK]
Common Mistakes:
- Feeding raw text directly to models
- Thinking sorting text helps classification
- Ignoring the need for numerical representation
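A minimal sketch of the TF-IDF representation mentioned in the solution, using three short texts chosen only for illustration. It assumes TfidfVectorizer's default settings, under which each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["happy day", "sad night", "joyful morning"]

vectorizer = TfidfVectorizer()
# One row per text, one column per vocabulary word, with TF-IDF weights instead of raw counts
X = vectorizer.fit_transform(texts)
print(X.shape)

# With the default norm="l2", every row vector has length 1
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```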
3. Given the following Python code snippet for multi-class text classification, what will be the output of print(predicted_class)?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I love cats", "Dogs are great", "I hate rain"]
labels = ["positive", "positive", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

new_text = ["I love dogs"]
X_new = vectorizer.transform(new_text)
predicted_class = model.predict(X_new)[0]
medium
Solution
Step 1: Understand training data and labels
The model is trained on texts labeled "positive" or "negative". "I love cats" and "Dogs are great" are positive; "I hate rain" is negative.
Step 2: Predict class for new text "I love dogs"
After vectorization, the new text contributes the words "love" and "dogs", which appear only in positive examples. (By default, CountVectorizer lowercases text and drops one-letter tokens such as "I".) The model predicts "positive" as the class.
Final Answer:
"positive" -> Option D
Quick Check:
New text matches positive words, so prediction is positive [OK]
Hint: New text similar to positive examples predicts positive [OK]
Common Mistakes:
- Assuming unknown words cause errors
- Choosing negative because of 'dogs' only
- Picking neutral which is not a trained label
4. Identify the error in this multi-class text classification code snippet:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["happy day", "sad night", "joyful morning"]
labels = ["positive", "negative", "positive"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(texts, labels)
medium
Solution
Step 1: Check input to model.fit()
The model.fit() method expects numerical features, but raw texts are passed instead of vectorized data.
Step 2: Identify correct input
The vectorized data X should be passed to model.fit, not the original texts.
Final Answer:
Passing raw texts instead of vectorized data to model.fit -> Option B
Quick Check:
Model needs numbers, not raw text, for training [OK]
Hint: Model.fit needs vectorized data, not raw text [OK]
Common Mistakes:
- Passing raw text directly to model.fit
- Thinking label type causes error here
- Believing vectorizer choice causes this error
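For comparison, a corrected version of the snippet might look like the sketch below. The only change to the training code is passing X to model.fit; the final prediction on "happy morning" is an added illustration, not part of the original question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["happy day", "sad night", "joyful morning"]
labels = ["positive", "negative", "positive"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
# Fit on the vectorized features X, not on the raw strings in texts
model.fit(X, labels)

# New text must go through the same fitted vectorizer before prediction
pred = model.predict(vectorizer.transform(["happy morning"]))
print(pred[0])
```

Wrapping the vectorizer and the classifier in a pipeline, as in the second example above, is another way to avoid this mistake, because the pipeline accepts raw text directly.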
5. You have a dataset with 5 classes and highly imbalanced text samples per class. Which approach best improves multi-class classification performance?
hard
Solution
Step 1: Understand class imbalance impact
Imbalanced classes cause models to favor majority classes, reducing accuracy on minority classes.
Step 2: Identify best practice to handle imbalance
Using class weighting or oversampling balances the training data, helping the model learn all classes better.
Final Answer:
Use class weighting or oversampling to balance training data -> Option C
Quick Check:
Balance data to improve multi-class model accuracy [OK]
Hint: Balance classes with weighting or oversampling [OK]
Common Mistakes:
- Ignoring imbalance and expecting good results
- Removing minority classes loses valuable data
- Predicting only the majority class ignores others
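One way to apply class weighting in scikit-learn is the "balanced" setting, sketched below. The five class names and their sample counts are hypothetical, chosen only to show how the weights come out; the classifier is constructed but not trained here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 5 classes with 50, 20, 15, 10, and 5 samples
y = np.array(["c0"] * 50 + ["c1"] * 20 + ["c2"] * 15 + ["c3"] * 10 + ["c4"] * 5)
classes = np.unique(y)

# "balanced" weight for a class = n_samples / (n_classes * samples_in_that_class),
# so rarer classes receive larger weights
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
for name, weight in zip(classes, weights):
    print(name, round(float(weight), 3))

# The same weighting can be requested directly when building a classifier
model = LogisticRegression(class_weight="balanced", max_iter=1000)
```

With these weights, an error on a rare class costs more during training than an error on a common class, which counteracts the pull toward the majority class.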
