Multi-class text classification in NLP

Introduction

Multi-class text classification assigns each piece of text to exactly one of three or more categories. It is a common way to organize and route large volumes of text automatically.

Common uses include:
Sorting emails into categories like work, personal, or spam.
Classifying news articles into topics like sports, politics, or technology.
Organizing customer reviews by sentiment: positive, neutral, or negative.
Tagging social media posts by subject such as travel, food, or fashion.
Syntax
model = SomeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

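SomeClassifier is a placeholder for any classifier that follows the fit/predict pattern used in the examples below.

X_train holds the features for the training texts: numeric vectors such as word counts or TF-IDF values, or the raw texts themselves when the model is a pipeline that includes a vectorizer.

y_train holds one label per training text, giving that text's category.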

Examples
This example shows how to train a simple model to classify text into three categories.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['I love cats', 'The sky is blue', 'Python is great']
labels = ['pets', 'nature', 'programming']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# multi_class is deprecated in recent scikit-learn releases; the default already handles multiple classes
model = LogisticRegression()
model.fit(X, labels)

new_text = ['I like dogs']
X_new = vectorizer.transform(new_text)
pred = model.predict(X_new)
print(pred)
This example uses a pipeline to combine text vectorization and classification in one step.
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

texts = ['apple is tasty', 'football is fun', 'coding is creative']
labels = ['food', 'sports', 'tech']

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(['I enjoy basketball']))
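One caveat with tiny training sets like the one above: a new text may share no words with the training texts, in which case the model has nothing to go on. The sketch below reuses the same pipeline and checks how many words of the new text the vectorizer recognizes. The step name 'countvectorizer' is the name make_pipeline generates automatically from the class name.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['apple is tasty', 'football is fun', 'coding is creative']
labels = ['food', 'sports', 'tech']

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Pull the fitted vectorizer out of the pipeline by its auto-generated step name
fitted_vectorizer = model.named_steps['countvectorizer']
X_new = fitted_vectorizer.transform(['I enjoy basketball'])

# Number of non-zero features, i.e. words in the new text that were seen in training
known_word_count = X_new.nnz
print(known_word_count)

# Class probabilities for the new text, in the order of model.classes_
proba = model.predict_proba(['I enjoy basketball'])[0]
print(dict(zip(model.classes_, np.round(proba, 3))))
```

When no words are recognized, the probabilities reflect only how often each class appeared in training, so the predicted label says little about the text itself. A larger training set, as in the sample model below, reduces this problem.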
Sample Model

This program trains a model on three categories from the 20 Newsgroups dataset. It prints the test accuracy and the predicted category for the first test text.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small subset of data for speed
categories = ['alt.atheism', 'comp.graphics', 'sci.space']
data_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Convert text to numbers
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

# Train model
# multi_class is deprecated in recent scikit-learn releases; the default already handles multiple classes
model = LogisticRegression(max_iter=1000)
model.fit(X_train, data_train.target)

# Predict on test data
predictions = model.predict(X_test)

# Calculate accuracy
acc = accuracy_score(data_test.target, predictions)

print(f"Accuracy: {acc:.3f}")
print(f"Sample prediction for first test text: {data_test.target_names[predictions[0]]}")
Important Notes

Text must be converted to numbers before training a model.

Multi-class means more than two categories to choose from.

Accuracy is the fraction of test texts whose predicted category matches the true category.
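The accuracy note can be checked by hand. The sketch below uses hypothetical true and predicted labels for six test texts, computes accuracy manually, and compares it with accuracy_score. It also prints a confusion matrix, which shows which classes get mixed up.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six test texts
y_true = ["sports", "politics", "tech", "sports", "tech", "politics"]
y_pred = ["sports", "tech", "tech", "sports", "tech", "sports"]

# Accuracy by hand: correct predictions divided by total predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_acc = correct / len(y_true)

# The same number from scikit-learn
sk_acc = accuracy_score(y_true, y_pred)
print(manual_acc, sk_acc)

# Rows are true classes, columns are predicted classes, in the order of class_names
class_names = ["politics", "sports", "tech"]
cm = confusion_matrix(y_true, y_pred, labels=class_names)
print(cm)
```

In this made-up data the overall accuracy hides that every "politics" text is misclassified, which is why a per-class view is often worth a look alongside the single accuracy number.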

Summary

Multi-class text classification sorts text into many groups.

We turn text into numbers, then train a model to learn patterns.

Models predict categories and we check accuracy to see how well they work.

Practice

1. What is the main goal of multi-class text classification?
easy
A. To sort text into multiple categories based on content
B. To translate text into another language
C. To count the number of words in a text
D. To generate new text from a given input

Solution

  1. Step 1: Understand the task of multi-class text classification

    This task involves assigning each text sample to one category out of many possible categories.
  2. Step 2: Compare options with the task definition

    Only option A, "To sort text into multiple categories based on content", describes assigning text to categories. The other options describe translation, word counting, and text generation.
  3. Final Answer:

    To sort text into multiple categories based on content -> Option A
  4. Quick Check:

    Multi-class classification = sorting into many categories [OK]
Hint: Multi-class means sorting text into many groups [OK]
Common Mistakes:
  • Confusing classification with translation
  • Thinking it counts words instead of categorizing
  • Mixing generation with classification
2. Which of the following is the correct way to represent text data for multi-class classification?
easy
A. Converting text into numerical vectors like TF-IDF or embeddings
B. Using raw text strings directly as input to the model
C. Sorting text alphabetically before training
D. Removing all punctuation and spaces only

Solution

  1. Step 1: Identify how models process text

    Models cannot understand raw text strings; they need numbers to learn patterns.
  2. Step 2: Check which option converts text to numbers

    Only option A converts text into numerical vectors, such as TF-IDF values or embeddings. The other options leave the text as strings.
  3. Final Answer:

    Converting text into numerical vectors like TF-IDF or embeddings -> Option A
  4. Quick Check:

    Text must be numbers for models [OK]
Hint: Models need numbers, not raw text, to learn [OK]
Common Mistakes:
  • Feeding raw text directly to models
  • Thinking sorting text helps classification
  • Ignoring the need for numerical representation
3. Given the following Python code snippet for multi-class text classification, what will be the output of print(predicted_class)?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I love cats", "Dogs are great", "I hate rain"]
labels = ["positive", "positive", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

new_text = ["I love dogs"]
X_new = vectorizer.transform(new_text)
predicted_class = model.predict(X_new)[0]
print(predicted_class)
medium
A. "neutral"
B. "negative"
C. An error because of unknown words
D. "positive"

Solution

  1. Step 1: Understand training data and labels

    The model is trained on texts labeled as "positive" or "negative". "I love cats" and "Dogs are great" are positive, "I hate rain" is negative.
  2. Step 2: Predict class for new text "I love dogs"

    With its default settings, CountVectorizer drops one-letter words such as "I", so the new text is represented by "love" and "dogs". Both words appear only in positive training examples, so the model predicts "positive".
  3. Final Answer:

    "positive" -> Option D
  4. Quick Check:

    New text matches positive words, so prediction is positive [OK]
Hint: New text similar to positive examples predicts positive [OK]
Common Mistakes:
  • Assuming unknown words cause errors
  • Choosing negative because of 'dogs' only
  • Picking neutral which is not a trained label
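The first common mistake above can be checked directly. This sketch reuses the question's training data and transforms a text containing a word the vectorizer has never seen ("puppies", chosen here only for illustration). It assumes the default CountVectorizer settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I love cats", "Dogs are great", "I hate rain"]
labels = ["positive", "positive", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# Learned vocabulary: lowercased, and the one-letter word "I" is dropped by default
vocab = sorted(vectorizer.vocabulary_)
print(vocab)

# A text with an unseen word still transforms without error
X_new = vectorizer.transform(["I love puppies"])

# Words from the new text that the vectorizer actually kept
recognized = vectorizer.inverse_transform(X_new)[0].tolist()
print(recognized)
print(model.predict(X_new)[0])
```

Unseen words are silently ignored rather than raising an error, so the prediction rests only on the words that were present in the training vocabulary.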
4. Identify the error in this multi-class text classification code snippet:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["happy day", "sad night", "joyful morning"]
labels = ["positive", "negative", "positive"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(texts, labels)
medium
A. Labels should be numbers, not strings
B. Passing raw texts instead of vectorized data to model.fit
C. Using LogisticRegression instead of Naive Bayes
D. TfidfVectorizer cannot be used with LogisticRegression

Solution

  1. Step 1: Check input to model.fit()

    The model.fit() method expects numerical features, but raw texts are passed instead of vectorized data.
  2. Step 2: Identify correct input

    The vectorized data X should be passed to model.fit, not the original texts.
  3. Final Answer:

    Passing raw texts instead of vectorized data to model.fit -> Option B
  4. Quick Check:

    Model needs numbers, not raw text, for training [OK]
Hint: Model.fit needs vectorized data, not raw text [OK]
Common Mistakes:
  • Passing raw text directly to model.fit
  • Thinking label type causes error here
  • Believing vectorizer choice causes this error
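For reference, a corrected version of the snippet from this question might look like the following. The only change is the argument passed to model.fit. The final lines are an added check, not part of the original snippet.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["happy day", "sad night", "joyful morning"]
labels = ["positive", "negative", "positive"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
# Fixed: pass the TF-IDF matrix X, not the raw list of strings
model.fit(X, labels)

# String labels are accepted; the model stores them in sorted order
fitted_classes = list(model.classes_)
print(fitted_classes)
```

This also illustrates why option A is wrong: the string labels are used as-is, with no manual conversion to numbers.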
5. You have a dataset with 5 classes and highly imbalanced text samples per class. Which approach best improves multi-class classification performance?
hard
A. Use only the most frequent class for prediction
B. Ignore imbalance and train model on raw data
C. Use class weighting or oversampling to balance training data
D. Remove classes with fewer samples to simplify problem

Solution

  1. Step 1: Understand class imbalance impact

    Imbalanced classes cause models to favor majority classes, reducing accuracy on minority classes.
  2. Step 2: Identify best practice to handle imbalance

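    Class weighting makes errors on rare classes count more during training, and oversampling adds more minority-class examples to the training data. Either approach helps the model learn all classes rather than only the majority.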
  3. Final Answer:

    Use class weighting or oversampling to balance training data -> Option C
  4. Quick Check:

    Balance data to improve multi-class model accuracy [OK]
Hint: Balance classes with weighting or oversampling [OK]
Common Mistakes:
  • Ignoring imbalance and expecting good results
  • Removing minority classes loses valuable data
  • Predicting only the majority class ignores others
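As a rough sketch of the class-weighting option, the example below uses a hypothetical, deliberately imbalanced three-class dataset (the question describes five classes; three keep the example short). It shows the weights that scikit-learn's "balanced" setting would assign, then trains a pipeline with class_weight="balanced". The texts and class names are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced data: 6 "billing", 3 "shipping", 1 "legal"
labels = np.array(["billing"] * 6 + ["shipping"] * 3 + ["legal"] * 1)
texts = (["invoice charge refund"] * 6
         + ["parcel delivery late"] * 3
         + ["contract dispute lawyer"] * 1)

# "balanced" weights are n_samples / (n_classes * class_count)
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
print(weight_by_class)

# The same weighting applied inside the classifier during training
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(texts, labels)
prediction = model.predict(["lawyer contract"])[0]
print(prediction)
```

The rarest class receives the largest weight, so a mistake on it costs more during training. Oversampling is an alternative that changes the data rather than the loss; which works better depends on the dataset.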