
Text feature basics (CountVectorizer, TF-IDF) in ML Python - ML Experiment: Train & Evaluate


Experiment - Text feature basics (CountVectorizer, TF-IDF)

Problem: You want to classify movie reviews as positive or negative from their text. The current model uses CountVectorizer features and overfits: training accuracy is very high, but validation accuracy is much lower.

Current Metrics: Training accuracy: 98%, Validation accuracy: 70%

Issue: With default settings, CountVectorizer keeps every distinct token as a raw-count feature, so the model sees a large vocabulary full of rare words and common filler words. It can memorize the training data through those features without generalizing well.

Your Task

Reduce overfitting by improving text feature representation to increase validation accuracy to above 80% while keeping training accuracy below 90%.

You must keep the same classification model (Logistic Regression).

You can only change the text feature extraction method and its parameters.
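One way to respect these constraints is to fix the model in a helper and pass in only the vectorizer. The sketch below assumes a hypothetical helper name, `evaluate_vectorizer`, and uses a tiny invented corpus, so the accuracies it prints say nothing about the real task.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def evaluate_vectorizer(vectorizer, X_train, y_train, X_val, y_val):
    """Fit the given vectorizer and a fixed Logistic Regression model.

    Returns (training accuracy, validation accuracy) as fractions in [0, 1].
    """
    # Fit the vocabulary on training text only, then reuse it for validation
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)

    # The classifier stays the same for every experiment
    model = LogisticRegression(max_iter=200)
    model.fit(X_train_vec, y_train)

    train_acc = accuracy_score(y_train, model.predict(X_train_vec))
    val_acc = accuracy_score(y_val, model.predict(X_val_vec))
    return train_acc, val_acc


# Hypothetical toy data, invented for this illustration (1 = positive)
train_texts = ["great film", "wonderful acting", "terrible plot", "boring film"]
train_labels = [1, 1, 0, 0]
val_texts = ["great acting", "boring plot"]
val_labels = [1, 0]

results = {}
for name, vec in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    results[name] = evaluate_vectorizer(
        vec, train_texts, train_labels, val_texts, val_labels
    )
    print(name, results[name])
```

Because only the `vectorizer` argument changes between runs, any difference in the train/validation gap can be attributed to the feature extraction step.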


Solution


from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in data: for simplicity, two 20 newsgroups categories are used in place of
# a movie review dataset, so the accuracies printed below may differ from the
# figures quoted in this exercise. The first call downloads the dataset.
categories = ['rec.autos', 'rec.sport.baseball']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X_train, X_val, y_train, y_val = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Use TF-IDF vectorizer with stop words removal and max_features limit
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Train Logistic Regression
model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_tfidf)
val_preds = model.predict(X_val_tfidf)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')

Replaced CountVectorizer with TfidfVectorizer to better represent word importance.

Added stop words removal to reduce noise from common words.

Limited vocabulary size with max_features=1000 to reduce overfitting.
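The effect of these two parameters on vocabulary size can be checked on a small example. This is a sketch on invented sentences, with `max_features=3` chosen only so the cap is visible on such a small corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews, invented for this illustration
docs = [
    "the movie was a great film and the cast was great",
    "the film was terrible and the plot was boring",
    "a great plot and a great ending",
]

full = TfidfVectorizer().fit(docs)
no_stop = TfidfVectorizer(stop_words="english").fit(docs)
capped = TfidfVectorizer(stop_words="english", max_features=3).fit(docs)

full_vocab = sorted(full.vocabulary_)
no_stop_vocab = sorted(no_stop.vocabulary_)
capped_vocab = sorted(capped.vocabulary_)

print(len(full_vocab), full_vocab)        # every token of two or more characters
print(len(no_stop_vocab), no_stop_vocab)  # common English words such as "the" removed
print(len(capped_vocab), capped_vocab)    # only the most frequent remaining terms
```

`max_features` keeps the terms with the highest total frequency across the corpus, so rare words that a model could use to memorize individual training documents are the first to go.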

Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 70% (high overfitting)

After: Training accuracy: 88.5%, Validation accuracy: 82.3% (reduced overfitting, better generalization)

Using TF-IDF features with stop words removal and limiting vocabulary size helps reduce overfitting by focusing on important words and reducing noise, improving validation accuracy.
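A short sketch of how TF-IDF shifts weight toward distinctive words: in the invented corpus below, "movie" appears in every document and "great" in only one, so within the first document "great" should receive the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews, invented for this illustration
docs = ["movie great", "movie terrible", "movie boring"]

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()

# vocabulary_ maps each term to its column index in the matrix
vocab = tfidf_vec.vocabulary_
first_doc = weights[0]
movie_weight = first_doc[vocab["movie"]]
great_weight = first_doc[vocab["great"]]

print(f"movie: {movie_weight:.3f}")  # appears in all documents, lower weight
print(f"great: {great_weight:.3f}")  # appears in one document, higher weight
```

With raw counts both words would have the value 1 in that document; the inverse document frequency term is what separates them.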

Bonus Experiment

Try adding n-grams (like bigrams) to the TF-IDF vectorizer and see if validation accuracy improves further.

💡 Hint

Set ngram_range=(1,2) in TfidfVectorizer to include single words and pairs of words.

Practice


1. What does CountVectorizer do in text processing?

easy

A. Calculates the importance of words based on frequency and rarity

B. Counts how many times each word appears in the text

C. Removes stop words from the text

D. Converts text into lowercase only
