Text feature basics (CountVectorizer, TF-IDF) in ML Python - ML Experiment: Train & Evaluate

Experiment - Text feature basics (CountVectorizer, TF-IDF)
Problem: You want to classify movie reviews as positive or negative using text data. Currently, the model uses CountVectorizer features but overfits, showing very high training accuracy but much lower validation accuracy.
Current Metrics: Training accuracy: 98%, Validation accuracy: 70%
Issue: The model overfits because the default CountVectorizer builds a large vocabulary of raw counts, including stop words and rare terms, so the model can memorize training data instead of generalizing.
Your Task
Reduce overfitting by improving text feature representation to increase validation accuracy to above 80% while keeping training accuracy below 90%.
You must keep the same classification model (Logistic Regression).
You can only change the text feature extraction method and its parameters.
Solution
ML Python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in for movie reviews: a two-category subset of 20 newsgroups (downloaded on first use)
categories = ['rec.autos', 'rec.sport.baseball']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X_train, X_val, y_train, y_val = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Use TF-IDF vectorizer with stop words removal and max_features limit
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Train Logistic Regression
model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_tfidf)
val_preds = model.predict(X_val_tfidf)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Replaced CountVectorizer with TfidfVectorizer to better represent word importance.
Added stop words removal to reduce noise from common words.
Limited vocabulary size with max_features=1000 to reduce overfitting.
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 70% (high overfitting)

After: Training accuracy: 88.5%, Validation accuracy: 82.3% (reduced overfitting, better generalization)

Using TF-IDF features with stop words removal and limiting vocabulary size helps reduce overfitting by focusing on important words and reducing noise, improving validation accuracy.
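To make the effect of stop word removal and max_features concrete, here is a minimal sketch on a tiny made-up corpus (the sentences and the cap of 5 are assumptions for illustration, not the experiment's data or settings). It only compares vocabulary sizes and does not reproduce the accuracy numbers above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus (an assumption for illustration only)
docs = [
    "the movie was great and the acting was great",
    "the plot was dull and the pacing was slow",
    "a great cast but a dull script",
]

# Default CountVectorizer keeps every token of two or more characters
count_vec = CountVectorizer()
count_vec.fit(docs)

# TF-IDF with English stop words removed and the vocabulary capped at 5 terms
tfidf_vec = TfidfVectorizer(stop_words="english", max_features=5)
tfidf_vec.fit(docs)

print("CountVectorizer vocabulary size:", len(count_vec.vocabulary_))
print("Capped TF-IDF vocabulary size:", len(tfidf_vec.vocabulary_))
print("Capped TF-IDF vocabulary:", sorted(tfidf_vec.vocabulary_))
```

Fewer columns means fewer coefficients for Logistic Regression to fit, which is one plausible reason the smaller vocabulary generalizes better.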
Bonus Experiment
Try adding n-grams (like bigrams) to the TF-IDF vectorizer and see if validation accuracy improves further.
💡 Hint
Set ngram_range=(1,2) in TfidfVectorizer to include single words and pairs of words.
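A small sketch of what ngram_range=(1, 2) adds, using two made-up sentences (an assumption for illustration). It shows only how the vocabulary grows; whether validation accuracy improves depends on the data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences (illustrative only)
docs = ["not good at all", "very good movie"]

# Single words only (the default)
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)

# Single words plus pairs of adjacent words
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

Bigrams such as "not good" let the model see negation that single words miss, but they also enlarge the vocabulary, so it may be worth keeping max_features in place when trying this.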

Practice

1. What does CountVectorizer do in text processing?
easy
A. Calculates the importance of words based on frequency and rarity
B. Counts how many times each word appears in the text
C. Removes stop words from the text
D. Converts text into lowercase only

Solution

  1. Step 1: Understand CountVectorizer's role

    CountVectorizer transforms text into a matrix of token counts, counting word occurrences.
  2. Step 2: Differentiate from TF-IDF

    Unlike TF-IDF, it does not weigh words by importance, only counts frequency.
  3. Final Answer:

    Counts how many times each word appears in the text -> Option B
  4. Quick Check:

    CountVectorizer = word counts [OK]
Hint: CountVectorizer counts words, TF-IDF scores importance [OK]
Common Mistakes:
  • Confusing CountVectorizer with TF-IDF
  • Thinking it removes stop words by default
  • Assuming it normalizes text only
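As a quick illustration of the counting behaviour described above, this sketch runs CountVectorizer on two short made-up strings and prints the count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["apple banana apple", "banana fruit"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts).toarray()

# Columns follow the alphabetically sorted vocabulary: apple, banana, fruit
print(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))
print(counts)
# [[2 1 0]
#  [0 1 1]]
```

Each row is a text and each cell is a raw occurrence count; no weighting by importance is applied.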
2. Which of the following is the correct way to import and create a CountVectorizer in Python?
easy
A. from sklearn.feature_extraction.text import CountVectorizer; vectorizer = CountVectorizer()
B. import CountVectorizer from sklearn.text; vectorizer = CountVectorizer()
C. from sklearn.text import CountVectorizer; vectorizer = CountVectorizer()
D. import CountVectorizer; vectorizer = CountVectorizer()

Solution

  1. Step 1: Recall correct sklearn import path

    CountVectorizer is in the sklearn.feature_extraction.text module.
  2. Step 2: Check syntax correctness

    from sklearn.feature_extraction.text import CountVectorizer; vectorizer = CountVectorizer() uses correct import and instantiation syntax.
  3. Final Answer:

    from sklearn.feature_extraction.text import CountVectorizer; vectorizer = CountVectorizer() -> Option A
  4. Quick Check:

    Correct import path and syntax [OK]
Hint: CountVectorizer is in sklearn.feature_extraction.text [OK]
Common Mistakes:
  • Using wrong module path for import
  • Incorrect import syntax (like import ... from ...)
  • Forgetting to instantiate the class
3. What will be the output shape of the matrix after applying CountVectorizer on these two sentences?
sentences = ["I love cats", "Cats love me"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
print(X.shape)
medium
A. (2, 4)
B. (2, 3)
C. (3, 2)
D. (4, 2)

Solution

  1. Step 1: Count the tokens CountVectorizer keeps

    CountVectorizer lowercases the text and, with its default token_pattern, keeps only tokens of two or more characters, so 'I' is dropped. The vocabulary is 'cats', 'love', 'me' -> 3 unique words.
  2. Step 2: Understand shape of output matrix

    There are 2 sentences (rows) and 3 vocabulary words (columns), so shape is (2, 3).
  3. Final Answer:

    (2, 3) -> Option B
  4. Quick Check:

    Rows = sentences, columns = vocabulary words [OK]
Hint: Shape = (number of texts, vocabulary size) [OK]
Common Mistakes:
  • Mixing rows and columns in shape
  • Counting duplicate words multiple times
  • Counting single-character tokens such as 'I', which the default tokenizer drops
  • Forgetting that CountVectorizer lowercases by default
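The shape question can be checked directly. In this sketch the default token_pattern keeps only tokens of two or more word characters, so the single-letter "I" never enters the vocabulary; the second vectorizer uses a custom pattern (an illustrative choice, not part of the question) to keep single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love cats", "Cats love me"]

# Default settings: lowercase, tokens of 2+ word characters
default_vec = CountVectorizer()
X_default = default_vec.fit_transform(sentences)
print(sorted(default_vec.vocabulary_))  # ['cats', 'love', 'me']
print(X_default.shape)                  # (2, 3)

# Custom pattern that also keeps one-character tokens such as "i"
single_char_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X_single = single_char_vec.fit_transform(sentences)
print(sorted(single_char_vec.vocabulary_))  # ['cats', 'i', 'love', 'me']
print(X_single.shape)                       # (2, 4)
```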
4. Identify the error in this TF-IDF code snippet:
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["apple banana apple", "banana fruit"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
print(tfidf.get_feature_names())
medium
A. fit_transform() should be called on texts as a string, not list
B. TfidfVectorizer() requires stop_words parameter
C. get_feature_names() is deprecated, should use get_feature_names_out()
D. Import statement is incorrect

Solution

  1. Step 1: Check method usage for feature names

    get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions this call raises an AttributeError.
  2. Step 2: Use updated method

    Use get_feature_names_out() instead to get feature names without error.
  3. Final Answer:

    get_feature_names() is deprecated, should use get_feature_names_out() -> Option C
  4. Quick Check:

    Use get_feature_names_out() for TF-IDF features [OK]
Hint: Use get_feature_names_out() with TF-IDF [OK]
Common Mistakes:
  • Using deprecated get_feature_names() method
  • Passing wrong data type to fit_transform
  • Incorrect import paths
5. You want to transform text data so that common words like 'the' and 'is' have less impact, but rare important words have higher scores. Which method should you use?
hard
A. One-hot encoding of words
B. CountVectorizer without stop words
C. Raw word counts from CountVectorizer
D. TF-IDF Vectorizer

Solution

  1. Step 1: Understand the goal of reducing common word impact

    Common words appear frequently but carry less meaning, so their impact should be lowered.
  2. Step 2: Identify method that weighs words by importance

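    TF-IDF multiplies a term's frequency in a document by its inverse document frequency, so words that appear in most documents (like 'the') get low weights and rarer, more distinctive words get higher ones.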
  3. Final Answer:

    TF-IDF Vectorizer -> Option D
  4. Quick Check:

    TF-IDF = importance weighting [OK]
Hint: Use TF-IDF to weigh rare words higher [OK]
Common Mistakes:
  • Using raw counts which treat all words equally
  • Assuming stop words removal alone solves importance
  • Confusing one-hot encoding with frequency weighting