
Bag of Words (CountVectorizer) in NLP - Model Metrics & Evaluation

Metrics & Evaluation - Bag of Words (CountVectorizer)
Which metric matters for Bag of Words (CountVectorizer) and why

When using Bag of Words with CountVectorizer, the goal is to turn text into numeric features that a model can learn from. CountVectorizer has no score of its own, so you judge it indirectly through the metrics of the model trained on its output. Accuracy, precision, and recall show how well that model separates the classes using the word counts CountVectorizer produced.

For example, if you use Bag of Words for spam detection, precision tells you how many emails marked as spam really are spam, and recall tells you how many spam emails you caught. These metrics help check if the word counts are helping the model make good decisions.
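As a rough illustration of that workflow, the sketch below trains a small classifier on word counts and reports precision and recall. The example emails, the labels (1 = spam, 0 = not spam), and the choice of MultinomialNB are all assumptions made for this sketch, and a dataset this small says nothing about real-world performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = spam, 0 = not spam.
train_texts = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "lunch meeting at noon",
    "project report attached",
    "see you at the meeting",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = [
    "free prize inside",
    "meeting notes attached",
    "click to win money",
    "see you at lunch",
]
test_labels = [1, 0, 1, 0]

# Turn text into word-count features (vocabulary comes from training text only).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train a simple classifier on the counts and predict the held-out emails.
model = MultinomialNB()
model.fit(X_train, train_labels)
predicted = model.predict(X_test)

# zero_division=0 avoids a warning if the model predicts no spam at all.
precision = precision_score(test_labels, predicted, zero_division=0)
recall = recall_score(test_labels, predicted, zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f}")
```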

Confusion Matrix Example
      Actual \ Predicted | Spam | Not Spam
      -------------------|-------|---------
      Spam               |  80   |   20
      Not Spam           |  10   |   90
    

Here, TP=80 (spam correctly found), FP=10 (not spam wrongly marked spam), FN=20 (spam missed), TN=90 (not spam correctly found).

Precision = 80 / (80 + 10) ≈ 0.89
Recall = 80 / (80 + 20) = 0.80
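The same arithmetic can be reproduced in a few lines of Python. This sketch uses only the four counts from the confusion matrix above; accuracy is included for comparison.

```python
# Counts taken from the confusion matrix above.
tp, fn = 80, 20  # actual spam: caught, missed
fp, tn = 10, 90  # actual not spam: wrongly flagged, correctly passed

precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)     # of all real spam, how much was caught
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.89 recall=0.80 accuracy=0.85
```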

Precision vs Recall Tradeoff

With Bag of Words, sometimes the model finds many spam emails but also marks some good emails as spam (high recall, low precision). Or it marks only very sure spam emails but misses some (high precision, low recall).

For spam filters, high precision is important to avoid losing good emails. For medical text classification, high recall is important to catch all cases.
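One common way this tradeoff shows up is through the decision threshold applied to a model's spam score. The sketch below assumes hypothetical scores and labels (they do not come from a real model) and shows precision rising and recall falling as the threshold gets stricter.

```python
def precision_recall_at(threshold, scores, labels):
    """Return (precision, recall) when scores >= threshold are flagged as spam."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical spam scores for eight emails (1 = spam, 0 = not spam).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]

for threshold in (0.25, 0.50, 0.75):
    precision, recall = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
# threshold=0.25 precision=0.67 recall=1.00
# threshold=0.50 precision=0.75 recall=0.75
# threshold=0.75 precision=1.00 recall=0.50
```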

Good vs Bad Metric Values

Good: As a rule of thumb, precision and recall both above 0.8 suggest the Bag of Words features help the model find relevant text patterns. The right target still depends on the task and on how costly each kind of error is.

Bad: Precision or recall below 0.5 suggests the word counts are not helping the model distinguish the classes. Possible causes include a poorly chosen vocabulary or noisy text.

Common Pitfalls
  • Ignoring stop words: Common words like "the" or "and" can add noise if not removed.
  • High dimensionality: Bag of Words can create many features, causing overfitting if the dataset is small.
  • Data leakage: Using test data to build vocabulary can inflate metrics falsely.
  • Accuracy paradox: High accuracy can be misleading if classes are imbalanced.
Self Check

Your model using Bag of Words has 98% accuracy but only 12% recall on spam emails. Is it good?

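Answer: No. The model misses most spam emails: a recall of 12% means 88% of spam gets through. The high accuracy mostly reflects the large number of non-spam emails classified correctly, so it hides the failure on the class you care about. You need to improve recall by adjusting the features, the model, or the decision threshold.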
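To see how both numbers can hold at once, consider a set of hypothetical counts chosen to match the scenario: 10,000 emails, of which 200 are spam. These counts are an assumption for illustration, not data from a real filter.

```python
# Hypothetical counts chosen to reproduce 98% accuracy with 12% recall.
tp, fn = 24, 176    # 200 spam emails: only 24 caught
fp, tn = 24, 9776   # 9,800 non-spam emails: 24 wrongly flagged

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
# prints: accuracy=0.98 recall=0.12 precision=0.50
```

Because spam is only 2% of the emails here, correctly passing the non-spam majority is enough to reach 98% accuracy while most spam is missed.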

Key Result
Precision and recall are key to evaluating Bag of Words effectiveness; high accuracy alone can be misleading.

Practice

1. What does the Bag of Words model do in text processing?
easy
A. Counts how often each word appears in the text
B. Translates text into another language
C. Removes all punctuation from the text
D. Generates summaries of the text

Solution

  1. Step 1: Understand Bag of Words purpose

    Bag of Words counts the frequency of each word in a text, ignoring order.
  2. Step 2: Compare options to definition

    Only option A, "Counts how often each word appears in the text", matches this description.
  3. Final Answer:

    Counts how often each word appears in the text -> Option A
  4. Quick Check:

    Bag of Words = Counts words [OK]
Hint: Bag of Words counts words, not translates or summarizes [OK]
Common Mistakes:
  • Confusing Bag of Words with translation
  • Thinking it removes punctuation only
  • Assuming it summarizes text
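The counting idea can be sketched without any library. The version below splits on whitespace, which is a simplification: it is not the tokenization CountVectorizer uses, and the example sentence is made up.

```python
from collections import Counter

# Hypothetical sentence; lowercasing and whitespace splitting stand in for real tokenization.
text = "The cat sat on the mat"
counts = Counter(text.lower().split())

print(counts["the"])  # 2
print(counts["cat"])  # 1
print(counts["dog"])  # 0 (word not present)
```

Word order is discarded: "the mat sat on the cat" produces exactly the same counts.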
2. Which of the following is the correct way to import CountVectorizer from scikit-learn in Python?
easy
A. import CountVectorizer from sklearn.feature_extraction
B. from sklearn.feature_extraction.text import CountVectorizer
C. from sklearn.text import CountVectorizer
D. import CountVectorizer from sklearn.text

Solution

  1. Step 1: Recall correct import path

    CountVectorizer is in sklearn.feature_extraction.text module.
  2. Step 2: Match options to correct syntax

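    Only option B, from sklearn.feature_extraction.text import CountVectorizer, uses both valid Python 'from ... import ...' syntax and the correct module path. Options A and D use 'import ... from ...', which is not valid Python, and option C points to a module that does not exist.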
  3. Final Answer:

    from sklearn.feature_extraction.text import CountVectorizer -> Option B
  4. Quick Check:

    Correct import path = from sklearn.feature_extraction.text import CountVectorizer [OK]
Hint: CountVectorizer is in sklearn.feature_extraction.text [OK]
Common Mistakes:
  • Using wrong module path
  • Incorrect import syntax
  • Trying to import from sklearn.text
3. What will be the output shape of the matrix after applying CountVectorizer on these two sentences:
['I love cats', 'Cats love me']?
medium
A. (3, 2)
B. (2, 3)
C. (4, 2)
D. (2, 4)

Solution

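  1. Step 1: Identify the vocabulary

    CountVectorizer lowercases text by default, so 'Cats' and 'cats' are the same token. Its default token_pattern keeps only tokens of two or more alphanumeric characters, so 'I' is dropped. The vocabulary is 'cats', 'love', 'me'.
  2. Step 2: Count sentences and features

    There are 2 sentences and 3 vocabulary terms, so the matrix shape is (2, 3).
  3. Final Answer:

    (2, 3) -> Option B
  4. Quick Check:

    2 sentences, 3 tokens = (2, 3) [OK]
Hint: Shape is (documents, vocabulary size); single-character tokens are dropped by default [OK]
Common Mistakes:
  • Counting words per sentence instead of unique words
  • Mixing rows and columns in shape
  • Treating 'Cats' and 'cats' as different words (text is lowercased by default)
  • Counting 'I', which the default tokenizer drops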
4. The following code throws an error on recent scikit-learn versions (1.2 and later). What is the mistake?
from sklearn.feature_extraction.text import CountVectorizer
texts = ['hello world', 'hello']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())
print(vectorizer.get_feature_names())
medium
A. get_feature_names() is deprecated, should use get_feature_names_out()
B. fit_transform() should be fit_transform_text()
C. toarray() is not a method of X
D. CountVectorizer() needs a parameter for language

Solution

  1. Step 1: Identify the outdated method

    get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions the last line raises an AttributeError.
  2. Step 2: Use the current method

    Replace get_feature_names() with get_feature_names_out() to fix the error.
  3. Final Answer:

    get_feature_names() is deprecated, should use get_feature_names_out() -> Option A
  4. Quick Check:

    Use get_feature_names_out() not get_feature_names() [OK]
Hint: Use get_feature_names_out() instead of deprecated get_feature_names() [OK]
Common Mistakes:
  • Thinking fit_transform() is wrong
  • Assuming toarray() is invalid
  • Believing CountVectorizer needs language parameter
5. You have a list of sentences with some words repeated many times. How can you use CountVectorizer to ignore words that appear in more than 50% of the sentences?
hard
A. Set min_df=0.5 to ignore frequent words
B. Use stop_words='english' to remove frequent words
C. Set the parameter max_df=0.5 when creating CountVectorizer
D. Set max_features=0.5 to limit word count

Solution

  1. Step 1: Understand max_df parameter

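    max_df=0.5 tells CountVectorizer to ignore words that appear in more than 50% of documents. A float is read as a proportion of documents; an integer is read as an absolute document count.
  2. Step 2: Compare other options

    min_df sets a lower bound on document frequency, so it drops rare words, not frequent ones. stop_words='english' removes a fixed list of common English words regardless of how often they occur in your data. max_features expects an integer and keeps only the most frequent terms across the corpus. None of these ignores frequent words by the share of documents they appear in.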
  3. Final Answer:

    Set the parameter max_df=0.5 when creating CountVectorizer -> Option C
  4. Quick Check:

    max_df filters frequent words by document frequency [OK]
Hint: Use max_df to exclude very common words [OK]
Common Mistakes:
  • Confusing max_df with min_df
  • Thinking stop_words removes all frequent words
  • Using max_features to filter frequency