Bag of Words helps computers understand text by counting how many times each word appears. It turns words into numbers so machines can learn from text.
Bag of Words (CountVectorizer) in NLP
Introduction
Syntax
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(list_of_texts)  # list_of_texts: your list of strings

# To see the words (features):
words = vectorizer.get_feature_names_out()

# To see the counts:
counts = X.toarray()
```
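fit_transform learns the vocabulary (the unique words) from the texts and returns a matrix of word counts with one row per text.
get_feature_names_out returns the learned words in the same order as the matrix columns.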
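To make the two halves of fit_transform concrete, here is a minimal sketch that calls fit and transform separately. The sentences and the variable names (train_texts, vocab, new_counts) are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["red apple", "green apple"]  # illustrative data, not from the lesson
vectorizer = CountVectorizer()

# fit() only learns the vocabulary: a mapping from each word to a column index.
vectorizer.fit(train_texts)
vocab = vectorizer.vocabulary_
print(vocab)

# transform() counts words using the vocabulary learned above.
# Words that were not seen during fit ("blue" here) are ignored.
new_counts = vectorizer.transform(["blue apple apple"]).toarray()
print(new_counts)
```

Calling fit_transform on the training texts is equivalent to fit followed by transform on the same texts. For new text, such as test data, only transform should be used so the columns stay the same.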
Examples
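```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cats", "Cats are great"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # ['are' 'cats' 'great' 'love']
print(X.toarray())                         # [[0 1 0 1], [1 1 1 0]]
```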
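```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["apple apple orange", "banana apple"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # ['apple' 'banana' 'orange']
print(X.toarray())                         # [[2 0 1], [1 1 0]]
```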
Sample Model
This program turns four sentences into a matrix of word counts. It prints the words found and the count of each word per sentence.
```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
    "Python coding is great"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

words = vectorizer.get_feature_names_out()
counts = X.toarray()

print("Words found:", words)
print("Counts matrix:")
for i, text in enumerate(texts):
    print(f"Text {i+1}: '{text}'")
    print(counts[i])
```
Important Notes
By default, CountVectorizer lowercases the text, ignores punctuation, and drops one-character tokens such as 'I' and 'a'.
Stop words (common words like 'the', 'is') can be removed by setting stop_words='english'.
The output is a sparse matrix that stores only the non-zero counts. Converting it with toarray() shows every count but can use much more memory when the vocabulary is large.
Summary
Bag of Words counts how often each word appears in text.
CountVectorizer turns text into numbers for machine learning.
You can see the words found and their counts per text.
Practice
1. What does the Bag of Words model do in text processing?
easy
Solution
Step 1: Understand Bag of Words purpose
Bag of Words counts the frequency of each word in a text, ignoring order.
Step 2: Compare options to definition
Only "Counts how often each word appears in the text" matches this description exactly.
Final Answer:
Counts how often each word appears in the text -> Option A
Quick Check:
Bag of Words = Counts words [OK]
Hint: Bag of Words counts words; it does not translate or summarize [OK]
Common Mistakes:
- Confusing Bag of Words with translation
- Thinking it removes punctuation only
- Assuming it summarizes text
2. Which of the following is the correct way to import CountVectorizer from scikit-learn in Python?
easy
Solution
Step 1: Recall correct import path
CountVectorizer is in the sklearn.feature_extraction.text module.
Step 2: Match options to correct syntax
Only "from sklearn.feature_extraction.text import CountVectorizer" uses the correct 'from ... import ...' syntax and the correct module path.
Final Answer:
from sklearn.feature_extraction.text import CountVectorizer -> Option B
Quick Check:
Correct import path = from sklearn.feature_extraction.text import CountVectorizer [OK]
Hint: CountVectorizer is in sklearn.feature_extraction.text [OK]
Common Mistakes:
- Using wrong module path
- Incorrect import syntax
- Trying to import from sklearn.text
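3. What will be the output shape of the matrix after applying CountVectorizer with default settings on these two sentences:
['I love cats', 'Cats love me']?
medium
Solution
Step 1: Identify unique words
CountVectorizer lowercases the text, so 'Cats' and 'cats' are the same word. Its default token pattern keeps only tokens of two or more characters, so 'I' is dropped. The vocabulary is 'cats', 'love', 'me'.
Step 2: Count sentences and features
There are 2 sentences and 3 vocabulary words, so the matrix shape is (2, 3).
Final Answer:
(2, 3). A shape of (2, 4) would result only if one-character tokens such as 'I' were kept, for example with a custom token_pattern.
Quick Check:
2 sentences, 3 words = (2, 3) [OK]
Hint: Count sentences for rows and vocabulary words for columns [OK]
Common Mistakes:
- Counting words per sentence instead of unique words
- Mixing rows and columns in shape
- Treating 'Cats' and 'cats' as different words
- Counting one-character tokens such as 'I' as features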
4. The following code throws an error on recent scikit-learn versions. What is the mistake?
```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello world', 'hello']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())
print(vectorizer.get_feature_names())
```
medium
Solution
Step 1: Identify deprecated method
get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions the call raises an AttributeError.
Step 2: Use correct method
Replace get_feature_names() with get_feature_names_out() to fix the error.
Final Answer:
get_feature_names() is deprecated, should use get_feature_names_out() -> Option A
Quick Check:
Use get_feature_names_out() not get_feature_names() [OK]
Hint: Use get_feature_names_out() instead of deprecated get_feature_names() [OK]
Common Mistakes:
- Thinking fit_transform() is wrong
- Assuming toarray() is invalid
- Believing CountVectorizer needs language parameter
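If the same code has to run on both old and new scikit-learn releases, one possible approach is to check which method exists before calling it. The helper name feature_names below is hypothetical and only a sketch. On any recent install, calling get_feature_names_out() directly is enough.

```python
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Hypothetical helper: return feature names on old and new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Fallback for old releases that only have the deprecated method.
    return list(vectorizer.get_feature_names())


texts = ['hello world', 'hello']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(X.toarray())
names = feature_names(vectorizer)
print(names)
```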
5. You have a list of sentences with some words repeated many times. How can you use CountVectorizer to ignore words that appear in more than 50% of the sentences?
hard
Solution
Step 1: Understand max_df parameter
max_df=0.5 tells CountVectorizer to ignore words that appear in more than 50% of documents.
Step 2: Compare other options
min_df sets a minimum document frequency, stop_words removes a fixed list of common English words, and max_features caps the vocabulary size. None of them drops words for appearing in too many documents.
Final Answer:
Set the parameter max_df=0.5 when creating CountVectorizer -> Option C
Quick Check:
max_df filters frequent words by document frequency [OK]
Hint: Use max_df to exclude very common words [OK]
Common Mistakes:
- Confusing max_df with min_df
- Thinking stop_words removes all frequent words
- Using max_features to filter frequency
