What is Bag of Words (CountVectorizer) in NLP?

Introduction

Bag of Words represents a text by counting how many times each word appears, ignoring word order. These counts turn text into numeric features that machine learning models can learn from. It is useful in situations like these:

You want to find out what words are common in customer reviews.

You need to prepare text data for a spam email detector.

You want to compare documents by their word content.

You are building a simple text classifier like sentiment analysis.

You want to convert text into numbers for machine learning models.
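
The counting idea from the introduction can be sketched in plain Python before reaching for a library. This is a minimal illustration, not how CountVectorizer is implemented; the function name `bag_of_words` and the simplified tokenizer are assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count rows) for a list of strings (hypothetical helper)."""
    # Lowercase, then keep runs of letters/digits; a deliberately simple tokenizer.
    tokenized = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]
    # The vocabulary is every distinct word, sorted for a stable column order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word; words absent from the text count as 0.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(rows)        # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one text and each column is one word, which is the same layout CountVectorizer produces.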

Syntax

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(list_of_texts)

# To see the words (features):
words = vectorizer.get_feature_names_out()

# To see the counts:
counts = X.toarray()

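fit_transform learns the vocabulary from the texts and returns a sparse matrix of counts, with one row per text and one column per word.

get_feature_names_out returns the learned words in column order. (Older scikit-learn versions used get_feature_names, which has since been removed.)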
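
fit_transform combines two steps that can also be called separately: fit learns the vocabulary, and transform counts words against it. The sketch below uses made-up training texts to show that a word not seen during fit is simply ignored later.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["red apple", "green apple"]  # illustrative data, not from the lesson

vectorizer = CountVectorizer()
vectorizer.fit(train_texts)            # learn the vocabulary only

# vocabulary_ maps each learned word to its column index.
print(vectorizer.vocabulary_)

# "banana" was never seen during fit, so transform drops it.
new_counts = vectorizer.transform(["red banana apple apple"]).toarray()
print(new_counts)  # [[2 0 1]] -> apple=2, green=0, red=1
```

Fitting once and reusing transform keeps the columns identical for training data and new data.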

Examples

This example counts words in two sentences and shows the word list and counts.

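from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cats", "Cats are great"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
# "I" is dropped: the default tokenizer skips single-character words.
print(vectorizer.get_feature_names_out())  # ['are' 'cats' 'great' 'love']
print(X.toarray())  # rows: [0 1 0 1] and [1 1 1 0]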

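This example shows how a repeated word is counted within each text.

from sklearn.feature_extraction.text import CountVectorizer

texts = ["apple apple orange", "banana apple"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())  # ['apple' 'banana' 'orange']
# "apple" appears twice in the first text, so its count there is 2.
print(X.toarray())  # rows: [2 0 1] and [1 1 0]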

Sample Model

This program turns four sentences into a matrix of word counts. It prints the words found and the count of each word per sentence.

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
    "Python coding is great"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

words = vectorizer.get_feature_names_out()
counts = X.toarray()

print("Words found:", words)
print("Counts matrix:")
for i, text in enumerate(texts):
    print(f"Text {i+1}: '{text}'")
    print(counts[i])

Important Notes

By default, CountVectorizer lowercases the text, treats punctuation as a separator, and drops single-character tokens such as 'I' and 'a'.

Stop words (common words like 'the', 'is') can be removed by setting stop_words='english'.

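fit_transform returns a SciPy sparse matrix, which stores only the non-zero counts. Calling toarray() converts it to a dense NumPy array that shows every count, but this can use far more memory when the vocabulary is large.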

Summary

Bag of Words counts how often each word appears in text.

CountVectorizer turns text into numbers for machine learning.

You can see the words found and their counts per text.

Practice

1. What does the Bag of Words model do in text processing?

easy

A. Counts how often each word appears in the text

B. Translates text into another language

C. Removes all punctuation from the text

D. Generates summaries of the text
