
Bag of Words (CountVectorizer) in NLP

Introduction

Bag of Words represents a text by counting how many times each word appears in it, ignoring grammar and word order. This turns text into numeric vectors that machine learning models can learn from.
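To make the counting idea concrete, here is a minimal pure-Python sketch of the technique. The helper name `bag_of_words` and the naive whitespace tokenizer are assumptions made for illustration; this is not how scikit-learn implements CountVectorizer internally.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in texts[i].

    Hypothetical helper: lowercases and splits on whitespace only.
    """
    tokenized = [text.lower().split() for text in texts]
    # The vocabulary is every distinct word, sorted alphabetically.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this text.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


vocabulary, rows = bag_of_words(["apple apple orange", "banana apple"])
print(vocabulary)  # ['apple', 'banana', 'orange']
print(rows)        # [[2, 0, 1], [1, 1, 0]]
```

Each row is one text and each column is one word, which is the same layout CountVectorizer produces.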

Bag of Words is useful when:
You want to find out which words are common in customer reviews.
You need to prepare text data for a spam email detector.
You want to compare documents by their word content.
You are building a simple text classifier, such as a sentiment analyzer.
You want to convert text into numbers for machine learning models.
Syntax
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(list_of_texts)

# To see the words (features):
words = vectorizer.get_feature_names_out()

# To see the counts:
counts = X.toarray()

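fit_transform learns the vocabulary (the set of distinct words) from the texts and returns a matrix of counts, with one row per text and one column per word.

get_feature_names_out returns the learned words in column order, so the word at position j labels column j of the matrix.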

Examples
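This example counts words in two sentences and shows the word list and counts. Note that "I" does not appear in the word list, because single-character tokens are dropped by default.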
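from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cats", "Cats are great"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())  # ['are' 'cats' 'great' 'love']
print(X.toarray())  # rows: [0 1 0 1] and [1 1 1 0]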
This example shows how a repeated word is counted within each text.
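from sklearn.feature_extraction.text import CountVectorizer

texts = ["apple apple orange", "banana apple"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())  # ['apple' 'banana' 'orange']
print(X.toarray())  # rows: [2 0 1] and [1 1 0]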
Sample Model

This program turns four sentences into a matrix of word counts. It prints the words found, then each sentence followed by its row of counts, where each position in the row corresponds to one word in the list.

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
    "Python coding is great"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

words = vectorizer.get_feature_names_out()
counts = X.toarray()

print("Words found:", words)
print("Counts matrix:")
for i, text in enumerate(texts):
    print(f"Text {i+1}: '{text}'")
    print(counts[i])
Important Notes

By default, CountVectorizer lowercases the text, ignores punctuation, and drops single-character tokens such as "I" and "a".
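The default tokenization can be approximated with a regular expression. The sketch below uses the pattern that CountVectorizer documents as its default `token_pattern`; the helper name `default_like_tokens` is hypothetical, and the real analyzer has further options (such as n-grams and accent stripping) that this sketch leaves out.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_like_tokens(text):
    """Lowercase the text, then keep only tokens of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())


tokens = default_like_tokens("I love cats, and cats LOVE me!")
print(tokens)  # ['love', 'cats', 'and', 'cats', 'love', 'me']
```

The comma and exclamation mark disappear, "LOVE" becomes "love", and "I" is dropped because it is a single character.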

Stop words (common words like 'the', 'is') can be removed by setting stop_words='english'.

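fit_transform returns a sparse matrix, which stores only the non-zero counts. Calling toarray() converts it to a dense array that is easier to read but can use far more memory when the vocabulary is large.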

Summary

Bag of Words counts how often each word appears in text.

CountVectorizer turns text into numbers for machine learning.

You can see the words found and their counts per text.