
N-grams in NLP

Introduction

An n-gram is a sequence of n consecutive words in a text. Counting n-grams shows which words tend to appear together, which helps to find patterns or predict the next word.
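To make the idea concrete, here is a minimal sketch in plain Python that slides a window of n words across a sentence. The helper name `word_ngrams` is made up for this illustration, and splitting on whitespace is a simplifying assumption, not how any particular library tokenizes text.

```python
def word_ngrams(text, n):
    """Return the sequences of n consecutive words in text, in order."""
    words = text.lower().split()
    # Slide a window of width n across the word list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I love machine learning"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams)   # ['i love', 'love machine', 'machine learning']
print(trigrams)  # ['i love machine', 'love machine learning']
```

A sentence with fewer than n words simply produces an empty list.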

Typical uses of n-grams:
To predict the next word when typing on a phone keyboard.
To find common phrases in customer reviews.
To improve search engines by understanding word pairs or triples.
To detect spam messages by spotting unusual word combinations.
To analyze writing style by looking at frequent word sequences.
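The first use case in the list, next-word prediction, can be sketched with nothing more than bigram counts: for each word, remember which words followed it and suggest the most frequent one. The function names and the three training sentences below are invented for this example, and a real keyboard would use far more data and a more careful model.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # zip pairs every word with its successor, i.e. every bigram.
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of word, or None if unseen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical training text, only for illustration.
training_sentences = [
    "I love machine learning",
    "I love coding in Python",
    "I love machine translation",
]
model = build_bigram_model(training_sentences)
suggestion = predict_next(model, "love")
print(suggestion)  # machine
```

Here "love" was followed by "machine" twice and by "coding" once, so "machine" is suggested.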
Syntax
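from sklearn.feature_extraction.text import CountVectorizer

n = 2  # n-gram size; 2 gives bigrams
corpus = ['I love machine learning']  # any list of strings

vectorizer = CountVectorizer(ngram_range=(n, n))
X = vectorizer.fit_transform(corpus)  # sparse matrix of n-gram counts
print(vectorizer.get_feature_names_out())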

ngram_range=(min_n, max_n) sets the smallest and largest n-gram size to extract, both inclusive. ngram_range=(n, n) means you get only n-grams of size n (like bigrams if n=2).

You can set ngram_range=(1, 2) to get both single words and pairs.

Examples
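This gets single words (unigrams) from the sentence. By default, CountVectorizer lowercases the text and skips one-character tokens, so 'I' does not appear in the output.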
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
This gets pairs of words (bigrams) from the sentence. Because the one-character token 'I' is skipped by default, 'i love' is not among the bigrams.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
This gets both single words and pairs of words.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
Sample Model

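This code extracts the pairs of words (bigrams) from the three sentences. It prints the list of bigrams and a count matrix with one row per sentence and one column per bigram, so each entry is how many times that bigram appears in that sentence. With the default settings the one-letter word 'I' is dropped, so 'i love' is not among the bigrams.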

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

# Create bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)

# Show the bigrams found
bigrams = vectorizer.get_feature_names_out()
print('Bigrams:', bigrams)

# Show the count matrix as array
counts = X.toarray()
print('Counts matrix:\n', counts)
Important Notes

N-grams help capture context by looking at word groups, not just single words.

A higher n (like 3 or 4) captures longer word sequences, but each specific sequence occurs less often, so the counts become sparse and more text is needed to get reliable statistics.
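The effect of a higher n can be seen even on a tiny corpus. This sketch reuses the three sample sentences and counts, for n = 1, 2 and 3, how many distinct n-grams exist and how many of them occur more than once. The corpus is far too small to draw real conclusions from, so treat the numbers as an illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

repeated_by_n = {}
for n in (1, 2, 3):
    vectorizer = CountVectorizer(ngram_range=(n, n))
    X = vectorizer.fit_transform(corpus)
    # Total count of each n-gram across all sentences.
    totals = np.asarray(X.sum(axis=0)).ravel()
    repeated_by_n[n] = int((totals > 1).sum())
    print(f"n={n}: {len(totals)} distinct, {repeated_by_n[n]} repeated")

# n=1: 8 distinct, 3 repeated
# n=2: 7 distinct, 1 repeated
# n=3: 5 distinct, 0 repeated
```

By n=3 no sequence is seen twice, which is why longer n-grams need more text before their counts say anything useful.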

By default, CountVectorizer lowercases the text, treats punctuation as a separator between tokens, and ignores one-character tokens such as 'I' or 'a'.
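To check what the default preprocessing does to a piece of text, you can ask the vectorizer for its analyzer and call it directly. This is a small sketch using `build_analyzer()`, and the review sentence is made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer applies lowercasing and tokenization without fitting anything.
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer("Great product! GREAT price, great service.")

print(tokens)  # ['great', 'product', 'great', 'price', 'great', 'service']
```

All three spellings of "great" collapse to the same token, and the punctuation marks are gone.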

Summary

N-grams are sequences of n consecutive words in text.

They help find patterns and improve text predictions.

Use CountVectorizer with ngram_range to extract n-grams easily.