N-grams help us understand how words appear together in text. An n-gram is a sequence of n consecutive words; counting n-grams reveals common patterns and can be used to predict the next word.
N-grams in NLP
Introduction
N-grams are commonly used:
To predict the next word when typing on a phone keyboard.
To find common phrases in customer reviews.
To improve search engines by understanding word pairs or triples.
To detect spam messages by spotting unusual word combinations.
To analyze writing style by looking at frequent word sequences.
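As a rough illustration of the first use case, next-word prediction can be sketched by counting which word most often follows each word in some training text. The function names and the tiny training text below are made up for this example; a real keyboard would rely on far more data and on more than raw bigram counts.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical training text, chosen only for illustration
training = [
    'I love machine learning',
    'I love machine translation',
    'machine learning is fun',
]
model = build_bigram_model(training)
print(predict_next(model, 'machine'))  # learning
print(predict_next(model, 'python'))   # None
```

Here "learning" follows "machine" twice and "translation" only once, so "learning" is the prediction. A word that never appeared in the training text has no followers, so the sketch returns None.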
Syntax
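```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']  # any list of text strings
n = 2  # size of the n-grams to extract

vectorizer = CountVectorizer(ngram_range=(n, n))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
```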
ngram_range takes a pair (smallest size, largest size). ngram_range=(n, n) means you get only n-grams of size n (like bigrams if n=2).
You can set ngram_range=(1, 2) to get both single words and pairs.
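To make the meaning of ngram_range concrete, here is a minimal pure-Python sketch of the sliding window behind it. The helper word_ngrams is a hypothetical name written for this illustration, not part of scikit-learn, and it does not reproduce CountVectorizer's tokenization or its ordering of features.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n as space-joined strings."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


tokens = ['love', 'machine', 'learning']
print(word_ngrams(tokens, 2, 2))  # ['love machine', 'machine learning']
print(word_ngrams(tokens, 1, 2))  # the three single words, then the two pairs
```

A list of k tokens yields k - n + 1 n-grams of size n, so a text shorter than n words produces none.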
Examples
This gets single words (unigrams) from the sentence.
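```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']

vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(corpus)

# Features: 'learning', 'love', 'machine'
# ('I' is dropped because the default tokenizer ignores single-character tokens)
print(vectorizer.get_feature_names_out())
```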
This gets pairs of words (bigrams) from the sentence.
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']

vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)

# Features: 'love machine', 'machine learning'
print(vectorizer.get_feature_names_out())
```
This gets both single words and pairs of words.
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

# Features are the unigrams and bigrams together, sorted alphabetically
print(vectorizer.get_feature_names_out())
```
Sample Model
This code finds all pairs of words (bigrams) in the three sentences. It prints the list of bigrams and how many times each appears in each sentence.
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

# Create bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)

# Show the bigrams found
bigrams = vectorizer.get_feature_names_out()
print('Bigrams:', bigrams)

# Show the count matrix as array
counts = X.toarray()
print('Counts matrix:\n', counts)
```
Important Notes
N-grams help capture context by looking at word groups, not just single words.
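A higher n (like 3 or 4) captures longer word sequences, but each sequence occurs less often, so more data is needed to get reliable counts.
By default, CountVectorizer lowercases the text, treats punctuation as a separator between words, and ignores single-character tokens such as "I".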
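The note about higher n can be checked on a toy corpus. The sketch below counts n-grams for n = 1, 2, 3 and reports how many of them occur more than once. It uses plain whitespace splitting rather than CountVectorizer, and both the helper name and the corpus are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count word n-grams of size n across whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts


# Hypothetical toy corpus, used only to illustrate the trend
sentences = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python',
]

repeated_by_n = {}
for n in (1, 2, 3):
    counts = ngram_counts(sentences, n)
    repeated = sum(1 for c in counts.values() if c > 1)
    repeated_by_n[n] = repeated
    print(f'n={n}: {len(counts)} distinct n-grams, {repeated} seen more than once')
```

On this corpus four single words repeat, two bigrams repeat, and no trigram repeats, which is the sparsity effect in miniature: longer sequences need more text before any of them recur.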
Summary
N-grams are groups of n consecutive words in text.
They help find patterns and improve text predictions.
Use CountVectorizer with ngram_range to extract n-grams easily.