N-grams help us understand how words appear together in text. They show sequences of words to find patterns or predict the next word.
N-grams in NLP
from sklearn.feature_extraction.text import CountVectorizer

# n (an integer >= 1) and corpus (a list of strings) are placeholders you supply
vectorizer = CountVectorizer(ngram_range=(n, n))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
ngram_range=(n, n) means you get only n-grams of size n (like bigrams if n=2).
You can set ngram_range=(1, 2) to get both single words and pairs.
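To make the meaning of the two numbers concrete, here is a minimal pure-Python sketch of what a range such as (1, 2) selects. The helper name word_ngrams is hypothetical and is not part of scikit-learn; it only mimics the idea of collecting every run of n consecutive words for each n in the range.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all runs of n consecutive tokens for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # A list of k tokens contains k - n + 1 runs of length n.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "love machine learning".split()
bigrams_only = word_ngrams(tokens, 2, 2)
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)
print(bigrams_only)
print(unigrams_and_bigrams)
```

Unlike CountVectorizer, this sketch keeps the n-grams in the order they occur instead of sorting them alphabetically.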
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
# Unigrams only; the default tokenizer lowercases and drops the one-letter word 'I'
vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['learning' 'love' 'machine']

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
# Bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['love machine' 'machine learning']

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
# Unigrams and bigrams together
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
This code finds all pairs of words (bigrams) in the three sentences. It prints the list of bigrams and how many times each appears in each sentence.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

# Create bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)

# Show the bigrams found
bigrams = vectorizer.get_feature_names_out()
print('Bigrams:', bigrams)

# Show the count matrix as array
counts = X.toarray()
print('Counts matrix:\n', counts)
N-grams help capture context by looking at word groups, not just single words.
Higher n (like 3 or 4) captures longer word sequences, but each sequence appears less often, so more data is needed.
By default, CountVectorizer lowercases the text, treats punctuation as a separator between words, and drops single-character tokens such as "I" or "a".
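To see these defaults in action without fitting anything, you can call the analyzer that CountVectorizer builds internally. This is a small sketch; the sample sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = 'Hello, World! A test.'

# Default settings: lowercase, split on punctuation, drop 1-character tokens.
default_analyzer = CountVectorizer().build_analyzer()
default_tokens = list(default_analyzer(text))

# Same tokenizer, but keep the original casing.
cased_analyzer = CountVectorizer(lowercase=False).build_analyzer()
cased_tokens = list(cased_analyzer(text))

print(default_tokens)  # ['hello', 'world', 'test']
print(cased_tokens)    # ['Hello', 'World', 'test']
```

The word "A" disappears in both cases because the default token pattern only matches tokens of two or more characters.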
N-grams are groups of n words appearing together in text.
They help find patterns and improve text predictions.
Use CountVectorizer with ngram_range to extract n-grams easily.
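As a rough illustration of how n-grams support next-word prediction, the sketch below counts which word follows each word and suggests the most frequent follower. The function names train_bigram_model and predict_next are hypothetical, and splitting on whitespace is a simplification, not how a production language model works.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of word, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


corpus = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]
model = train_bigram_model(corpus)
print(predict_next(model, 'machine'))  # learning
print(predict_next(model, 'Python'))   # None
```

"machine" is followed by "learning" in two sentences, so that is the suggestion; "Python" only ever ends a sentence, so there is nothing to suggest.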
Practice
Solution
Step 1: Understand the definition of n-gram
An n-gram is defined as a sequence of n consecutive words appearing together in text.
Step 2: Compare options with definition
Only "A group of n consecutive words in a text" correctly describes an n-gram as consecutive words, not random or repeated words.
Final Answer:
A group of n consecutive words in a text -> Option D
Quick Check:
n-gram = consecutive words [OK]
- Thinking n-gram means repeated words
- Confusing n-gram with sentence length
- Assuming words are randomly picked
Solution
Step 1: Understand ngram_range parameter
ngram_range=(2,2) extracts only bigrams (groups of exactly 2 words).
Step 2: Evaluate each option
CountVectorizer(ngram_range=(1,1)) extracts unigrams only; C is invalid because 0 is not a valid n; D extracts unigrams to trigrams.
Final Answer:
CountVectorizer(ngram_range=(2,2)) -> Option B
Quick Check:
bigrams = ngram_range (2,2) [OK]
- Using (1,1) which extracts unigrams
- Using (0,2) which is invalid
- Using (1,3) which extracts multiple n-grams
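'I love machine learning' using CountVectorizer(ngram_range=(3,3))?
Solution
Step 1: Understand trigram extraction
Trigrams are groups of 3 consecutive words. The sentence has 4 words, so possible trigrams are words 1-3 and 2-4.
Step 2: List trigrams from the sentence
First trigram: 'I love machine', second trigram: 'love machine learning'. Note that CountVectorizer's default settings lowercase the text and drop single-character tokens such as 'I', so running it unchanged on this sentence returns only 'love machine learning'; the two-trigram answer assumes every word is kept as a token.
Final Answer:
['I love machine', 'love machine learning'] -> Option A
Quick Check: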
Trigrams = groups of 3 words [OK]
- Listing bigrams instead of trigrams
- Listing single words instead of groups
- Combining all words as one token
from sklearn.feature_extraction.text import CountVectorizer

text = ['hello world']
vectorizer = CountVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(text)
print(vectorizer.get_feature_names())
Solution
Step 1: Check method usage
get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, where calling it raises an AttributeError; get_feature_names_out() is the correct method.
Step 2: Validate other parts
ngram_range=(1,2) is valid for unigrams and bigrams; text as list is correct; CountVectorizer supports bigrams.
Final Answer:
get_feature_names() is deprecated and should be get_feature_names_out() -> Option C
Quick Check:
Use get_feature_names_out() for features [OK]
- Thinking ngram_range=(1,2) is wrong for bigrams
- Assuming text must be a string, not list
- Believing CountVectorizer can't extract bigrams
Solution
Step 1: Understand requirements
We need unigrams and bigrams, and want to exclude stop words in any n-gram.
Step 2: Evaluate options
The option "Use CountVectorizer with ngram_range=(1,2) and stop_words='english'" covers both unigrams and bigrams and removes English stop words automatically. The others either miss unigrams, include stop words, or include trigrams.
Final Answer:
Use CountVectorizer with ngram_range=(1,2) and stop_words='english' -> Option A
Quick Check:
Unigrams + bigrams + stop word removal = ngram_range=(1,2) with stop_words='english' [OK]
- Not removing stop words from bigrams
- Using wrong ngram_range missing unigrams
- Including trigrams when not needed
