Recall & Review
beginner
What is an N-gram in natural language processing?
An N-gram is a sequence of N words or tokens that appear together in text. For example, a 2-gram (bigram) is two words in a row, like "good morning."
beginner
What is the difference between a unigram, bigram, and trigram?
A unigram is a single word, a bigram is a pair of two consecutive words, and a trigram is a group of three consecutive words.
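As a minimal sketch of these definitions, assuming simple whitespace tokenization (the helper name `ngrams` is hypothetical, not from any library), n-grams can be produced by sliding a window of size n over the token list:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text shorter than n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "good morning to you".split()
unigrams = ngrams(tokens, 1)   # single words
bigrams = ngrams(tokens, 2)    # pairs of consecutive words
trigrams = ngrams(tokens, 3)   # triples of consecutive words
print(bigrams)
```

A text with k tokens contains k - n + 1 n-grams, so the four-word example has three bigrams and two trigrams.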
intermediate
Why are N-grams useful in language models?
An N-gram model predicts the next word from the previous N-1 words. This captures local context, which helps with tasks such as text prediction and spelling correction.
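To illustrate prediction from the previous N-1 words, here is a rough sketch of a bigram (N=2) predictor built from raw counts. The function names and the tiny corpus are made up for illustration, and a real model would also need smoothing for unseen words:

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count which word follows each word in whitespace-tokenized sentences."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1
    return followers


def predict_next(followers, prev_word):
    """Return the most frequent follower of prev_word, or None if unseen."""
    counts = followers.get(prev_word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus
corpus = [
    "good morning to you",
    "good morning everyone",
    "good night everyone",
]
model = train_bigram_model(corpus)
print(predict_next(model, "good"))
```

In this toy corpus "good" is followed by "morning" twice and "night" once, so "morning" is the prediction. A word never seen in training returns None.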
intermediate
What is a limitation of using large N in N-grams?
Using a large N (such as 5 or more) causes data sparsity: most long sequences appear rarely or never in the training text, so their counts are unreliable and the model struggles to generalize.
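One rough way to see sparsity is to measure what share of distinct n-grams occur exactly once as n grows. The sketch below uses a made-up sentence and a hypothetical helper `singleton_share`; real corpora show the same trend at much larger scale:

```python
from collections import Counter


def singleton_share(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Hypothetical example text
text = "the cat sat on the mat and the cat lay on the mat"
tokens = text.split()
shares = {n: singleton_share(tokens, n) for n in (1, 2, 3)}
for n, share in shares.items():
    print(n, round(share, 2))
```

The share of one-off n-grams rises with n, which is why higher-order models need far more data, or smoothing and back-off, to give usable estimates.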
beginner
How can N-grams be used to detect common phrases?
By counting how often N-grams appear in text, we can find common phrases or word combinations that occur frequently, like "New York City" or "machine learning."
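The counting idea can be sketched with `collections.Counter`. The documents below are invented, and `top_ngrams` is a hypothetical helper that assumes lowercase whitespace tokenization:

```python
from collections import Counter


def top_ngrams(texts, n, k):
    """Return the k most frequent n-grams across texts with their counts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


# Hypothetical documents
docs = [
    "machine learning is fun",
    "I study machine learning",
    "deep learning extends machine learning",
]
top = top_ngrams(docs, 2, 1)
print(top)
```

Here "machine learning" appears three times while every other bigram appears once, so it stands out as a common phrase.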
What does a bigram represent?
A. Two consecutive words
B. One single word
C. Three consecutive words
D. A sentence
A bigram is a sequence of two consecutive words in text.
Which N-gram size is called a trigram?
A. 2
B. 3
C. 1
D. 4
A trigram consists of three consecutive words.
What problem can happen if N is too large in N-grams?
A. Data sparsity
B. Too many common phrases
C. Words lose meaning
D. Model runs faster
Large N causes data sparsity because many sequences appear rarely, making learning difficult.
How do N-grams help in text prediction?
A. By counting characters
B. By ignoring previous words
C. By translating text
D. By looking at previous N-1 words to predict the next
N-grams use the previous N-1 words to predict the next word, helping with text prediction.
Which of these is an example of a unigram?
A. "machine learning"
B. "learning model"
C. "machine"
D. "deep neural network"
A unigram is a single word, like "machine."
Explain what N-grams are and how they are used in language processing.
Think about sequences of words and how they help computers understand text.
Describe one advantage and one limitation of using N-grams in text analysis.
Consider what N-grams help with and what problems happen when N grows.
Practice
1. What is an n-gram in natural language processing?
easy
A. A random selection of n words from a text
B. A single word repeated n times
C. A sentence with n words
D. A group of n consecutive words in a text
Solution
Step 1: Understand the definition of n-gram
An n-gram is defined as a sequence of n consecutive words appearing together in text.
Step 2: Compare options with definition
Only option D, "A group of n consecutive words in a text," matches the definition: the words must be consecutive, not randomly selected or repeated.
Final Answer:
A group of n consecutive words in a text -> Option D
Quick Check:
n-gram = consecutive words [OK]
Hint: Remember: n-gram means consecutive words, not random ones [OK]
Common Mistakes:
Thinking n-gram means repeated words
Confusing n-gram with sentence length
Assuming words are randomly picked
2. Which of the following is the correct way to set up a CountVectorizer to extract bigrams in Python?
easy
A. CountVectorizer(ngram_range=(1,1))
B. CountVectorizer(ngram_range=(2,2))
C. CountVectorizer(ngram_range=(0,2))
D. CountVectorizer(ngram_range=(1,3))
Solution
Step 1: Understand ngram_range parameter
ngram_range=(2,2) extracts only bigrams (groups of exactly 2 words).
Step 2: Evaluate each option
Option A, ngram_range=(1,1), extracts unigrams only; option C, (0,2), is not a meaningful range because an n-gram must contain at least one word; option D, (1,3), extracts unigrams, bigrams, and trigrams.
Final Answer:
CountVectorizer(ngram_range=(2,2)) -> Option B
Quick Check:
bigrams = ngram_range (2,2) [OK]
Hint: Set ngram_range=(2,2) for only bigrams [OK]
Common Mistakes:
Using (1,1) which extracts unigrams
Using (0,2), which starts at a meaningless n of 0
Using (1,3) which extracts multiple n-grams
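3. Which trigrams does the sentence 'I love machine learning' contain when every word counts as a token, as in CountVectorizer(ngram_range=(3,3)) configured to keep case and one-letter words? (With its default settings, CountVectorizer lowercases the text and drops one-character tokens such as 'I'.)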
medium
A. ['I love machine', 'love machine learning']
B. ['I love', 'love machine', 'machine learning']
C. ['I', 'love', 'machine', 'learning']
D. ['I love machine learning']
Solution
Step 1: Understand trigram extraction
Trigrams are groups of 3 consecutive words. The sentence has 4 words, so possible trigrams are words 1-3 and 2-4.
Step 2: List trigrams from the sentence
First trigram: 'I love machine', second trigram: 'love machine learning'.
Final Answer:
['I love machine', 'love machine learning'] -> Option A
Quick Check:
Trigrams = groups of 3 words [OK]
Hint: Count groups of 3 consecutive words for trigrams [OK]
Common Mistakes:
Listing bigrams instead of trigrams
Listing single words instead of groups
Combining all words as one token
4. Identify the error in this code snippet for extracting bigrams:
from sklearn.feature_extraction.text import CountVectorizer
text = ['hello world']
vectorizer = CountVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(text)
print(vectorizer.get_feature_names())
medium
A. The text should be a string, not a list
B. The ngram_range should be (2,2) to extract only bigrams
C. The method get_feature_names() is deprecated and should be get_feature_names_out()
D. CountVectorizer cannot extract bigrams
Solution
Step 1: Check method usage
get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; get_feature_names_out() is the replacement.
Step 2: Validate other parts
ngram_range=(1,2) is valid for unigrams and bigrams; text as list is correct; CountVectorizer supports bigrams.
Final Answer:
get_feature_names() is deprecated and should be get_feature_names_out() -> Option C
Quick Check:
Use get_feature_names_out() for features [OK]
Hint: Use get_feature_names_out() instead of deprecated get_feature_names() [OK]
Common Mistakes:
Thinking ngram_range=(1,2) is wrong for bigrams
Assuming text must be a string, not list
Believing CountVectorizer can't extract bigrams
5. You want to build a text prediction model that uses both unigrams and bigrams but excludes any n-grams containing stop words like 'the' or 'and'. Which approach is best?
hard
A. Use CountVectorizer with ngram_range=(1,2) and stop_words='english'
B. Use CountVectorizer with ngram_range=(2,2) and no stop words removal
C. Use CountVectorizer with ngram_range=(1,1) and manually remove stop words after extraction
D. Use CountVectorizer with ngram_range=(1,3) and stop_words=None
Solution
Step 1: Understand requirements
We need unigrams and bigrams, and want to exclude stop words in any n-gram.
Step 2: Evaluate options
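Option A uses ngram_range=(1,2) to extract unigrams and bigrams and stop_words='english' to remove stop words automatically. Stop words are removed before n-grams are formed, so a bigram can join two words that were separated by a stop word in the original text. The other options miss unigrams or bigrams, keep stop words, or include trigrams.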
Final Answer:
Use CountVectorizer with ngram_range=(1,2) and stop_words='english' -> Option A
Quick Check:
Unigrams + bigrams + stop word removal = ngram_range=(1,2) with stop_words='english' [OK]
Hint: Set ngram_range and stop_words='english' to filter stop words [OK]