N-grams in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this code generating bigrams?
Given the code below that generates bigrams from a sentence, what is the output?
sentence = "I love machine learning"
words = sentence.split()
bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
print(bigrams)
A. [('I', 'love'), ('love', 'machine'), ('learning', 'machine')]
B. [('I', 'love'), ('love', 'machine'), ('machine', 'learning')]
C. [('love', 'I'), ('machine', 'love'), ('learning', 'machine')]
D. [('I', 'machine'), ('love', 'learning')]
💡 Hint
Remember bigrams are pairs of consecutive words.
🧠 Conceptual
intermediate
Which statement best describes the purpose of n-grams in text processing?
Choose the best description of why n-grams are used in natural language processing.
A. They capture sequences of words to understand context and word order.
B. They remove stop words to reduce noise in text data.
C. They translate text from one language to another.
D. They count the total number of characters in a text.
💡 Hint
Think about how n-grams help capture word relationships.
Hyperparameter
advanced
Choosing the right n for n-grams
If you want to capture longer phrases but avoid very sparse data, which n-gram size is usually the best choice?
A. Unigrams (n=1) because they are simple and cover all words.
B. Four-grams (n=4) because they capture the most detailed phrases.
C. Bigrams (n=2) because they balance context and data sparsity.
D. Trigrams (n=3) because longer sequences always improve accuracy.
💡 Hint
Longer n-grams capture more context but can cause data sparsity.
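To make the sparsity mentioned in the hint concrete, the sketch below counts distinct n-grams in a tiny made-up corpus and reports what share of them occur only once. The corpus and the helper name ngram_counts are assumptions chosen for illustration; a real corpus would be far larger, but the trend is the same.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Toy corpus (an assumption for illustration only).
corpus = ("the cat sat on the mat . the dog sat on the log . "
          "the cat saw the dog .").split()

singleton_share = {}
for n in (1, 2, 3, 4):
    counts = ngram_counts(corpus, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    singleton_share[n] = singletons / len(counts)
    print(f"n={n}: {len(counts)} distinct n-grams, "
          f"{singleton_share[n]:.0%} seen only once")
```

As n grows, more of the n-grams are seen a single time, which leaves a model with little evidence for each one.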
Metrics
advanced
Evaluating n-gram language models
Which metric is commonly used to evaluate the quality of an n-gram language model?
A. Perplexity, which measures how well the model predicts a sample.
B. Accuracy, which counts correct word predictions only.
C. Mean Squared Error, used for regression tasks.
D. F1 Score, used for classification balance.
💡 Hint
This metric measures uncertainty in predicting text sequences.
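As one illustration of scoring an n-gram language model by how uncertain it is about a text, the sketch below trains a bigram model with add-one (Laplace) smoothing and computes perplexity on two short word sequences. The training text, the smoothing choice, and the function names are assumptions made for the example, not part of the question.

```python
import math
from collections import Counter

def train_bigram_model(tokens):
    """Return prob(prev, word) estimated with add-one (Laplace) smoothing."""
    context_counts = Counter(tokens[:-1])   # how often each word has a successor
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(set(tokens))

    def prob(prev, word):
        return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)

    return prob

def perplexity(prob, tokens):
    """exp of the average negative log-probability per predicted word."""
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(prob(prev, word)) for prev, word in pairs)
    return math.exp(-log_prob / len(pairs))

train = "the cat sat on the mat the dog sat on the log".split()
prob = train_bigram_model(train)

familiar = "the cat sat on the log".split()    # word order seen in training
scrambled = "log the on sat cat the".split()   # same words, unseen order
ppl_familiar = perplexity(prob, familiar)
ppl_scrambled = perplexity(prob, scrambled)
print(f"familiar: {ppl_familiar:.2f}, scrambled: {ppl_scrambled:.2f}")
```

Lower perplexity means the model found the sequence less surprising, so the familiar word order scores lower than the scrambled one.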
🔧 Debug
expert
Why does this n-gram code raise an error?
Consider this code snippet to generate trigrams. Why does it raise an IndexError?
sentence = "Data science is fun"
words = sentence.split()
trigrams = [(words[i], words[i+1], words[i+2]) for i in range(len(words))]
print(trigrams)
A. The split method does not create a list of words.
B. The print statement is missing parentheses.
C. Tuples cannot have three elements in Python.
D. The range goes too far, causing words[i+2] to exceed list length.
💡 Hint
Check the range limit for accessing words[i+2].

Practice

1. What is an n-gram in natural language processing?
easy
A. A random selection of n words from a text
B. A single word repeated n times
C. A sentence with n words
D. A group of n consecutive words in a text

Solution

  1. Step 1: Understand the definition of n-gram

    An n-gram is defined as a sequence of n consecutive words appearing together in text.
  2. Step 2: Compare options with definition

    Only option D, "A group of n consecutive words in a text", correctly describes an n-gram: the words are consecutive, not randomly chosen or repeated.
  3. Final Answer:

    A group of n consecutive words in a text -> Option D
  4. Quick Check:

    n-gram = consecutive words [OK]
Hint: An n-gram means consecutive words, not random ones.
Common Mistakes:
  • Thinking n-gram means repeated words
  • Confusing n-gram with sentence length
  • Assuming words are randomly picked
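The definition above can be sketched in a few lines of plain Python. The helper name ngrams is an assumption for illustration; it returns every run of n consecutive tokens and nothing else.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zip n staggered views of the list. zip stops at the shortest view,
    # so no index ever runs past the end of the list.
    return list(zip(*(tokens[i:] for i in range(n))))

words = "I love machine learning".split()
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A sentence with k words yields k - n + 1 n-grams, so asking for n larger than the sentence length simply returns an empty list.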
2. Which of the following is the correct way to set up a CountVectorizer to extract bigrams in Python?
easy
A. CountVectorizer(ngram_range=(1,1))
B. CountVectorizer(ngram_range=(2,2))
C. CountVectorizer(ngram_range=(0,2))
D. CountVectorizer(ngram_range=(1,3))

Solution

  1. Step 1: Understand ngram_range parameter

    ngram_range=(2,2) extracts only bigrams (groups of exactly 2 words).
  2. Step 2: Evaluate each option

    A, CountVectorizer(ngram_range=(1,1)), extracts unigrams only; C starts at n=0, which is not a meaningful n-gram size; D extracts unigrams, bigrams, and trigrams.
  3. Final Answer:

    CountVectorizer(ngram_range=(2,2)) -> Option B
  4. Quick Check:

    bigrams = ngram_range (2,2) [OK]
Hint: Set ngram_range=(2,2) to extract only bigrams.
Common Mistakes:
  • Using (1,1) which extracts unigrams
  • Using (0,2) which is invalid
  • Using (1,3) which extracts multiple n-grams
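3. What will be the output tokens when extracting trigrams from the sentence 'I love machine learning' using CountVectorizer(ngram_range=(3,3), lowercase=False, token_pattern=r'(?u)\b\w+\b'), so that case is preserved and the one-letter word 'I' is kept?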
medium
A. ['I love machine', 'love machine learning']
B. ['I love', 'love machine', 'machine learning']
C. ['I', 'love', 'machine', 'learning']
D. ['I love machine learning']

Solution

  1. Step 1: Understand trigram extraction

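    Trigrams are groups of 3 consecutive words. The sentence has 4 words, so the possible trigrams are words 1-3 and words 2-4. This assumes every word survives tokenization: with CountVectorizer's defaults, text is lowercased and one-letter tokens such as 'I' are dropped, leaving only 'love machine learning'.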
  2. Step 2: List trigrams from the sentence

    First trigram: 'I love machine', second trigram: 'love machine learning'.
  3. Final Answer:

    ['I love machine', 'love machine learning'] -> Option A
  4. Quick Check:

    Trigrams = groups of 3 words [OK]
Hint: Count groups of 3 consecutive words for trigrams.
Common Mistakes:
  • Listing bigrams instead of trigrams
  • Listing single words instead of groups
  • Combining all words as one token
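Tokenization decides which trigrams exist at all. The sketch below does not call scikit-learn; it imitates CountVectorizer's documented defaults (lowercasing, and the token pattern `(?u)\b\w\w+\b`, which needs at least two word characters) with the standard re module and compares them with a pattern that keeps one-letter words. The function name word_trigrams and the second pattern are assumptions for illustration.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's documented default
KEEP_SHORT_PATTERN = r"(?u)\b\w+\b"       # assumption: also keep 1-letter tokens

def word_trigrams(text, token_pattern, lowercase):
    """Tokenize with a regex, then join each run of 3 tokens with spaces."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(token_pattern, text)
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

sentence = "I love machine learning"
default_like = word_trigrams(sentence, DEFAULT_TOKEN_PATTERN, lowercase=True)
keep_short = word_trigrams(sentence, KEEP_SHORT_PATTERN, lowercase=False)
print(default_like)  # 'I' is dropped, so only one trigram remains
print(keep_short)    # both trigrams, with the capital 'I' preserved
```

Under the default-like settings only three tokens survive, so a single trigram is produced; keeping one-letter tokens gives the two trigrams listed in the answer.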
4. Identify the error in this code snippet for extracting bigrams:
from sklearn.feature_extraction.text import CountVectorizer
text = ['hello world']
vectorizer = CountVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(text)
print(vectorizer.get_feature_names())
medium
A. The text should be a string, not a list
B. The ngram_range should be (2,2) to extract only bigrams
C. The method get_feature_names() is deprecated and should be get_feature_names_out()
D. CountVectorizer cannot extract bigrams

Solution

  1. Step 1: Check method usage

    get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions the call raises an AttributeError; get_feature_names_out() is the replacement.
  2. Step 2: Validate other parts

    ngram_range=(1,2) is valid for unigrams and bigrams; text as list is correct; CountVectorizer supports bigrams.
  3. Final Answer:

    get_feature_names() is deprecated and should be get_feature_names_out() -> Option C
  4. Quick Check:

    Use get_feature_names_out() for features [OK]
Hint: Use get_feature_names_out() instead of the deprecated get_feature_names().
Common Mistakes:
  • Thinking ngram_range=(1,2) is wrong for bigrams
  • Assuming text must be a string, not list
  • Believing CountVectorizer can't extract bigrams
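A corrected version of the snippet might look like the sketch below, assuming scikit-learn is installed. The hasattr fallback is an assumption added so the same code also runs on releases older than 1.0; on current versions only the first branch is used.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["hello world"]  # a list of documents, as CountVectorizer expects
vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vectorizer.fit_transform(text)

# get_feature_names_out() is the current method; fall back to the old
# name only when running on an older scikit-learn release.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = list(vectorizer.get_feature_names())

print(feature_names)
```

The vocabulary comes back in sorted order and contains the two unigrams plus the single bigram 'hello world'.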
5. You want to build a text prediction model that uses both unigrams and bigrams but excludes any n-grams containing stop words like 'the' or 'and'. Which approach is best?
hard
A. Use CountVectorizer with ngram_range=(1,2) and stop_words='english'
B. Use CountVectorizer with ngram_range=(2,2) and no stop words removal
C. Use CountVectorizer with ngram_range=(1,1) and manually remove stop words after extraction
D. Use CountVectorizer with ngram_range=(1,3) and stop_words=None

Solution

  1. Step 1: Understand requirements

    We need unigrams and bigrams, and want to exclude stop words in any n-gram.
  2. Step 2: Evaluate options

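    Option A uses ngram_range=(1,2) to produce unigrams and bigrams and stop_words='english' to drop stop words automatically. B misses unigrams and keeps stop words, C misses bigrams, and D adds trigrams and keeps stop words. Note that CountVectorizer removes stop words before it builds n-grams, so a bigram can join two words that a stop word originally separated.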
  3. Final Answer:

    Use CountVectorizer with ngram_range=(1,2) and stop_words='english' -> Option A
  4. Quick Check:

    Unigrams + bigrams + stop word removal = ngram_range=(1,2) with stop_words='english' [OK]
Hint: Set ngram_range=(1,2) and stop_words='english' to filter stop words.
Common Mistakes:
  • Not removing stop words from bigrams
  • Using wrong ngram_range missing unigrams
  • Including trigrams when not needed
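The order of stop word removal and n-gram construction changes the result. CountVectorizer filters stop words out of the token list first and only then forms n-grams. The plain-Python sketch below contrasts that order with a stricter one that builds n-grams from the original tokens and discards any n-gram containing a stop word. The two-word stop list and all function names are assumptions for illustration.

```python
STOP_WORDS = {"the", "and"}  # hypothetical tiny stop list for illustration

def ngrams(tokens, n):
    """Every run of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def remove_then_build(tokens, max_n=2):
    """Drop stop words first, then form n-grams (CountVectorizer's order)."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [g for n in range(1, max_n + 1) for g in ngrams(kept, n)]

def build_then_filter(tokens, max_n=2):
    """Form n-grams from the original text, then drop any with a stop word."""
    return [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)
            if not STOP_WORDS.intersection(g)]

tokens = "cats and dogs chase the ball".split()
loose = remove_then_build(tokens)
strict = build_then_filter(tokens)
print(loose)
print(strict)
```

With the first order, ('cats', 'dogs') and ('chase', 'ball') appear as bigrams even though the words were never adjacent in the sentence; the stricter order keeps only ('dogs', 'chase'). Which behavior is preferable depends on the task.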