Practice

1. What is an n-gram in natural language processing?

easy

A. A random selection of n words from a text

B. A single word repeated n times

C. A sentence with n words

D. A group of n consecutive words in a text

Solution

Step 1: Understand the definition of n-gram
An n-gram is a sequence of n consecutive items taken from a text; in this quiz the items are words.
Step 2: Compare options with definition
Only option D, a group of n consecutive words in a text, matches this definition. The other options describe random words, a repeated word, or a whole sentence.
Final Answer:
A group of n consecutive words in a text -> Option D
Quick Check:
n-gram = consecutive words [OK]

Hint: An n-gram means consecutive words, not random ones [OK]

Common Mistakes:

Thinking n-gram means repeated words
Confusing n-gram with sentence length
Assuming words are randomly picked
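
The definition above can be sketched in a few lines of plain Python. The helper name word_ngrams and the whitespace tokenization are assumptions made for illustration; they are not part of any library.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    words = text.split()  # naive whitespace tokenization (assumption)
    # Slide a window of width n across the word list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "the cat sat on the mat"
bigrams = word_ngrams(sentence, 2)
print(bigrams)
```

Each bigram overlaps the next by one word, which is what "consecutive" means here: the words are neither picked at random nor repeated.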

2. Which of the following is the correct way to set up a CountVectorizer to extract bigrams in Python?

easy

A. CountVectorizer(ngram_range=(1,1))

B. CountVectorizer(ngram_range=(2,2))

C. CountVectorizer(ngram_range=(0,2))

D. CountVectorizer(ngram_range=(1,3))

Solution

Step 1: Understand ngram_range parameter
ngram_range=(2,2) extracts only bigrams (groups of exactly 2 words).
Step 2: Evaluate each option
A extracts unigrams only; C starts at 0, which is not a meaningful n-gram size; D extracts unigrams, bigrams, and trigrams.
Final Answer:
CountVectorizer(ngram_range=(2,2)) -> Option B
Quick Check:
bigrams = ngram_range (2,2) [OK]

Hint: Set ngram_range=(2,2) for only bigrams [OK]

Common Mistakes:

Using (1,1) which extracts unigrams
Using (0,2), where 0 is not a meaningful n-gram size
Using (1,3) which extracts multiple n-grams

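3. What will be the output tokens when extracting word trigrams from the sentence 'I love machine learning' using CountVectorizer(ngram_range=(3,3)), assuming lowercasing is disabled and the token pattern is set to keep single-character words such as 'I'?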

medium

A. ['I love machine', 'love machine learning']

B. ['I love', 'love machine', 'machine learning']

C. ['I', 'love', 'machine', 'learning']

D. ['I love machine learning']

Solution

Step 1: Understand trigram extraction
Trigrams are groups of 3 consecutive words. The sentence has 4 words, so possible trigrams are words 1-3 and 2-4.
Step 2: List trigrams from the sentence
First trigram: 'I love machine', second trigram: 'love machine learning'.
Final Answer:
['I love machine', 'love machine learning'] -> Option A
Quick Check:
Trigrams = groups of 3 words [OK]

Hint: Count groups of 3 consecutive words for trigrams [OK]

Common Mistakes:

Listing bigrams instead of trigrams
Listing single words instead of groups
Combining all words as one token
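
One detail is worth checking when running this with scikit-learn. Its documentation gives the default token_pattern of CountVectorizer as (?u)\b\w\w+\b, which keeps only tokens of two or more characters, and lowercase=True by default. The sketch below imitates that tokenization with the standard re module instead of calling scikit-learn, so the helper trigrams and its arguments are illustrative assumptions. It suggests that 'I' survives only when the pattern is relaxed to accept single-character tokens.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # scikit-learn's documented default
RELAXED_PATTERN = r"(?u)\b\w+\b"     # also keeps one-character tokens


def trigrams(text, pattern, lowercase):
    """Tokenize with a regex, then join each run of 3 consecutive tokens."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(pattern, text)
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


sentence = "I love machine learning"
default_like = trigrams(sentence, DEFAULT_PATTERN, lowercase=True)
relaxed = trigrams(sentence, RELAXED_PATTERN, lowercase=False)
print(default_like)  # ['love machine learning']
print(relaxed)       # ['I love machine', 'love machine learning']
```

With the default-like settings only three tokens remain, so there is a single trigram. The two-trigram answer corresponds to the relaxed pattern with case preserved.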

4. Identify the error in this code snippet for extracting unigrams and bigrams:

from sklearn.feature_extraction.text import CountVectorizer
text = ['hello world']
vectorizer = CountVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(text)
print(vectorizer.get_feature_names())

medium

A. The text should be a string, not a list

B. The ngram_range should be (2,2) to extract only bigrams

C. The method get_feature_names() is deprecated and should be get_feature_names_out()

D. CountVectorizer cannot extract bigrams

Solution

Step 1: Check method usage
get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, where calling it raises an AttributeError; get_feature_names_out() is the replacement.
Step 2: Validate other parts
ngram_range=(1,2) is valid for unigrams and bigrams; text as list is correct; CountVectorizer supports bigrams.
Final Answer:
get_feature_names() is deprecated and should be get_feature_names_out() -> Option C
Quick Check:
Use get_feature_names_out() for features [OK]

Hint: Use get_feature_names_out() instead of deprecated get_feature_names() [OK]

Common Mistakes:

Thinking ngram_range=(1,2) is wrong for bigrams
Assuming text must be a string, not list
Believing CountVectorizer can't extract bigrams
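
If the same code has to run on both older and newer scikit-learn releases, one possible approach is to check which method exists. This is a minimal sketch; the helper name feature_names is hypothetical and not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return fitted feature names on both newer and older scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Fallback for releases that predate get_feature_names_out().
    return list(vectorizer.get_feature_names())


text = ["hello world"]  # a list of documents, as CountVectorizer expects
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit_transform(text)
names = feature_names(vectorizer)
print(names)
```

With ngram_range=(1,2) the result holds both unigrams and the single bigram, which confirms that the range itself was not the problem in the snippet.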

5. You want to build a text prediction model that uses both unigrams and bigrams but excludes any n-grams containing stop words like 'the' or 'and'. Which approach is best?

hard

A. Use CountVectorizer with ngram_range=(1,2) and stop_words='english'

B. Use CountVectorizer with ngram_range=(2,2) and no stop words removal

C. Use CountVectorizer with ngram_range=(1,1) and manually remove stop words after extraction

D. Use CountVectorizer with ngram_range=(1,3) and stop_words=None

Solution

Step 1: Understand requirements
We need unigrams and bigrams, and want to exclude stop words in any n-gram.
Step 2: Evaluate options
Option A uses ngram_range=(1,2) for unigrams and bigrams, and stop_words='english' drops stop words before the n-grams are built, so no extracted n-gram contains one. Option B misses unigrams and keeps stop words, option C misses bigrams, and option D keeps stop words and adds trigrams.
Final Answer:
Use CountVectorizer with ngram_range=(1,2) and stop_words='english' -> Option A
Quick Check:
Unigrams + bigrams + stop word removal = ngram_range=(1,2) with stop_words='english' [OK]

Hint: Set ngram_range and stop_words='english' to filter stop words [OK]

Common Mistakes:

Not removing stop words from bigrams
Using wrong ngram_range missing unigrams
Including trigrams when not needed
