
N-grams in NLP - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is an N-gram in natural language processing?
An N-gram is a contiguous sequence of N tokens (usually words) taken from a text. For example, a 2-gram (bigram) is two words in a row, like "good morning."
beginner
What is the difference between a unigram, bigram, and trigram?
A unigram is a single word, a bigram is a pair of two consecutive words, and a trigram is a group of three consecutive words.
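To make the three sizes concrete, here is a minimal sketch that slides a window of n tokens over a whitespace-split sentence. The helper name `ngrams` and the sample sentence are illustrative assumptions, not part of any library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # A list of length L has L - n + 1 windows of size n (none if n > L).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "good morning to you".split()

unigrams = ngrams(tokens, 1)   # single words
bigrams = ngrams(tokens, 2)    # pairs of consecutive words
trigrams = ngrams(tokens, 3)   # triples of consecutive words

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with four words yields four unigrams, three bigrams, and two trigrams, since each extra word in the window removes one possible starting position.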
intermediate
Why are N-grams useful in language models?
An N-gram language model predicts the next word from the previous N-1 words, based on how often each N-gram occurred in training text. This local context supports tasks like text prediction and spelling correction.
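As a rough illustration of prediction from the previous N-1 words, the sketch below uses N = 2, so each prediction depends on one previous word. The tiny corpus and the function name `predict_next` are made up for this example; a real model would also smooth its counts.

```python
from collections import Counter, defaultdict

# Toy training corpus (hypothetical data for illustration only).
corpus = [
    "good morning to you",
    "good morning everyone",
    "good night to you",
]

# For each word, count which words followed it in the corpus.
followers = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev][nxt] += 1

def predict_next(prev_word):
    """Return the most frequent follower of prev_word, or None if unseen."""
    counts = followers.get(prev_word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(predict_next("good"))   # "morning" follows "good" twice, "night" once
print(predict_next("zebra"))  # never seen, so there is nothing to predict
```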
intermediate
What is a limitation of using large N in N-grams?
Using large N (like 5-grams or more) causes data sparsity: the number of possible sequences grows exponentially with N, so most of them appear rarely or never in the training text, making it hard for the model to estimate their probabilities.
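Sparsity can be seen even on a toy text. The sketch below, using an invented sample text, measures what share of the distinct n-grams occur exactly once as n grows; the helper `ngram_counts` is an illustrative name.

```python
from collections import Counter

# Small made-up text; punctuation is kept as its own token.
text = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat saw the dog . the dog saw the cat ."
)
tokens = text.split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Share of distinct n-grams that were seen only once, for n = 1..4.
singleton_share = {}
for n in (1, 2, 3, 4):
    counts = ngram_counts(tokens, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    singleton_share[n] = singletons / len(counts)
    print(n, len(counts), round(singleton_share[n], 2))
```

As n increases, a larger share of n-grams is seen only once, which leaves the model with very little evidence for each sequence.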
beginner
How can N-grams be used to detect common phrases?
By counting how often N-grams appear in text, we can find common phrases or word combinations that occur frequently, like "New York City" or "machine learning."
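A minimal sketch of phrase detection by counting, assuming whitespace tokenization and three invented example documents:

```python
from collections import Counter

# Hypothetical documents for illustration.
docs = [
    "machine learning is fun",
    "i study machine learning in new york city",
    "new york city has many machine learning meetups",
]

# Count bigrams within each document so pairs never span two documents.
bigram_counts = Counter()
for doc in docs:
    words = doc.split()
    bigram_counts.update(zip(words, words[1:]))

# The most frequent bigrams are candidates for common phrases.
top = bigram_counts.most_common(3)
for pair, count in top:
    print(" ".join(pair), count)
```

Raw counts also surface uninformative pairs in larger corpora, so practical systems often combine counting with stop-word filtering or an association score.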
What does a bigram represent?
A. Two consecutive words
B. One single word
C. Three consecutive words
D. A sentence
Which N-gram size is called a trigram?
A. 2
B. 3
C. 1
D. 4
What problem can happen if N is too large in N-grams?
A. Data sparsity
B. Too many common phrases
C. Words lose meaning
D. Model runs faster
How do N-grams help in text prediction?
A. By counting characters
B. By ignoring previous words
C. By translating text
D. By looking at previous N-1 words to predict the next
Which of these is an example of a unigram?
A. "machine learning"
B. "learning model"
C. "machine"
D. "deep neural network"
Explain what N-grams are and how they are used in language processing.
Think about sequences of words and how they help computers understand text.
    Describe one advantage and one limitation of using N-grams in text analysis.
    Consider what N-grams help with and what problems happen when N grows.

      Practice

      1. What is an n-gram in natural language processing?
      easy
      A. A random selection of n words from a text
      B. A single word repeated n times
      C. A sentence with n words
      D. A group of n consecutive words in a text

      Solution

      1. Step 1: Understand the definition of n-gram

        An n-gram is defined as a sequence of n consecutive words appearing together in text.
      2. Step 2: Compare options with definition

        Only option D, "A group of n consecutive words in a text," describes consecutive words; the other options describe random, repeated, or sentence-length groupings.
      3. Final Answer:

        A group of n consecutive words in a text -> Option D
      4. Quick Check:

        n-gram = consecutive words [OK]
      Hint: Remember: n-gram means consecutive words, not random ones [OK]
      Common Mistakes:
      • Thinking n-gram means repeated words
      • Confusing n-gram with sentence length
      • Assuming words are randomly picked
      2. Which of the following is the correct way to set up a CountVectorizer to extract bigrams in Python?
      easy
      A. CountVectorizer(ngram_range=(1,1))
      B. CountVectorizer(ngram_range=(2,2))
      C. CountVectorizer(ngram_range=(0,2))
      D. CountVectorizer(ngram_range=(1,3))

      Solution

      1. Step 1: Understand ngram_range parameter

        ngram_range=(2,2) extracts only bigrams (groups of exactly 2 words).
      2. Step 2: Evaluate each option

        Option A, ngram_range=(1,1), extracts unigrams only; option C, ngram_range=(0,2), is not a meaningful range because an n-gram needs at least one word; option D, ngram_range=(1,3), extracts unigrams, bigrams, and trigrams.
      3. Final Answer:

        CountVectorizer(ngram_range=(2,2)) -> Option B
      4. Quick Check:

        bigrams = ngram_range (2,2) [OK]
      Hint: Set ngram_range=(2,2) for only bigrams [OK]
      Common Mistakes:
      • Using (1,1) which extracts unigrams
      • Using (0,2) which is invalid
      • Using (1,3) which extracts multiple n-grams
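      3. What are the word trigrams of the sentence 'I love machine learning'? (With CountVectorizer(ngram_range=(3,3)), getting this result requires lowercase=False and a token_pattern that keeps one-letter words, because the defaults lowercase the text and drop 'I'.)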
      medium
      A. ['I love machine', 'love machine learning']
      B. ['I love', 'love machine', 'machine learning']
      C. ['I', 'love', 'machine', 'learning']
      D. ['I love machine learning']

      Solution

      1. Step 1: Understand trigram extraction

        Trigrams are groups of 3 consecutive words. The sentence has 4 words, so possible trigrams are words 1-3 and 2-4.
      2. Step 2: List trigrams from the sentence

        First trigram: 'I love machine', second trigram: 'love machine learning'.
      3. Final Answer:

        ['I love machine', 'love machine learning'] -> Option A
      4. Quick Check:

        Trigrams = groups of 3 words [OK]
      Hint: Count groups of 3 consecutive words for trigrams [OK]
      Common Mistakes:
      • Listing bigrams instead of trigrams
      • Listing single words instead of groups
      • Combining all words as one token
      4. Identify the error in this code snippet for extracting bigrams:
      from sklearn.feature_extraction.text import CountVectorizer
      text = ['hello world']
      vectorizer = CountVectorizer(ngram_range=(1,2))
      vectorizer.fit_transform(text)
      print(vectorizer.get_feature_names())
      medium
      A. The text should be a string, not a list
      B. The ngram_range should be (2,2) to extract only bigrams
      C. The method get_feature_names() is deprecated and should be get_feature_names_out()
      D. CountVectorizer cannot extract bigrams

      Solution

      1. Step 1: Check method usage

        get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions this call fails; get_feature_names_out() is the replacement.
      2. Step 2: Validate other parts

        ngram_range=(1,2) is valid for unigrams and bigrams; text as list is correct; CountVectorizer supports bigrams.
      3. Final Answer:

        get_feature_names() is deprecated and should be get_feature_names_out() -> Option C
      4. Quick Check:

        Use get_feature_names_out() for features [OK]
      Hint: Use get_feature_names_out() instead of deprecated get_feature_names() [OK]
      Common Mistakes:
      • Thinking ngram_range=(1,2) is wrong for bigrams
      • Assuming text must be a string, not list
      • Believing CountVectorizer can't extract bigrams
      5. You want to build a text prediction model that uses both unigrams and bigrams but excludes any n-grams containing stop words like 'the' or 'and'. Which approach is best?
      hard
      A. Use CountVectorizer with ngram_range=(1,2) and stop_words='english'
      B. Use CountVectorizer with ngram_range=(2,2) and no stop words removal
      C. Use CountVectorizer with ngram_range=(1,1) and manually remove stop words after extraction
      D. Use CountVectorizer with ngram_range=(1,3) and stop_words=None

      Solution

      1. Step 1: Understand requirements

        We need unigrams and bigrams, and want to exclude stop words in any n-gram.
      2. Step 2: Evaluate options

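        Option A uses ngram_range=(1,2) for unigrams and bigrams, and stop_words='english' removes stop words from the token stream before n-grams are built, so no resulting n-gram contains one. Note that a bigram can then join two words that a removed stop word originally separated. Option B misses unigrams and keeps stop words, option C misses bigrams, and option D adds trigrams and keeps stop words.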
      3. Final Answer:

        Use CountVectorizer with ngram_range=(1,2) and stop_words='english' -> Option A
      4. Quick Check:

        Unigrams + bigrams + stop word removal = ngram_range=(1,2) with stop_words='english' [OK]
      Hint: Set ngram_range and stop_words='english' to filter stop words [OK]
      Common Mistakes:
      • Not removing stop words from bigrams
      • Using wrong ngram_range missing unigrams
      • Including trigrams when not needed