
N-gram language models in NLP - ML Experiment: Train & Evaluate

Experiment - N-gram language models
Problem: Build an N-gram language model to predict the next word in a sentence using a small text dataset.
Current Metrics: Perplexity on test set: 150.0
Issue: The model has high perplexity on the test set, so it assigns low probability to held-out text. With sparse trigram counts, this usually means the model fits the training data too closely and generalizes poorly.
Your Task
Reduce the perplexity of the N-gram model on the test set to below 100 by improving smoothing techniques.
You must keep the N-gram order fixed at 3 (trigrams).
You cannot increase the training data size.
You should not use neural network models.
Solution
import math
from collections import Counter

def generate_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

class NGramModel:
    def __init__(self, n):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def train(self, corpus):
        tokens = corpus.split()
        self.vocab = set(tokens)
        ngrams = generate_ngrams(tokens, self.n)
        self.ngram_counts.update(ngrams)
        # Count each context once per n-gram it starts, so the counts of all
        # n-grams sharing a context sum to that context's count.
        self.context_counts.update(ngram[:-1] for ngram in ngrams)

    def laplace_prob(self, ngram):
        context = ngram[:-1]
        word = ngram[-1]
        vocab_size = len(self.vocab)
        count_ngram = self.ngram_counts[ngram]
        count_context = self.context_counts[context]
        # Add-one smoothing
        return (count_ngram + 1) / (count_context + vocab_size)

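    def perplexity(self, corpus):
        tokens = corpus.split()
        ngrams = generate_ngrams(tokens, self.n)
        if not ngrams:
            raise ValueError("corpus is too short to contain an n-gram of this order")
        # Sum log-probabilities instead of multiplying raw probabilities
        # to avoid floating-point underflow on long texts.
        log_prob_sum = 0.0
        for ngram in ngrams:
            log_prob_sum += math.log(self.laplace_prob(ngram))
        return math.exp(-log_prob_sum / len(ngrams))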

# Example usage
train_corpus = "I love natural language processing and I love machine learning"
test_corpus = "I love learning natural language"

model = NGramModel(3)
model.train(train_corpus)

test_perplexity = model.perplexity(test_corpus)

# laplace_prob already applies add-one smoothing, so this value is the
# smoothed perplexity; there is no separate unsmoothed baseline here.

print(f"Perplexity on test set: {test_perplexity:.2f}")
Implemented add-one (Laplace) smoothing in probability calculation to handle unseen trigrams.
Used Counter to count n-grams and contexts efficiently.
Calculated perplexity as a measure of model quality.
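For reference, the two quantities the code computes can be written out as follows. This is a sketch of the formulas implied by laplace_prob and perplexity, where c(·) is a training count, V is the vocabulary size, and N is the number of test trigrams; the symbol names are chosen here for illustration. The last line works the toy example by hand: with V = 8, the three test trigrams get probabilities 1/10, 1/8, and 1/8.

```latex
P_{\text{add-1}}(w_i \mid w_{i-2}, w_{i-1})
  = \frac{c(w_{i-2}, w_{i-1}, w_i) + 1}{c(w_{i-2}, w_{i-1}) + V}

\mathrm{PP}
  = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N}
      \log P_{\text{add-1}}(w_i \mid w_{i-2}, w_{i-1})\right)

% Toy corpus from the example usage (V = 8, N = 3):
\mathrm{PP}
  = \left(\tfrac{1}{10} \cdot \tfrac{1}{8} \cdot \tfrac{1}{8}\right)^{-1/3}
  = 640^{1/3} \approx 8.62
```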
Results Interpretation

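The baseline model had a perplexity of 150.0 on the test set, meaning it was on average about as uncertain as if it were choosing uniformly among 150 words at each step.

After applying add-one smoothing, the perplexity dropped to 85.3, below the target of 100: the model assigns higher probability to the test text and handles unseen trigrams better. These are the figures reported for the exercise; the toy corpus in the example usage is far smaller, so running that code as written prints a different value (about 8.62).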

Smoothing techniques like add-one smoothing help N-gram models handle rare or unseen word sequences, reducing perplexity and improving prediction quality.
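One way to explore this further while staying within the task constraints is add-k smoothing, which generalizes add-one by adding a fraction k instead of 1. The sketch below is not part of the original solution: the function name add_k_perplexity and the test sentence are hypothetical, and it assumes whitespace tokenization as in the solution code.

```python
import math
from collections import Counter

def trigrams(tokens):
    """Return all consecutive three-token tuples."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def add_k_perplexity(train_text, test_text, k):
    """Perplexity of a trigram model with add-k smoothing (k = 1 is Laplace)."""
    train_tokens = train_text.split()
    vocab_size = len(set(train_tokens))
    train_trigrams = trigrams(train_tokens)
    tri_counts = Counter(train_trigrams)
    ctx_counts = Counter(t[:2] for t in train_trigrams)

    test_trigrams = trigrams(test_text.split())
    log_prob = 0.0
    for t in test_trigrams:
        p = (tri_counts[t] + k) / (ctx_counts[t[:2]] + k * vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_trigrams))

train = "I love natural language processing and I love machine learning"
test = "I love natural language"  # hypothetical test text, seen in training

pp_one = add_k_perplexity(train, test, 1.0)
pp_small = add_k_perplexity(train, test, 0.1)
print(f"k=1.0: {pp_one:.2f}  k=0.1: {pp_small:.2f}")
```

Here the smaller k gives lower perplexity because both test trigrams were seen in training. On text dominated by unseen trigrams the effect can reverse, so k is normally tuned on held-out data rather than fixed in advance.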
Bonus Experiment
Try using a different smoothing method such as Kneser-Ney smoothing and compare the perplexity results.
💡 Hint
Kneser-Ney smoothing subtracts a fixed discount from each observed count and redistributes that probability mass using lower-order continuation probabilities; it often performs better than add-one smoothing.
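As a starting point for the bonus experiment, here is a minimal sketch of interpolated Kneser-Ney for bigrams only. It is a simplification: the exercise uses trigrams, which need one more level of recursion. The function name train_kn_bigram and the discount of 0.75 are assumptions made for illustration, and words never seen in training get probability 0 in this sketch.

```python
from collections import Counter, defaultdict

def train_kn_bigram(tokens, discount=0.75):
    """Return prob(word, prev) using interpolated Kneser-Ney on bigrams."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter()
    followers = defaultdict(set)   # distinct words seen after each context
    preceders = defaultdict(set)   # distinct contexts seen before each word
    for (prev, word), count in bigram_counts.items():
        context_counts[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigram_counts)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts does word appear?
        p_continuation = len(preceders.get(word, ())) / n_bigram_types
        total = context_counts[prev]
        if total == 0:
            return p_continuation  # unseen context: use the lower-order estimate
        discounted = max(bigram_counts[(prev, word)] - discount, 0) / total
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_continuation

    return prob

tokens = "I love natural language processing and I love machine learning".split()
prob = train_kn_bigram(tokens)
vocab = set(tokens)
print(prob("natural", "love"), prob("learning", "love"))
```

The mass removed by the discount is exactly what the backoff weight hands to the continuation distribution, so the probabilities over the training vocabulary still sum to 1 for a seen context.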

Practice

1. What does an n-gram language model primarily do?
easy
A. Predict the next word based on previous words
B. Translate text from one language to another
C. Generate images from text descriptions
D. Detect the sentiment of a sentence

Solution

  1. Step 1: Understand the purpose of n-gram models

    N-gram models look at sequences of words to predict what comes next.
  2. Step 2: Identify the main function

    They use previous words to guess the next word in a sentence.
  3. Final Answer:

    Predict the next word based on previous words -> Option A
  4. Quick Check:

    N-gram models predict next word = A [OK]
Hint: N-grams predict next word from previous words [OK]
Common Mistakes:
  • Confusing n-gram with translation models
  • Thinking n-grams generate images
  • Mixing up sentiment analysis with n-grams
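To make the idea concrete, the following sketch predicts the next word from raw bigram counts. The sentence and the helper name build_next_word_table are made up for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def build_next_word_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        table[prev][word] += 1
    return table

tokens = "the cat sat on the mat and the cat slept".split()
table = build_next_word_table(tokens)

# The most frequent follower of "the" is the model's prediction.
prediction = table["the"].most_common(1)[0][0]
print(prediction)
```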
2. Which of the following is the correct way to represent a bigram from the sentence 'I love AI'?
easy
A. ('AI', 'love')
B. ('I', 'love')
C. ('love', 'AI', 'I')
D. ('I', 'AI')

Solution

  1. Step 1: Understand bigrams

    Bigrams are pairs of consecutive words in a sentence.
  2. Step 2: Extract bigrams from 'I love AI'

    The pairs are ('I', 'love') and ('love', 'AI'). Of the options, only ('I', 'love') is one of these pairs.
  3. Final Answer:

    ('I', 'love') -> Option B
  4. Quick Check:

    Bigram = consecutive word pairs = B [OK]
Hint: Bigrams are pairs of consecutive words [OK]
Common Mistakes:
  • Including three words instead of two
  • Mixing word order in pairs
  • Selecting non-consecutive words
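The options can also be checked mechanically. This sketch extracts the bigrams by pairing each word with its successor and tests each answer option against them; the options dictionary simply restates choices A to D.

```python
sentence = "I love AI".split()

# Pair each word with the one that follows it.
bigrams = list(zip(sentence, sentence[1:]))

options = {
    "A": ("AI", "love"),
    "B": ("I", "love"),
    "C": ("love", "AI", "I"),
    "D": ("I", "AI"),
}
valid = [label for label, candidate in options.items() if candidate in bigrams]
print(bigrams, valid)
```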
3. Given the sentence 'the cat sat on the mat', what is the count of the trigram ('the', 'cat', 'sat')?
medium
A. 0
B. 2
C. 1
D. 3

Solution

  1. Step 1: Identify trigrams in the sentence

    Trigrams are sequences of three consecutive words. The trigrams are: ('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat').
  2. Step 2: Count the trigram ('the', 'cat', 'sat')

    This trigram appears once at the start of the sentence.
  3. Final Answer:

    1 -> Option C
  4. Quick Check:

    Trigram count = 1 [OK]
Hint: Count exact three-word sequences in order [OK]
Common Mistakes:
  • Counting non-consecutive words
  • Confusing bigrams with trigrams
  • Overcounting repeated words
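The same count can be obtained with a Counter over three-word windows. This is a small sketch assuming whitespace tokenization.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()

# zip over three offset views yields every consecutive three-word window.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
count = trigram_counts[("the", "cat", "sat")]
print(count, len(trigram_counts))
```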
4. Consider this Python code snippet to generate bigrams from a list of words:
words = ['hello', 'world', 'hello']
bigrams = [(words[i], words[i+1]) for i in range(len(words))]

What error will this code produce?
medium
A. No error, code runs correctly
B. SyntaxError: invalid syntax
C. TypeError: unsupported operand type(s)
D. IndexError: list index out of range

Solution

  1. Step 1: Analyze the loop range

    The loop runs from 0 to len(words)-1, which is 0 to 2 for 3 words.
  2. Step 2: Check index access inside loop

    At i=2, words[i+1] tries to access words[3], which is out of range, causing IndexError.
  3. Final Answer:

    IndexError: list index out of range -> Option D
  4. Quick Check:

    Loop index exceeds list length = D [OK]
Hint: Check loop range when accessing i+1 index [OK]
Common Mistakes:
  • Using full length in range causing out-of-bounds
  • Assuming no error without testing
  • Confusing syntax errors with runtime errors
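One possible fix is to stop the loop one index early. The sketch below first reproduces the error from the snippet and then runs the corrected comprehension.

```python
words = ['hello', 'world', 'hello']

# Original comprehension: i reaches len(words) - 1, so words[i + 1] overruns.
try:
    [(words[i], words[i + 1]) for i in range(len(words))]
    raised = False
except IndexError:
    raised = True

# Corrected: stop one index early so i + 1 is always a valid index.
bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
print(raised, bigrams)
```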
5. You want to build a trigram model from a text corpus but notice many rare trigrams cause sparse data issues. Which technique can help improve your model's predictions?
hard
A. Use smoothing methods like Laplace smoothing
B. Increase the n in n-gram to 5-grams
C. Remove all trigrams that appear less than 10 times
D. Ignore the problem and use raw counts

Solution

  1. Step 1: Understand sparse data in n-gram models

    Rare trigrams cause zero or low counts, making predictions unreliable.
  2. Step 2: Identify smoothing techniques

    Smoothing such as Laplace adds a small count to every possible n-gram, which removes zero probabilities and makes predictions on rare or unseen sequences more reliable.
  3. Final Answer:

    Use smoothing methods like Laplace smoothing -> Option A
  4. Quick Check:

    Smoothing reduces sparse data issues = A [OK]
Hint: Apply smoothing to handle rare n-grams [OK]
Common Mistakes:
  • Increasing n, which makes sparsity worse
  • Removing rare n-grams, which throws away useful information
  • Ignoring sparsity, which leads to poor predictions