N-gram language models in NLP - ML Experiment: Train & Evaluate

Experiment - N-gram language models

Problem: Build an N-gram language model to predict the next word in a sentence using a small text dataset.

Current Metrics: Perplexity on test set: 150.0

Issue: The model has high perplexity: it assigns low probability to the test text, which suggests it fits the training counts too closely and generalizes poorly to unseen trigrams.

Your Task

Reduce the perplexity of the N-gram model on the test set to below 100 by improving smoothing techniques.

You must keep the N-gram order fixed at 3 (trigrams).

You cannot increase the training data size.

You should not use neural network models.

Solution

import math
from collections import Counter

def generate_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

class NGramModel:
    def __init__(self, n):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

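    def train(self, corpus):
        tokens = corpus.split()
        self.vocab = set(tokens)
        ngrams = generate_ngrams(tokens, self.n)
        self.ngram_counts.update(ngrams)
        # Count a context once per n-gram it starts, so each context count
        # equals the sum of the counts of the n-grams that extend it. Counting
        # every (n-1)-gram would also count the final one, which has no next word.
        self.context_counts.update(ngram[:-1] for ngram in ngrams)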

    def laplace_prob(self, ngram):
        context = ngram[:-1]
        word = ngram[-1]
        vocab_size = len(self.vocab)
        count_ngram = self.ngram_counts[ngram]
        count_context = self.context_counts[context]
        # Add-one smoothing
        return (count_ngram + 1) / (count_context + vocab_size)

    def perplexity(self, corpus):
        tokens = corpus.split()
        ngrams = generate_ngrams(tokens, self.n)
        log_prob_sum = 0
        for ngram in ngrams:
            prob = self.laplace_prob(ngram)
            log_prob_sum += math.log(prob)
        N = len(ngrams)
        return math.exp(-log_prob_sum / N)

# Example usage
train_corpus = "I love natural language processing and I love machine learning"
test_corpus = "I love learning natural language"

model = NGramModel(3)
model.train(train_corpus)

# laplace_prob already applies add-one smoothing, so this value is the
# smoothed perplexity; there is no separate unsmoothed baseline here.
test_perplexity = model.perplexity(test_corpus)

print(f"Perplexity on test set: {test_perplexity:.2f}")

Implemented add-one (Laplace) smoothing in probability calculation to handle unseen trigrams.

Used Counter to count n-grams and contexts efficiently.

Calculated perplexity as a measure of model quality.
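For reference, the two quantities the solution code computes can be written out as below. The notation is introduced here for illustration and is not part of the original exercise: c(·) is a training count, |V| is the vocabulary size, and N is the number of test trigrams. The last line works through the toy corpora from the example, where all three test trigrams are unseen in training and only the context ("I", "love") has a non-zero count (2, with |V| = 8).

```latex
\begin{align*}
P_{\text{add-1}}(w_i \mid w_{i-2}, w_{i-1}) &= \frac{c(w_{i-2}, w_{i-1}, w_i) + 1}{c(w_{i-2}, w_{i-1}) + |V|} \\
\mathrm{PP} &= \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \ln P_{\text{add-1}}(w_i \mid w_{i-2}, w_{i-1})\right) \\
\mathrm{PP}_{\text{toy}} &= \left(\tfrac{1}{10} \cdot \tfrac{1}{8} \cdot \tfrac{1}{8}\right)^{-1/3} = 640^{1/3} \approx 8.62
\end{align*}
```

The toy value is not comparable to the perplexities reported for the exercise, since the corpora differ in size and vocabulary.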

Results Interpretation

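The baseline model had a reported test perplexity of 150.0. Roughly speaking, it was as uncertain as if it were choosing uniformly among 150 words at each step.

After applying add-one smoothing, the reported perplexity dropped to 85.3, below the target of 100: the model assigns higher probability to the test text because unseen trigrams no longer receive zero or near-zero probability. These figures presumably refer to the exercise's own dataset; the toy corpora in the example code are far smaller and print a different value.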

Smoothing techniques like add-one smoothing help N-gram models handle rare or unseen word sequences, reducing perplexity and improving prediction quality.
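One way to explore "improving smoothing" while keeping trigrams is to generalize add-one to add-k and pick k by held-out perplexity. The sketch below is a minimal illustration under that assumption, not the exercise's reference solution; the function name `add_k_perplexity` and the grid of k values are hypothetical, and the toy sentences stand in for a real held-out set.

```python
import math
from collections import Counter


def add_k_perplexity(train_tokens, test_tokens, n=3, k=1.0):
    """Perplexity of an add-k smoothed n-gram model (k=1.0 is add-one)."""
    ngrams = [tuple(train_tokens[i:i + n]) for i in range(len(train_tokens) - n + 1)]
    ngram_counts = Counter(ngrams)
    # Context counts are derived from the n-grams so the distribution normalizes.
    context_counts = Counter(ng[:-1] for ng in ngrams)
    vocab_size = len(set(train_tokens))

    test_ngrams = [tuple(test_tokens[i:i + n]) for i in range(len(test_tokens) - n + 1)]
    log_prob_sum = 0.0
    for ng in test_ngrams:
        prob = (ngram_counts[ng] + k) / (context_counts[ng[:-1]] + k * vocab_size)
        log_prob_sum += math.log(prob)
    return math.exp(-log_prob_sum / len(test_ngrams))


train = "I love natural language processing and I love machine learning".split()
heldout = "I love learning natural language".split()

# Hypothetical grid of k values; tune on held-out data, not on the test set.
results = {k: add_k_perplexity(train, heldout, n=3, k=k) for k in (1.0, 0.5, 0.1, 0.01)}
best_k = min(results, key=results.get)
for k, pp in results.items():
    print(f"k={k}: perplexity={pp:.2f}")
print("best k:", best_k)
```

On this toy data every held-out trigram is unseen, so shrinking k only lowers their probability and k = 1.0 comes out best. On a larger corpus, where many test trigrams do occur in training, a smaller k may give lower perplexity, which is why the value is worth tuning rather than assuming.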

Bonus Experiment

Try using a different smoothing method such as Kneser-Ney smoothing and compare the perplexity results.

💡 Hint

Kneser-Ney smoothing considers lower-order n-gram probabilities and discounts counts; it often performs better than add-one smoothing.
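To make the two ingredients in the hint concrete (discounting counts and falling back on a lower-order distribution), here is a minimal sketch of interpolated Kneser-Ney for bigrams only. It is an illustration under simplifying assumptions: the exercise fixes the order at 3, and a trigram version would apply the same idea recursively. The function name `train_kn_bigram` and the discount of 0.75 are choices made for this example, and words never seen after any other word receive zero probability here.

```python
from collections import Counter, defaultdict


def train_kn_bigram(tokens, discount=0.75):
    """Return p(word, prev) using interpolated Kneser-Ney over bigrams."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()       # total bigram tokens that start with prev
    followers = defaultdict(set)     # distinct words seen after prev
    predecessors = defaultdict(set)  # distinct words seen before word
    for (prev, word), count in bigram_counts.items():
        context_totals[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    num_bigram_types = len(bigram_counts)

    def continuation(word):
        # Share of distinct bigram types that end in this word.
        return len(predecessors.get(word, ())) / num_bigram_types

    def prob(word, prev):
        total = context_totals[prev]
        if total == 0:
            return continuation(word)  # unseen context: use the lower order only
        discounted = max(bigram_counts[(prev, word)] - discount, 0.0) / total
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * continuation(word)

    return prob


tokens = "I love natural language processing and I love machine learning".split()
p = train_kn_bigram(tokens)
vocab = set(tokens)
print(p("natural", "love"))
print(sum(p(w, "love") for w in vocab))
```

The second printed value checks that the probabilities for a seen context sum to 1 over the vocabulary: the mass removed by the discount is exactly what the backoff weight redistributes through the continuation distribution.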

Practice

1. What does an n-gram language model primarily do? (easy)

A. Predict the next word based on previous words

B. Translate text from one language to another

C. Generate images from text descriptions

D. Detect the sentiment of a sentence

Solution

Step 1: Understand the purpose of n-gram models

Step 2: Identify the main function

Final Answer: A. Predict the next word based on previous words

Solution

Step 1: Understand bigrams

Step 2: Extract bigrams from 'I love AI'
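As a quick illustration of this step, the same sliding-window helper used in the solution code can extract the bigrams from 'I love AI'. Splitting on whitespace is an assumption about tokenization.

```python
def generate_ngrams(tokens, n):
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = generate_ngrams("I love AI".split(), 2)
print(bigrams)  # [('I', 'love'), ('love', 'AI')]
```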

Solution

Step 1: Identify trigrams in the sentence

Step 2: Count the trigram ('the', 'cat', 'sat')
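Counting a specific trigram can be sketched with `Counter` over a three-way `zip`. The sentence below is hypothetical, chosen only so that ('the', 'cat', 'sat') occurs more than once; it is not the sentence from the original question.

```python
from collections import Counter

# Hypothetical sentence for illustration only.
sentence = "the cat sat on the mat and the cat sat down"
tokens = sentence.split()

# zip over three offset views of the token list yields consecutive trigrams.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
print(trigram_counts[("the", "cat", "sat")])  # 2
```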

Solution

Step 1: Analyze the loop range

Step 2: Check index access inside loop

Solution

Step 1: Understand sparse data in n-gram models

Step 2: Identify smoothing techniques
