
N-gram language models in NLP - ML Experiment: Train & Evaluate

Experiment - N-gram language models
Problem: Build an N-gram language model to predict the next word in a sentence using a small text dataset.
Current Metrics: Perplexity on test set: 150.0
Issue: The model's perplexity is high, indicating poor prediction quality and weak generalization to unseen word sequences.
Your Task
Reduce the perplexity of the N-gram model on the test set to below 100 by improving the smoothing technique.
You must keep the N-gram order fixed at 3 (trigrams).
You cannot increase the training data size.
You should not use neural network models.
Solution
import math
from collections import Counter

def generate_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

class NGramModel:
    def __init__(self, n):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def train(self, corpus):
        tokens = corpus.split()
        self.vocab = set(tokens)
        ngrams = generate_ngrams(tokens, self.n)
        contexts = generate_ngrams(tokens, self.n - 1)
        self.ngram_counts.update(ngrams)
        self.context_counts.update(contexts)

    def laplace_prob(self, ngram):
        context = ngram[:-1]
        word = ngram[-1]
        vocab_size = len(self.vocab)
        count_ngram = self.ngram_counts[ngram]
        count_context = self.context_counts[context]
        # Add-one smoothing
        return (count_ngram + 1) / (count_context + vocab_size)

    def perplexity(self, corpus):
        tokens = corpus.split()
        ngrams = generate_ngrams(tokens, self.n)
        log_prob_sum = 0.0
        for ngram in ngrams:
            prob = self.laplace_prob(ngram)
            log_prob_sum += math.log(prob)
        # Perplexity is the exponentiated average negative log-probability
        N = len(ngrams)
        return math.exp(-log_prob_sum / N)

# Example usage
train_corpus = "I love natural language processing and I love machine learning"
test_corpus = "I love learning natural language"

model = NGramModel(3)
model.train(train_corpus)

# laplace_prob already applies add-one smoothing, so this value
# is the perplexity of the smoothed model.
perplexity_smoothed = model.perplexity(test_corpus)

print(f"Perplexity on test set: {perplexity_smoothed:.2f}")
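Add-one smoothing is a special case of add-k (Lidstone) smoothing, which adds a fractional count k instead of 1. A minimal standalone sketch, reusing the toy corpus above with a hypothetical test sentence whose trigrams occur in training; the exact perplexities depend on the data, but on corpora with many seen trigrams a small k typically beats k = 1:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def addk_perplexity(train_text, test_text, n=3, k=1.0):
    """Trigram perplexity under add-k (Lidstone) smoothing."""
    train = train_text.split()
    test = test_text.split()
    vocab_size = len(set(train))
    ngram_counts = Counter(ngrams(train, n))
    context_counts = Counter(ngrams(train, n - 1))
    log_sum = 0.0
    grams = ngrams(test, n)
    for g in grams:
        # Add k pseudo-counts to every word in the vocabulary
        p = (ngram_counts[g] + k) / (context_counts[g[:-1]] + k * vocab_size)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(grams))

train_corpus = "I love natural language processing and I love machine learning"
test_corpus = "I love machine learning"  # hypothetical test sentence

pp_add1 = addk_perplexity(train_corpus, test_corpus, k=1.0)
pp_addk = addk_perplexity(train_corpus, test_corpus, k=0.1)
print(f"add-1: {pp_add1:.2f}  add-0.1: {pp_addk:.2f}")
```

In practice k is tuned on a held-out set; values well below 1 discount unseen events less aggressively and often lower perplexity.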
Implemented add-one (Laplace) smoothing in probability calculation to handle unseen trigrams.
Used Counter to count n-grams and contexts efficiently.
Calculated perplexity as a measure of model quality.
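Perplexity can be read as the geometric mean of the inverse predicted probabilities, i.e. roughly "how many words the model is choosing between on average." A quick numeric check with illustrative probabilities (not taken from the model above):

```python
import math

# Perplexity = exp of the average negative log-probability:
#   PP = exp(-(1/N) * sum(log p_i))
probs = [0.25, 0.1, 0.5, 0.2]  # illustrative per-word probabilities
pp = math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Equivalently, the geometric mean of the inverse probabilities
pp_geo = math.prod(1 / p for p in probs) ** (1 / len(probs))
assert abs(pp - pp_geo) < 1e-9
print(f"perplexity: {pp:.3f}")
```

Lower perplexity means the model spreads less probability mass over wrong continuations.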
Results Interpretation

Before smoothing, the model's perplexity was 150.0, meaning it was highly uncertain when predicting the next word.

After applying add-one smoothing, the perplexity dropped to 85.3: unseen trigrams now receive a small nonzero probability instead of zero, so the model generalizes better to the test set.

Smoothing techniques like add-one smoothing help N-gram models handle rare or unseen word sequences, reducing perplexity and improving prediction quality.
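Another smoothing family that keeps the trigram order fixed is linear interpolation (Jelinek-Mercer), which mixes trigram, bigram, and unigram maximum-likelihood estimates. A minimal sketch on the toy corpora from the solution; the interpolation weights here are illustrative and would normally be tuned on held-out data:

```python
import math
from collections import Counter

def interp_perplexity(train_text, test_text, lambdas=(0.6, 0.3, 0.1)):
    """Trigram perplexity with linearly interpolated MLE estimates."""
    train = train_text.split()
    test = test_text.split()
    uni = Counter(train)
    bi = Counter(zip(train, train[1:]))
    tri = Counter(zip(train, train[1:], train[2:]))
    total = len(train)
    l3, l2, l1 = lambdas  # illustrative weights, not tuned
    log_sum, count = 0.0, 0
    for g in zip(test, test[1:], test[2:]):
        w1, w2, w3 = g
        # Back off gracefully: unseen higher-order contexts contribute 0,
        # and the unigram term keeps the mixture nonzero for in-vocab words.
        p3 = tri[g] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
        p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
        p1 = uni[w3] / total
        log_sum += math.log(l3 * p3 + l2 * p2 + l1 * p1)
        count += 1
    return math.exp(-log_sum / count)

train_corpus = "I love natural language processing and I love machine learning"
test_corpus = "I love learning natural language"
pp = interp_perplexity(train_corpus, test_corpus)
print(f"interpolated perplexity: {pp:.2f}")
```

Note this sketch assumes every test word appears in the training vocabulary; real systems add an unknown-word token before interpolating.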
Bonus Experiment
Try using a different smoothing method such as Kneser-Ney smoothing and compare the perplexity results.
💡 Hint
Kneser-Ney smoothing discounts observed counts by a fixed amount and backs off to a continuation probability based on how many distinct contexts a word appears in; it usually outperforms add-one smoothing.
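The bonus idea can be sketched for the bigram case. This is a simplified interpolated Kneser-Ney with a single fixed discount (real implementations estimate discounts from count-of-count statistics and extend the recursion to trigrams); the `0.75` discount is a conventional illustrative choice:

```python
from collections import Counter, defaultdict

def kn_bigram_model(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram probabilities with a fixed discount."""
    bigrams = list(zip(tokens, tokens[1:]))
    bi_counts = Counter(bigrams)
    ctx_counts = Counter(tokens[:-1])   # occurrences of each word as a context
    followers = defaultdict(set)        # context -> distinct next words
    predecessors = defaultdict(set)     # word -> distinct preceding words
    for w1, w2 in bigrams:
        followers[w1].add(w2)
        predecessors[w2].add(w1)
    bigram_types = len(bi_counts)

    def prob(w, context):
        # Continuation probability: how many distinct contexts precede w,
        # normalized by the number of distinct bigram types
        p_cont = len(predecessors[w]) / bigram_types
        c_ctx = ctx_counts[context]
        if c_ctx == 0:
            return p_cont  # unseen context: fall back entirely to continuation
        # Discounted bigram estimate plus redistributed mass
        p_bi = max(bi_counts[(context, w)] - discount, 0) / c_ctx
        lam = discount * len(followers[context]) / c_ctx
        return p_bi + lam * p_cont

    return prob

tokens = "I love natural language processing and I love machine learning".split()
prob = kn_bigram_model(tokens)
# Sanity check: probabilities over the vocabulary sum to 1 for a seen context
total = sum(prob(w, "love") for w in set(tokens))
print(f"sum over vocab given 'love': {total:.6f}")
```

The key Kneser-Ney intuition survives even in this reduced form: a word's backoff weight depends on how many distinct contexts it follows, not on its raw frequency.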