For language models, the main metric is perplexity. It measures how well the model predicts the next word. A lower perplexity means the model assigns higher probability to the words that actually come next, just like how a good friend can finish your sentence correctly. Perplexity is important because it summarizes in one number how confident and accurate the model's predictions are across a whole text.
Language modeling concept in NLP - Model Metrics & Evaluation
A language model outputs a probability distribution over the whole vocabulary rather than a single class label, so a confusion matrix is not typical. Instead, we look at perplexity, which is calculated from the probabilities the model assigns to the correct next words.
Perplexity = 2^(- (1/N) * sum(log2 P(w_i | context)))
Where:
- N is the number of words
- P(w_i | context) is the predicted probability of the actual next word
Example:
If the model assigns probability 0.5 to the actual next word, that word's perplexity contribution is 2^(-log2(0.5)) = 2^1 = 2, as if the model were choosing between two equally likely words.
Lower perplexity means better prediction.
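The formula and the example above can be checked with a short Python sketch. The function name `perplexity` and the sample probabilities are illustrative assumptions, not part of any library.

```python
import math

def perplexity(probs):
    """Perplexity from the probabilities assigned to each actual next word."""
    n = len(probs)
    # Average log2 probability of the words that really occurred.
    avg_log2 = sum(math.log2(p) for p in probs) / n
    return 2 ** (-avg_log2)

# One word predicted with probability 0.5, as in the example above.
print(perplexity([0.5]))

# Hypothetical three-word sequence with progressively worse predictions.
print(perplexity([0.5, 0.25, 0.125]))
```

The first call reproduces the value 2 from the example. The second shows that lower probabilities on the correct words push perplexity up.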
Precision and recall are less common for language modeling because it predicts probabilities over many words. But if we think about next word prediction as a classification task, there is a tradeoff:
- Precision: How often the predicted word is actually correct. High precision means the model rarely guesses wrong words.
- Recall: How many of the correct next words the model can predict. High recall means the model covers many possible correct words.
For example, in autocomplete on your phone, high precision avoids annoying wrong suggestions, while high recall helps suggest many useful words. Language models balance these by assigning probabilities to many words.
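One way to make the autocomplete tradeoff concrete is to score the top k suggestions. The sketch below assumes each position has exactly one correct next word, and the distributions, words, and function names are all made up for illustration.

```python
def top_k(dist, k):
    """Return the k most probable words from a {word: probability} dict."""
    return sorted(dist, key=dist.get, reverse=True)[:k]

def precision_recall_at_k(examples, k):
    """Score top-k suggestions when each example has one correct next word."""
    hits = sum(actual in top_k(dist, k) for dist, actual in examples)
    shown = k * len(examples)        # total suggestions displayed
    precision = hits / shown         # share of suggestions that were right
    recall = hits / len(examples)    # share of correct words that were suggested
    return precision, recall

# Hypothetical model outputs paired with the word the user actually typed.
examples = [
    ({"you": 0.6, "it": 0.3, "them": 0.1}, "you"),
    ({"cat": 0.5, "dog": 0.4, "car": 0.1}, "cat"),
    ({"now": 0.7, "then": 0.2, "soon": 0.1}, "then"),
]

for k in (1, 2):
    p, r = precision_recall_at_k(examples, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

With this toy data, showing two suggestions instead of one raises recall but lowers precision, which is the tradeoff described above.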
Good: As a rough guide, perplexity close to 10 or lower on a test set means the model predicts next words well. On average it is about as uncertain as if it were choosing among 10 equally likely words.
Bad: Perplexity above 100 means the model is often confused and guesses poorly. It may not be learning much from the data. For reference, a model that picks words uniformly at random has perplexity equal to the vocabulary size.
Remember, perplexity depends on the dataset, its vocabulary, and how text is split into tokens, so compare models only on the same data.
- Overfitting: Very low perplexity on training data but high on test data means the model memorizes instead of learning language rules.
- Data leakage: If test sentences appear in training, perplexity looks artificially low, hiding true performance.
- Ignoring context length: Evaluating only on short contexts can make perplexity look better than it is, and the model may still fail on longer sentences.
- Comparing across datasets: Perplexity values vary by dataset size and vocabulary, so only compare models on the same data.
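The data leakage pitfall can be checked before evaluation. Below is a minimal sketch that flags test sentences also present in the training set. The function names and sample sentences are hypothetical, and the check only catches duplicates that match after lowercasing and whitespace cleanup, not paraphrases or partial overlaps.

```python
def normalize(sentence):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(sentence.lower().split())

def leaked_sentences(train, test):
    """Return the test sentences that also occur in the training set."""
    seen = {normalize(s) for s in train}
    return [s for s in test if normalize(s) in seen]

# Hypothetical data: the first test sentence duplicates a training sentence.
train = ["I love AI", "Language models predict words"]
test = ["i love  ai", "Perplexity measures surprise"]

print(leaked_sentences(train, test))
```

Any sentence this returns should be removed from the test set before reporting perplexity.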
Your language model has a perplexity of 50 on training data but 200 on test data. Is it good? Why or why not?
Answer: This is not good. The model performs well on training but poorly on test data, showing it memorized training sentences and cannot generalize to new text. It needs better training or regularization.
Practice
Solution
Step 1: Understand the purpose of language models
Language models are designed to understand and predict text sequences.
Step 2: Identify the main task of language models
The core task is to predict the next word based on previous words in a sentence.
Final Answer:
To predict the next word in a sentence -> Option A
Quick Check:
Language model goal = predict next word [OK]
- Confusing language modeling with translation
- Thinking language models only count words
- Assuming summarization is the main task
"I love AI"?
Solution
Step 1: Recall bigram model definition
A bigram model predicts each word based on the previous word, so probabilities are conditional.
Step 2: Apply bigram probabilities to the sentence
The sentence probability is P(I) * P(love | I) * P(AI | love), starting with the first word's probability.
Final Answer:
P(I) * P(love | I) * P(AI | love) -> Option D
Quick Check:
Bigram = word depends on previous word [OK]
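The bigram product in this solution can be sketched in Python. The probability values and the names `start_probs` and `bigram_sentence_prob` are assumptions chosen for illustration, since the question does not give numbers.

```python
def bigram_sentence_prob(sentence, start_probs, bigram_probs):
    """P(w1) * P(w2 | w1) * ... for a sentence given as a list of words."""
    prob = start_probs[sentence[0]]  # probability of the first word
    for prev, word in zip(sentence, sentence[1:]):
        prob *= bigram_probs[(prev, word)]  # keyed as (previous word, word)
    return prob

# Hypothetical probabilities for illustration only.
start_probs = {"I": 0.2}
bigram_probs = {("I", "love"): 0.3, ("love", "AI"): 0.4}

p = bigram_sentence_prob(["I", "love", "AI"], start_probs, bigram_probs)
print(round(p, 6))
```

With these made-up values the result is about 0.2 * 0.3 * 0.4 = 0.024. Note that each key puts the previous word first, which guards against the wrong conditional order.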
- Multiplying independent word probabilities (unigram)
- Using wrong conditional order
- Confusing bigram with trigram or other models
"I love AI" under a unigram model?
Solution
Step 1: Understand unigram model calculation
Unigram model assumes words are independent, so multiply their probabilities.
Step 2: Calculate sentence probability
Multiply P(I) * P(love) * P(AI) = 0.2 * 0.1 * 0.05 = 0.001
Final Answer:
0.001 -> Option B
Quick Check:
Unigram multiply all word probs = 0.001 [OK]
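The same arithmetic can be run in Python using the probabilities from the solution. The variable names are illustrative. The log-space version is an extra, commonly used variant that avoids numeric underflow on long sentences.

```python
import math

# Probabilities taken from the solution above.
unigram_probs = {"I": 0.2, "love": 0.1, "AI": 0.05}
sentence = ["I", "love", "AI"]

# Independence assumption: multiply the individual word probabilities.
prob = math.prod(unigram_probs[w] for w in sentence)

# Equivalent computation in log space: sum the logs, then exponentiate.
log_prob = sum(math.log(unigram_probs[w]) for w in sentence)

print(round(prob, 6))
print(round(math.exp(log_prob), 6))
```

Both lines print the same value up to floating-point rounding, matching the 0.001 in the solution.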
- Adding probabilities instead of multiplying
- Using conditional probabilities (bigram) by mistake
- Incorrect multiplication order
sentence = ['I', 'love', 'AI']
bigram_probs = {('I', 'love'): 0.3, ('love', 'AI'): 0.4}
prob = 1.0
for i in range(len(sentence)-1):
    prob *= bigram_probs[(sentence[i], sentence[i+1])]
print(prob)

What error will occur when running this code?
Solution
Step 1: Analyze the loop and dictionary access
The loop multiplies the probabilities of consecutive word pairs (bigrams) in the sentence, looking each pair up in the bigram_probs dictionary.
Step 2: Check if all bigrams exist in the dictionary
bigram_probs has no entry for the first word alone, but the code only looks up pairs, so no key is missing.
Step 3: Re-examine the code logic
Both bigrams, ('I','love') and ('love','AI'), exist in the dictionary, so there is no KeyError. No TypeError or IndexError is expected either.
Final Answer:
No error, prints 0.12 -> Option A
Quick Check:
All bigrams found, multiply 0.3*0.4=0.12 [OK]
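For contrast, here is a sketch of when the same lookup would fail. The function name `chain_prob` and the sentence ending in "NLP" are hypothetical additions. The dictionary is the one from the question.

```python
def chain_prob(sentence, bigram_probs):
    """Multiply bigram probabilities, as the snippet in the question does."""
    prob = 1.0
    for pair in zip(sentence, sentence[1:]):
        prob *= bigram_probs[pair]  # raises KeyError if the pair is absent
    return prob

bigram_probs = {("I", "love"): 0.3, ("love", "AI"): 0.4}

# Every pair is in the dictionary, so this runs without error.
print(chain_prob(["I", "love", "AI"], bigram_probs))

# ("love", "NLP") is not in the dictionary, so the lookup raises KeyError.
try:
    chain_prob(["I", "love", "NLP"], bigram_probs)
except KeyError as missing:
    print("unseen bigram:", missing)
```

A KeyError appears only when a pair is missing from the dictionary, which is not the case for the sentence in the question.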
- Assuming first word needs separate probability
- Confusing KeyError with IndexError
- Ignoring dictionary key structure
Solution
Step 1: Understand the unseen trigram problem
Unseen trigrams get zero probability, which zeroes out the probability of any sentence that contains them.
Step 2: Identify the solution to the zero-probability issue
Smoothing techniques like Kneser-Ney reserve some probability mass for unseen cases, so they are handled effectively.
Step 3: Evaluate other options
Ignoring unseen trigrams or using only unigram probabilities loses context; increasing data alone may not solve sparsity.
Final Answer:
Use smoothing techniques like Kneser-Ney smoothing -> Option C
Quick Check:
Smoothing fixes zero probs for unseen trigrams [OK]
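Kneser-Ney is too involved for a short example, so the sketch below uses add-one (Laplace) smoothing instead. It is a much simpler technique that shows the same basic idea of giving unseen trigrams a small nonzero probability. The toy corpus and function names are assumptions for illustration.

```python
from collections import Counter

def trigram_counts(tokens):
    """Count trigrams and how often each two-word context starts one."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter()
    for (w1, w2, _), count in trigrams.items():
        contexts[(w1, w2)] += count
    return trigrams, contexts

def add_one_prob(w1, w2, w3, trigrams, contexts, vocab_size):
    """Add-one smoothed estimate of P(w3 | w1, w2); this is not Kneser-Ney."""
    return (trigrams[(w1, w2, w3)] + 1) / (contexts[(w1, w2)] + vocab_size)

# Toy corpus, purely illustrative.
tokens = "I love AI and I love NLP".split()
vocab = set(tokens)
trigrams, contexts = trigram_counts(tokens)

seen = add_one_prob("I", "love", "AI", trigrams, contexts, len(vocab))
unseen = add_one_prob("I", "love", "and", trigrams, contexts, len(vocab))
print(seen, unseen)
```

The unseen trigram ("I", "love", "and") gets a small probability instead of zero, and the probabilities for each context still sum to 1 over the vocabulary.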
- Assigning zero probability to unseen trigrams
- Ignoring context by using only unigrams
- Relying solely on more data without smoothing
