
Language modeling concept in NLP - ML Experiment: Train & Evaluate

Experiment - Language modeling concept
Problem: Build a simple language model that predicts the next word in a sentence using a small dataset of sentences.
Current Metrics: Training loss: 1.2, Validation loss: 2.5, Training accuracy: 70%, Validation accuracy: 45%
Issue: The model is overfitting: training accuracy is much higher than validation accuracy, and validation loss is much higher than training loss.
Your Task
Reduce overfitting so that validation accuracy improves to at least 60% while keeping training accuracy below 75%.
You can only modify the model architecture and training hyperparameters.
Do not change the dataset or preprocessing steps.
Solution
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Sample dataset (toy example)
sentences = [
    'I love machine learning',
    'Machine learning is fun',
    'I enjoy learning new things',
    'Deep learning is a branch of machine learning',
    'Natural language processing is interesting'
]

# Simple tokenizer and data preparation
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = []
for sentence in sentences:
    token_list = tokenizer.texts_to_sequences([sentence])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        sequences.append(n_gram_sequence)

max_seq_len = max(len(seq) for seq in sequences)
sequences = pad_sequences(sequences, maxlen=max_seq_len, padding='pre')

import numpy as np
sequences = np.array(sequences)
X = sequences[:, :-1]
y = sequences[:, -1]
vocab_size = len(tokenizer.word_index) + 1

# Build model with dropout and reduced units
model = Sequential([
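    # input_length is omitted: it is deprecated in recent Keras releases, and the
    # sequence length is inferred from the input data
    Embedding(vocab_size, 10),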
    LSTM(32, return_sequences=False),
    Dropout(0.3),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(X, y, epochs=50, batch_size=4, validation_split=0.2, callbacks=[early_stop], verbose=0)

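# Report metrics from the epoch with the lowest validation loss: with
# restore_best_weights=True the kept weights come from that epoch, not the last one
best_epoch = int(np.argmin(history.history['val_loss']))
final_train_acc = history.history['accuracy'][best_epoch] * 100
final_val_acc = history.history['val_accuracy'][best_epoch] * 100
final_train_loss = history.history['loss'][best_epoch]
final_val_loss = history.history['val_loss'][best_epoch]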

print(f'Training accuracy: {final_train_acc:.2f}%')
print(f'Validation accuracy: {final_val_acc:.2f}%')
print(f'Training loss: {final_train_loss:.3f}')
print(f'Validation loss: {final_val_loss:.3f}')
Added a Dropout layer with rate 0.3 after the LSTM layer to reduce overfitting.
Reduced LSTM units from a higher number (e.g., 64) to 32 to simplify the model.
Added EarlyStopping callback to stop training when validation loss stops improving.
Kept the Adam optimizer at its default learning rate (0.001) for stable training.
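To make the dropout change above more concrete, here is a minimal pure-Python sketch of what a dropout layer does conceptually (so-called inverted dropout). The function name `apply_dropout` and the sample activations are illustrative assumptions and are not part of the solution; Keras performs this step internally.

```python
import random

def apply_dropout(activations, rate, training, rng):
    """Inverted dropout: during training, zero each unit with probability `rate`
    and scale the survivors by 1 / (1 - rate); otherwise pass values through."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [0.5, -1.2, 0.8, 0.1, 2.0, -0.7]  # hypothetical LSTM outputs

train_out = apply_dropout(hidden, rate=0.3, training=True, rng=rng)
eval_out = apply_dropout(hidden, rate=0.3, training=False, rng=rng)

print(train_out)  # some entries zeroed, the rest scaled up
print(eval_out)   # unchanged: dropout is inactive at inference time
```

Because a random subset of units is silenced at every training step, the network cannot rely on any single unit, which is the intuition behind dropout as a regularizer.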
Results Interpretation

Before: Training accuracy 70%, Validation accuracy 45%, Training loss 1.2, Validation loss 2.5

After: Training accuracy 72%, Validation accuracy 62%, Training loss 0.85, Validation loss 1.10

Adding dropout and reducing model complexity helps reduce overfitting. Early stopping prevents training too long. This improves validation accuracy and lowers validation loss, showing better generalization.
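The early-stopping rule can be sketched in plain Python. This is a simplified illustration of patience-based stopping, not the Keras implementation; the function name `early_stopping_epoch` and the validation-loss curve are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) as 0-based indices.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far; otherwise it runs through the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation-loss curve: improves, then starts to overfit
val_losses = [2.5, 2.0, 1.6, 1.3, 1.1, 1.15, 1.2, 1.3, 1.4, 1.5, 1.7]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=5)
print(stop_epoch, best_epoch)  # 9 4
```

With patience 5, training halts at epoch index 9, and restoring the best weights would roll the model back to epoch index 4, where validation loss was lowest.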
Bonus Experiment
Try using a smaller learning rate and adding batch normalization to see if validation accuracy improves further.
💡 Hint
Lower learning rates can help the model converge more smoothly. Batch normalization can stabilize and speed up training.
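One possible setup for this bonus experiment is sketched below. The helper name `build_bonus_model`, the learning rate of 5e-4, and the placement of `BatchNormalization` after the LSTM are assumptions chosen for illustration; whether they actually improve validation accuracy on this dataset has to be checked by running the experiment.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, BatchNormalization, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def build_bonus_model(vocab_size, seq_len, learning_rate=5e-4):
    """Same architecture as the solution, plus batch normalization and a
    smaller (assumed) learning rate than Adam's default of 0.001."""
    model = Sequential([
        Input(shape=(seq_len,)),
        Embedding(vocab_size, 10),
        LSTM(32),
        BatchNormalization(),  # normalizes the LSTM output per batch
        Dropout(0.3),
        Dense(vocab_size, activation='softmax'),
    ])
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer=Adam(learning_rate=learning_rate),
        metrics=['accuracy'],
    )
    return model

# Placeholder sizes; in the experiment these would be vocab_size and max_seq_len - 1
bonus_model = build_bonus_model(vocab_size=20, seq_len=7)
bonus_model.summary()
```

Note that batch statistics are noisy with very small batches such as `batch_size=4`, so it may be worth comparing runs with and without the extra layer.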

Practice

1. What is the main goal of a language model in natural language processing?
easy
A. To predict the next word in a sentence
B. To translate text from one language to another
C. To count the number of words in a document
D. To summarize long paragraphs into short sentences

Solution

  1. Step 1: Understand the purpose of language models

    Language models assign probabilities to sequences of words, which lets them predict how a text is likely to continue.
  2. Step 2: Identify the main task of language models

    The core task is to predict the next word based on previous words in a sentence.
  3. Final Answer:

    To predict the next word in a sentence -> Option A
  4. Quick Check:

    Language model goal = predict next word [OK]
Hint: Language models guess the next word in text [OK]
Common Mistakes:
  • Confusing language modeling with translation
  • Thinking language models only count words
  • Assuming summarization is the main task
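As a rough illustration of next-word prediction, the sketch below counts which word follows which in a tiny corpus and predicts the most frequent follower. The corpus and the helper name `predict_next` are assumptions made for the example; real language models use far more context and data.

```python
from collections import Counter, defaultdict

sentences = [
    "i love machine learning",
    "machine learning is fun",
    "i enjoy learning new things",
]

# Count which words follow each word in the corpus
next_word_counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_word_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = next_word_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("machine"))  # learning
print(predict_next("unseen"))   # None
```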
2. Which of the following is the correct way to represent a bigram language model probability for a sentence "I love AI"?
easy
A. P(I) * P(love) * P(AI)
B. P(I | AI) * P(love | I) * P(AI | love)
C. P(I | love) * P(love | AI) * P(AI)
D. P(I) * P(love | I) * P(AI | love)

Solution

  1. Step 1: Recall bigram model definition

    A bigram model predicts each word based on the previous word, so probabilities are conditional.
  2. Step 2: Apply bigram probabilities to the sentence

    The sentence probability is P(I) * P(love | I) * P(AI | love), starting with the first word's probability.
  3. Final Answer:

    P(I) * P(love | I) * P(AI | love) -> Option D
  4. Quick Check:

    Bigram = word depends on previous word [OK]
Hint: Bigram means each word depends on the one before [OK]
Common Mistakes:
  • Multiplying independent word probabilities (unigram)
  • Using wrong conditional order
  • Confusing bigram with trigram or other models
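The bigram factorization from this question can be written as a small function. The probability values below are hypothetical and chosen only to show the arithmetic; `bigram_sentence_prob` is an illustrative name.

```python
def bigram_sentence_prob(words, start_probs, bigram_probs):
    """P(w1) times the product of P(w_i | w_{i-1}) for every later word.

    bigram_probs maps (previous_word, current_word) to P(current | previous).
    """
    prob = start_probs[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_probs[(prev, cur)]
    return prob

# Hypothetical probabilities, for illustration only
start_probs = {"I": 0.2}
bigram_probs = {("I", "love"): 0.3, ("love", "AI"): 0.4}

p = bigram_sentence_prob(["I", "love", "AI"], start_probs, bigram_probs)
print(p)  # approximately 0.024 = 0.2 * 0.3 * 0.4
```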
3. Given the following unigram probabilities: P(I)=0.2, P(love)=0.1, P(AI)=0.05, what is the probability of the sentence "I love AI" under a unigram model?
medium
A. 0.01
B. 0.001
C. 0.35
D. 0.0001

Solution

  1. Step 1: Understand unigram model calculation

    Unigram model assumes words are independent, so multiply their probabilities.
  2. Step 2: Calculate sentence probability

    Multiply P(I) * P(love) * P(AI) = 0.2 * 0.1 * 0.05 = 0.001
  3. Final Answer:

    0.001 -> Option B
  4. Quick Check:

    Unigram multiply all word probs = 0.001 [OK]
Hint: Multiply all word probabilities for unigram [OK]
Common Mistakes:
  • Adding probabilities instead of multiplying
  • Using conditional probabilities (bigram) by mistake
  • Misplacing the decimal point when multiplying small probabilities
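The calculation can be checked in a few lines of Python. The sketch also shows the equivalent sum of log-probabilities, a common way to avoid numerical underflow on long sentences; that part is an addition for illustration and is not required by the question.

```python
import math

unigram_probs = {"I": 0.2, "love": 0.1, "AI": 0.05}  # values from the question
sentence = ["I", "love", "AI"]

# Direct product of independent word probabilities
direct = math.prod(unigram_probs[w] for w in sentence)

# Same quantity computed in log space, then converted back
log_prob = sum(math.log(unigram_probs[w]) for w in sentence)

print(direct)              # approximately 0.001
print(math.exp(log_prob))  # approximately 0.001
```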
4. Consider this Python code snippet for a bigram model probability calculation:
sentence = ['I', 'love', 'AI']
bigram_probs = {('I', 'love'): 0.3, ('love', 'AI'): 0.4}
prob = 1.0
for i in range(len(sentence)-1):
    prob *= bigram_probs[(sentence[i], sentence[i+1])]
print(prob)

What happens when this code runs?
medium
A. No error, prints 0.12
B. TypeError due to wrong data type in multiplication
C. KeyError because the first word probability is missing
D. IndexError because of range length

Solution

  1. Step 1: Analyze the loop and dictionary access

    The loop multiplies probabilities for bigrams in the sentence using bigram_probs dictionary keys.
  2. Step 2: Check if all bigrams exist in dictionary

    bigram_probs has no entry for the first word on its own, but the code only looks up word pairs, so that missing entry is never accessed.
  3. Step 3: Re-examine the code logic

    All bigrams ('I','love') and ('love','AI') exist in dictionary, so no KeyError. No TypeError or IndexError expected.
  4. Final Answer:

    No error, prints 0.12 -> Option A
  5. Quick Check:

    All bigrams found, multiply 0.3*0.4=0.12 [OK]
Hint: Check if all keys exist before dictionary access [OK]
Common Mistakes:
  • Assuming first word needs separate probability
  • Confusing KeyError with IndexError
  • Ignoring dictionary key structure
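To see when a KeyError would actually occur, the sketch below wraps the question's loop in a function (the name `bigram_product` is illustrative) and runs it on a second, hypothetical sentence that contains a pair missing from the dictionary.

```python
bigram_probs = {('I', 'love'): 0.3, ('love', 'AI'): 0.4}

def bigram_product(sentence):
    """Multiply the probabilities of consecutive word pairs, as in the question."""
    prob = 1.0
    for i in range(len(sentence) - 1):  # last i is len - 2, so i + 1 stays in range
        prob *= bigram_probs[(sentence[i], sentence[i + 1])]
    return prob

# Every pair is a key in the dictionary, so this runs without error
print(bigram_product(['I', 'love', 'AI']))

# ('love', 'NLP') is not a key, so the lookup raises KeyError
try:
    bigram_product(['I', 'love', 'NLP'])
except KeyError as err:
    missing_pair = err.args[0]
    print('missing bigram:', missing_pair)
```

Note that the loop never multiplies in a probability for the first word, so the result is P(love | I) * P(AI | love) rather than the full sentence probability.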
5. You want to build a trigram language model to predict the next word given two previous words. Which approach best handles the problem of unseen trigrams in your training data?
hard
A. Only use unigram probabilities for all predictions
B. Ignore unseen trigrams and assign zero probability
C. Use smoothing techniques like Kneser-Ney smoothing
D. Increase the training data size without smoothing

Solution

  1. Step 1: Understand the unseen trigram problem

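    A trigram that never appears in the training data gets probability zero under plain count-based estimation, so any sentence containing it is also assigned probability zero.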
  2. Step 2: Identify solution to zero probability issue

    Smoothing techniques like Kneser-Ney adjust probabilities to handle unseen cases effectively.
  3. Step 3: Evaluate other options

    Assigning zero probability to unseen trigrams makes valid sentences look impossible, using only unigram probabilities discards context, and increasing the data alone may not solve sparsity.
  4. Final Answer:

    Use smoothing techniques like Kneser-Ney smoothing -> Option C
  5. Quick Check:

    Smoothing fixes zero probs for unseen trigrams [OK]
Hint: Use smoothing to avoid zero probabilities [OK]
Common Mistakes:
  • Assigning zero probability to unseen trigrams
  • Ignoring context by using only unigrams
  • Relying solely on more data without smoothing
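The effect of smoothing can be shown with a much simpler technique than Kneser-Ney. The sketch below uses add-k (Laplace) smoothing, which is a deliberate substitution to keep the example short and is not the method named in the answer; the corpus and the name `trigram_prob` are assumptions.

```python
from collections import Counter

corpus = [
    "i love machine learning".split(),
    "machine learning is fun".split(),
    "i love deep learning".split(),
]

vocab = {w for sent in corpus for w in sent}

# Count trigrams and their two-word contexts
trigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for w1, w2, w3 in zip(sent, sent[1:], sent[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        context_counts[(w1, w2)] += 1

def trigram_prob(w1, w2, w3, k=1.0):
    """Add-k estimate of P(w3 | w1, w2); k=0 gives the unsmoothed estimate."""
    numerator = trigram_counts[(w1, w2, w3)] + k
    denominator = context_counts[(w1, w2)] + k * len(vocab)
    return numerator / denominator if denominator else 0.0

seen = trigram_prob("i", "love", "machine")         # trigram occurs in the corpus
unseen = trigram_prob("i", "love", "fun")           # trigram never occurs
unsmoothed = trigram_prob("i", "love", "fun", k=0)  # zero without smoothing

print(seen, unseen, unsmoothed)
```

The unseen trigram now receives a small non-zero probability instead of zero. Kneser-Ney goes further by discounting observed counts and backing off to lower-order distributions, which generally works better than add-k in practice.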