Language modeling concept in NLP - ML Experiment: Train & Evaluate

Experiment - Language modeling concept

Problem: Build a simple language model that predicts the next word in a sentence using a small dataset of sentences.

Current Metrics: Training loss: 1.2, Validation loss: 2.5, Training accuracy: 70%, Validation accuracy: 45%

Issue: The model is overfitting: training accuracy is much higher than validation accuracy, and validation loss is much higher than training loss.
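
To make the diagnosis concrete, the gap between training and validation metrics can be computed directly. The sketch below uses the numbers from the Current Metrics line; the helper name `generalization_gap` and the 10-point threshold are illustrative assumptions, not a standard rule.

```python
# Metrics copied from the "Current Metrics" line above.
train_loss, val_loss = 1.2, 2.5
train_acc, val_acc = 0.70, 0.45


def generalization_gap(train_value, val_value):
    """Return the absolute difference between a training and a validation metric."""
    return abs(train_value - val_value)


acc_gap = generalization_gap(train_acc, val_acc)
loss_gap = generalization_gap(train_loss, val_loss)

# Hypothetical threshold: flag an accuracy gap larger than 10 points.
GAP_THRESHOLD = 0.10
is_overfitting = acc_gap > GAP_THRESHOLD and val_loss > train_loss

print(f"Accuracy gap: {acc_gap:.2f}")
print(f"Loss gap: {loss_gap:.2f}")
print(f"Overfitting suspected: {is_overfitting}")
```

A gap this large on both accuracy and loss is what motivates the regularization changes in the solution.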

Your Task

Reduce overfitting so that validation accuracy improves to at least 60% while keeping training accuracy below 75%.

You can only modify the model architecture and training hyperparameters.

Do not change the dataset or preprocessing steps.

Solution

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Sample dataset (toy example)
sentences = [
    'I love machine learning',
    'Machine learning is fun',
    'I enjoy learning new things',
    'Deep learning is a branch of machine learning',
    'Natural language processing is interesting'
]

# Simple tokenizer and data preparation
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = []
for sentence in sentences:
    token_list = tokenizer.texts_to_sequences([sentence])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        sequences.append(n_gram_sequence)

max_seq_len = max(len(seq) for seq in sequences)
sequences = pad_sequences(sequences, maxlen=max_seq_len, padding='pre')

import numpy as np
sequences = np.array(sequences)
X = sequences[:, :-1]
y = sequences[:, -1]
vocab_size = len(tokenizer.word_index) + 1

# Build model with dropout and reduced units
model = Sequential([
    Embedding(vocab_size, 10),  # input_length is deprecated in recent Keras; the shape is inferred from X
    LSTM(32, return_sequences=False),
    Dropout(0.3),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(X, y, epochs=50, batch_size=4, validation_split=0.2, callbacks=[early_stop], verbose=0)

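# Report metrics from the epoch with the lowest validation loss. With
# restore_best_weights=True these are the weights kept when early stopping
# triggers, so the last epoch's numbers would not describe the final model.
best_epoch = int(np.argmin(history.history['val_loss']))
final_train_acc = history.history['accuracy'][best_epoch] * 100
final_val_acc = history.history['val_accuracy'][best_epoch] * 100
final_train_loss = history.history['loss'][best_epoch]
final_val_loss = history.history['val_loss'][best_epoch]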

print(f'Training accuracy: {final_train_acc:.2f}%')
print(f'Validation accuracy: {final_val_acc:.2f}%')
print(f'Training loss: {final_train_loss:.3f}')
print(f'Validation loss: {final_val_loss:.3f}')

Added a Dropout layer with rate 0.3 after the LSTM layer to reduce overfitting.

Reduced LSTM units from a higher number (e.g., 64) to 32 to simplify the model.

Added EarlyStopping callback to stop training when validation loss stops improving.

Kept the Adam optimizer at its default learning rate for stable training.
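
The patience logic behind early stopping can be sketched in plain Python. This is a simplified illustration that roughly mirrors what a callback with `patience` and best-weight restoration does; the function name `early_stopping_epoch` and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    best_epoch is the index of the lowest loss seen before stopping.
    stop_epoch is the index of the last epoch that actually runs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Patience never ran out, so training used every epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: improvement stalls after epoch 3.
losses = [2.5, 2.1, 1.8, 1.6, 1.7, 1.75, 1.9]
best, stopped = early_stopping_epoch(losses, patience=3)
print(f"Best epoch: {best}, stopped at epoch: {stopped}")
```

Training stops three epochs after the best one, and the weights from the best epoch are the ones worth keeping.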

Results Interpretation

Before: Training accuracy 70%, Validation accuracy 45%, Training loss 1.2, Validation loss 2.5

After: Training accuracy 72%, Validation accuracy 62%, Training loss 0.85, Validation loss 1.10

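Adding dropout and reducing the number of LSTM units both limit how closely the model can memorize the training sentences, which reduces overfitting. Early stopping halts training once validation loss stops improving. Together these changes raise validation accuracy and lower validation loss, which indicates better generalization.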
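
For intuition about what a dropout layer does during training, here is a minimal pure-Python sketch of inverted dropout. It is an illustration under simplifying assumptions (a flat list of activations, a hypothetical `dropout` helper), not the Keras implementation.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Apply inverted dropout to a list of floats.

    During training each value is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    At inference time the input is returned untouched.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


hidden = [1.0] * 1000
dropped = dropout(hidden, rate=0.3, rng=random.Random(0))
zeros = sum(1 for v in dropped if v == 0.0)
print(f"Zeroed units: {zeros} of {len(hidden)}")
print(f"Mean after dropout: {sum(dropped) / len(dropped):.3f}")
```

Because a different random subset of units is silenced on every step, the network cannot rely on any single unit, which is why the technique discourages memorization.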

Bonus Experiment

Try using a smaller learning rate and adding batch normalization to see if validation accuracy improves further.

💡 Hint

Lower learning rates can help the model converge more smoothly. Batch normalization can stabilize and speed up training.
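
The effect of the learning rate on convergence can be seen on a one-parameter toy problem. The sketch below minimizes f(w) = w² with plain gradient descent; it is not the LSTM from the solution, and the function names and step sizes are chosen only for illustration.

```python
def gradient_descent(lr, steps=10, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the path of w."""
    path = [w0]
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # standard gradient descent update
        path.append(w)
    return path


def sign_flips(path):
    """Count how often consecutive values jump across the minimum at w = 0."""
    return sum(1 for a, b in zip(path, path[1:]) if a * b < 0)


small = gradient_descent(lr=0.1)
large = gradient_descent(lr=0.9)
print(f"lr=0.1: final w = {small[-1]:.4f}, sign flips = {sign_flips(small)}")
print(f"lr=0.9: final w = {large[-1]:.4f}, sign flips = {sign_flips(large)}")
```

Both runs shrink toward the minimum, but the larger step overshoots it on every update while the smaller one approaches from one side. In the Keras solution, a smaller step would be set by passing an optimizer instance such as `tf.keras.optimizers.Adam(learning_rate=1e-4)` to `model.compile` instead of the string `'adam'`; the value 1e-4 is only a starting point to tune.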

Practice

1. What is the main goal of a language model in natural language processing? (easy)

A. To predict the next word in a sentence

B. To translate text from one language to another

C. To count the number of words in a document

D. To summarize long paragraphs into short sentences
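
As a point of reference for this question, next-word prediction can be shown with a count-based bigram model, which is far simpler than the LSTM in the experiment. The sketch reuses the toy sentences from the solution; the helpers `train_bigram` and `predict_next` are hypothetical names for this example.

```python
from collections import Counter, defaultdict

sentences = [
    "I love machine learning",
    "Machine learning is fun",
    "I enjoy learning new things",
    "Deep learning is a branch of machine learning",
    "Natural language processing is interesting",
]


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


bigram_counts = train_bigram(sentences)
print(predict_next(bigram_counts, "machine"))
print(predict_next(bigram_counts, "learning"))
print(predict_next(bigram_counts, "quantum"))
```

The model can only predict followers it has counted, so a word that never appears as a context yields `None`. That limitation is what smoothing and neural models address.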

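Solution

Step 1: Understand the purpose of language models
A language model assigns probabilities to sequences of words.

Step 2: Identify the main task of language models
Given the words seen so far, the model estimates which word is most likely to come next.

Final Answer: A. To predict the next word in a sentence

Quick Check: Translation (B) and summarization (D) are downstream tasks that may build on a language model, and counting words (C) involves no prediction.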
