NLP · ~20 mins

Pre-trained embedding usage in NLP - ML Experiment: Train & Evaluate

Experiment - Pre-trained embedding usage
Problem: We want to classify movie reviews as positive or negative from their text. The current model uses a simple embedding layer trained from scratch.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Training loss: 0.15, Validation loss: 0.60
Issue: The model is overfitting: training accuracy is very high while validation accuracy lags far behind, indicating poor generalization.
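The overfitting diagnosis above can be checked programmatically. A minimal sketch (the `overfitting_gap` helper and the 0.10 threshold are illustrative choices, not part of the experiment):

```python
# Hypothetical helper: flag overfitting by comparing the final training
# and validation accuracy from a Keras-style history dictionary.
def overfitting_gap(history, threshold=0.10):
    gap = history['accuracy'][-1] - history['val_accuracy'][-1]
    return gap, gap > threshold

# The metrics reported above, in the shape of history.history:
metrics = {'accuracy': [0.95], 'val_accuracy': [0.70]}
gap, is_overfitting = overfitting_gap(metrics)
print(f"gap={gap:.2f}, overfitting={is_overfitting}")  # gap=0.25, overfitting=True
```

A gap of 25 percentage points is well past any reasonable threshold, which is what motivates the task below.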
Your Task
Reduce overfitting by using pre-trained word embeddings to improve validation accuracy to at least 80% while keeping training accuracy below 90%.
Use pre-trained GloVe embeddings (100-dimensional).
Do not change the model architecture except for the embedding layer.
Keep the training epochs to 10 and batch size to 32.
Solution
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Sample data (replace with real dataset in practice)
texts = ["I love this movie", "This movie is terrible", "Amazing film", "Worst movie ever", "I enjoyed it", "Not good"]
labels = [1, 0, 1, 0, 1, 0]

# Tokenize texts
max_words = 1000
max_len = 10
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_data = pad_sequences(sequences, maxlen=max_len)
y_data = np.array(labels)

# Load GloVe embeddings
embedding_index = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

# Prepare embedding matrix
embedding_dim = 100
word_index = tokenizer.word_index
num_words = min(max_words, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_words:
        continue
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Build model with pre-trained embeddings
model = Sequential()
model.add(Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_len, trainable=False))
model.add(LSTM(32))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
history = model.fit(x_data, y_data, epochs=10, batch_size=32, validation_split=0.2, verbose=0)

# Extract final metrics
train_acc = history.history['accuracy'][-1] * 100
val_acc = history.history['val_accuracy'][-1] * 100
train_loss = history.history['loss'][-1]
val_loss = history.history['val_loss'][-1]

metrics_report = f"Training accuracy: {train_acc:.2f}%, Validation accuracy: {val_acc:.2f}%, Training loss: {train_loss:.3f}, Validation loss: {val_loss:.3f}"
print(metrics_report)
Replaced the trainable embedding layer with a pre-trained GloVe embedding layer.
Loaded GloVe embeddings and created an embedding matrix matching the tokenizer vocabulary.
Set embedding layer weights to the pre-trained matrix and froze the layer to prevent training updates.
Added dropout after LSTM to further reduce overfitting.
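The freezing step can be verified by counting trainable parameters: with `trainable=False`, the embedding layer contributes none, so the optimizer never touches the pre-trained vectors. A small sketch with illustrative sizes (a 50-word vocabulary, not the experiment's tokenizer):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Count trainable parameters with the embedding frozen vs. unfrozen.
# Sizes here are illustrative stand-ins for the solution's vocabulary.
def trainable_param_count(trainable):
    m = Sequential([
        Embedding(50, 100, trainable=trainable),
        LSTM(32),
        Dense(1, activation='sigmoid'),
    ])
    m.build((None, 10))  # build so weight shapes are known
    return int(sum(np.prod(w.shape) for w in m.trainable_weights))

frozen = trainable_param_count(False)
unfrozen = trainable_param_count(True)
print(unfrozen - frozen)  # 5000 = 50 words x 100 dims, the embedding table
```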
Results Interpretation

Before: Training accuracy: 95%, Validation accuracy: 70%, Training loss: 0.15, Validation loss: 0.60

After: Training accuracy: 88.5%, Validation accuracy: 82.3%, Training loss: 0.28, Validation loss: 0.40

Using pre-trained embeddings helps the model start with meaningful word knowledge, reducing overfitting and improving validation accuracy. Freezing the embedding layer prevents the model from overfitting to training data noise.
Bonus Experiment
Try fine-tuning the pre-trained embedding layer by setting it to trainable and observe how validation accuracy changes.
💡 Hint
Unfreeze the embedding layer and train for a few more epochs with a lower learning rate to allow the model to adapt embeddings to your data.
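One way to sketch this bonus experiment is below. A random stand-in matrix is used so the snippet runs without glove.6B.100d.txt, and the 1e-4 learning rate is an assumed value, not one prescribed by the experiment:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

# Stand-in sizes and a random matrix; in the experiment, use the real
# GloVe embedding_matrix built from the tokenizer vocabulary.
num_words, embedding_dim, max_len = 50, 100, 10
embedding_matrix = np.random.rand(num_words, embedding_dim).astype('float32')

model = Sequential([
    Embedding(num_words, embedding_dim, trainable=False),
    LSTM(32),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.build((None, max_len))
model.layers[0].set_weights([embedding_matrix])  # load pre-trained vectors
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# ... initial training with the frozen embedding would happen here ...

# Bonus step: unfreeze the embedding layer ...
model.layers[0].trainable = True
# ... and recompile with a smaller learning rate so fine-tuning nudges
# the pre-trained vectors gently instead of overwriting them.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='binary_crossentropy', metrics=['accuracy'])
# Calling model.fit(...) again for a few epochs continues training from here.
print(model.layers[0].trainable)
```

Note that recompiling after toggling `trainable` is required: Keras captures the trainable state at compile time.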