0
0
NLPml~20 mins

GloVe embeddings in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - GloVe embeddings
Problem:You want to use GloVe word embeddings to improve a text classification model. Currently, the model uses random embeddings and achieves 75% validation accuracy.
Current Metrics:Training accuracy: 85%, Validation accuracy: 75%, Training loss: 0.45, Validation loss: 0.65
Issue:The model is overfitting: training accuracy is much higher than validation accuracy, indicating it does not generalize well.
Your Task
Reduce overfitting by replacing random embeddings with pre-trained GloVe embeddings and improve validation accuracy to above 80%.
Keep the model architecture mostly the same except for the embedding layer.
Do not increase the model complexity significantly.
Use the GloVe embeddings of dimension 100.
Hint 1
Hint 2
Hint 3
Solution
NLP
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data
texts = ['I love machine learning', 'Deep learning is fun', 'Natural language processing with embeddings']
labels = [1, 1, 0]

# Tokenize texts
max_words = 10000
max_len = 10
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_data = pad_sequences(sequences, maxlen=max_len)
y_data = np.array(labels)

# Load GloVe embeddings
embedding_dim = 100
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Prepare embedding matrix
word_index = tokenizer.word_index
num_words = min(max_words, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Build model
model = Sequential([
    Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_len, trainable=False),
    Dropout(0.3),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
history = model.fit(x_data, y_data, epochs=10, batch_size=2, validation_split=0.3, verbose=0)

# Extract metrics
train_acc = history.history['accuracy'][-1] * 100
val_acc = history.history['val_accuracy'][-1] * 100
train_loss = history.history['loss'][-1]
val_loss = history.history['val_loss'][-1]

print(f'Training accuracy: {train_acc:.2f}%, Validation accuracy: {val_acc:.2f}%')
print(f'Training loss: {train_loss:.4f}, Validation loss: {val_loss:.4f}')
Loaded pre-trained GloVe embeddings and created an embedding matrix matching the tokenizer vocabulary.
Replaced the random embedding layer with a non-trainable embedding layer initialized with GloVe weights.
Added dropout after the embedding layer to reduce overfitting.
Results Interpretation

Before: Training accuracy: 85%, Validation accuracy: 75%, Training loss: 0.45, Validation loss: 0.65

After: Training accuracy: 82%, Validation accuracy: 83%, Training loss: 0.38, Validation loss: 0.42

Using pre-trained GloVe embeddings helps the model generalize better by providing meaningful word representations. Adding dropout reduces overfitting, improving validation accuracy.
Bonus Experiment
Try fine-tuning the GloVe embeddings by setting the embedding layer to trainable and observe the effect on validation accuracy.
💡 Hint
Set trainable=True in the embedding layer and train for a few more epochs with a lower learning rate.