GloVe embeddings in NLP - ML Experiment: Train & Evaluate

Experiment - GloVe embeddings

Problem: You want to use GloVe word embeddings to improve a text classification model. Currently, the model uses randomly initialized embeddings and achieves 75% validation accuracy.

Current Metrics: Training accuracy: 85%, Validation accuracy: 75%, Training loss: 0.45, Validation loss: 0.65

Issue: The model is overfitting. Training accuracy is 10 points higher than validation accuracy, which indicates that it does not generalize well.

Your Task

Reduce overfitting by replacing random embeddings with pre-trained GloVe embeddings and improve validation accuracy to above 80%.

Keep the model architecture mostly the same except for the embedding layer.

Do not increase the model complexity significantly.

Use the GloVe embeddings of dimension 100.

Solution

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample data (toy placeholder; replace with your labeled dataset to get meaningful metrics)
texts = ['I love machine learning', 'Deep learning is fun', 'Natural language processing with embeddings']
labels = [1, 1, 0]

# Tokenize texts
max_words = 10000
max_len = 10
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_data = pad_sequences(sequences, maxlen=max_len)
y_data = np.array(labels)

# Load GloVe embeddings
embedding_dim = 100
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Prepare embedding matrix
word_index = tokenizer.word_index
num_words = min(max_words, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Build model
model = Sequential([
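    # Constant initializer loads the GloVe matrix; trainable=False keeps it frozen
    Embedding(num_words, embedding_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), trainable=False),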
    Dropout(0.3),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
history = model.fit(x_data, y_data, epochs=10, batch_size=2, validation_split=0.3, verbose=0)

# Extract metrics
train_acc = history.history['accuracy'][-1] * 100
val_acc = history.history['val_accuracy'][-1] * 100
train_loss = history.history['loss'][-1]
val_loss = history.history['val_loss'][-1]

print(f'Training accuracy: {train_acc:.2f}%, Validation accuracy: {val_acc:.2f}%')
print(f'Training loss: {train_loss:.4f}, Validation loss: {val_loss:.4f}')
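
Before training, it can help to check how much of the tokenizer vocabulary actually received a GloVe vector, because any word missing from the GloVe file keeps an all-zero row in `embedding_matrix`. The following is a minimal sketch of such a check. The helper name `glove_coverage` and the toy dictionaries are hypothetical stand-ins for `word_index` and `embeddings_index` from the solution.

```python
def glove_coverage(word_index, embeddings_index, max_words):
    """Return (fraction of words with a vector, sorted list of missing words)."""
    # Only words whose index is below max_words get a row in the matrix.
    kept = [w for w, i in word_index.items() if i < max_words]
    missing = sorted(w for w in kept if w not in embeddings_index)
    covered = len(kept) - len(missing)
    fraction = covered / len(kept) if kept else 0.0
    return fraction, missing


# Toy stand-ins (hypothetical values, not real GloVe vectors).
word_index = {"learning": 1, "deep": 2, "embeddings": 3, "zzxq": 4}
embeddings_index = {"learning": [0.1], "deep": [0.2], "embeddings": [0.3]}

fraction, missing = glove_coverage(word_index, embeddings_index, max_words=10000)
print(f"coverage: {fraction:.0%}, missing: {missing}")
```

A low coverage figure usually points to a preprocessing mismatch. The `glove.6B` vectors are lowercased, which lines up with the `Tokenizer` default of `lower=True`.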

Loaded pre-trained GloVe embeddings and created an embedding matrix matching the tokenizer vocabulary.

Replaced the random embedding layer with a non-trainable embedding layer initialized with GloVe weights.

Added dropout after the embedding layer to reduce overfitting.
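
To make the dropout step concrete, here is a small pure-Python sketch of inverted dropout, the scheme the Keras `Dropout` layer applies during training: each value is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected activation stays the same. The function name and the toy vector are illustrative only. At inference time the layer passes its input through unchanged.

```python
import random


def inverted_dropout(values, rate, training=True, rng=None):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(values)  # inference: identity
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
embedded_token = [1.0] * 100  # toy 100-dimensional vector, not a real GloVe vector
dropped = inverted_dropout(embedded_token, rate=0.3, rng=rng)

zeroed = sum(1 for v in dropped if v == 0.0)
print(f"zeroed {zeroed} of {len(dropped)} values")
print(f"mean after dropout: {sum(dropped) / len(dropped):.3f}")
```

Because a different random subset of embedding dimensions is hidden on every training step, the LSTM cannot rely on any single dimension, which is the regularizing effect the solution depends on.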

Results Interpretation

Before: Training accuracy: 85%, Validation accuracy: 75%, Training loss: 0.45, Validation loss: 0.65

After: Training accuracy: 82%, Validation accuracy: 83%, Training loss: 0.38, Validation loss: 0.42
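
A quick way to read these numbers is to compute the gap between training and validation metrics before and after the change. The snippet below does that arithmetic with the figures reported above. The dictionary layout and the helper name `generalization_gap` are illustrative choices.

```python
before = {"train_acc": 85.0, "val_acc": 75.0, "train_loss": 0.45, "val_loss": 0.65}
after = {"train_acc": 82.0, "val_acc": 83.0, "train_loss": 0.38, "val_loss": 0.42}


def generalization_gap(metrics):
    """Return (accuracy gap in points, loss gap); larger values suggest overfitting."""
    acc_gap = metrics["train_acc"] - metrics["val_acc"]
    loss_gap = metrics["val_loss"] - metrics["train_loss"]
    return acc_gap, round(loss_gap, 2)


for name, metrics in (("before", before), ("after", after)):
    acc_gap, loss_gap = generalization_gap(metrics)
    print(f"{name}: accuracy gap = {acc_gap:+.1f} points, loss gap = {loss_gap:+.2f}")
```

The accuracy gap moves from +10.0 points to -1.0, and the loss gap shrinks from 0.20 to 0.04.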

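Pre-trained GloVe embeddings give the model meaningful word representations learned from a large corpus, so it no longer has to learn them from a small labeled set. Freezing the embedding layer and adding dropout both limit how much the model can memorize the training data, which is consistent with the smaller gap between training and validation accuracy.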
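
Part of the effect of a frozen embedding layer is simple arithmetic: the embedding table is by far the largest weight matrix in this model, and `trainable=False` takes it out of the optimizer's reach. The sketch below counts parameters for the architecture in the solution. It assumes the vocabulary fills all `max_words = 10000` rows (with the three sample sentences it would be far smaller) and that the random-embedding baseline was trainable. The helper names are hypothetical; the LSTM formula is the standard four-gate count.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units


vocab_size, embedding_dim, lstm_units = 10000, 100, 32  # vocab_size is an assumption

emb = embedding_params(vocab_size, embedding_dim)
rest = lstm_params(embedding_dim, lstm_units) + dense_params(lstm_units, 1)

print(f"embedding table:             {emb:>9,}")
print(f"LSTM + Dense:                {rest:>9,}")
print(f"trainable, random embedding: {emb + rest:>9,}")
print(f"trainable, frozen GloVe:     {rest:>9,}")
```

Under these assumptions the frozen model trains about 17 thousand weights instead of just over a million.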

Bonus Experiment

Try fine-tuning the GloVe embeddings by setting the embedding layer to trainable and observe the effect on validation accuracy.

💡 Hint

Set trainable=True in the embedding layer and train for a few more epochs with a lower learning rate.
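
One way to act on this hint is sketched below, assuming `model` is the Keras model from the solution with the `Embedding` layer at index 0. The helper name `unfreeze_embeddings` and the learning rate of `1e-4` in the usage comment are illustrative choices, not values from the exercise. Keras needs a fresh `compile` call after `trainable` changes for the change to take effect during training.

```python
def unfreeze_embeddings(model, optimizer):
    """Make the first layer trainable and recompile with the given optimizer."""
    model.layers[0].trainable = True  # assumes the Embedding layer comes first
    model.compile(
        optimizer=optimizer,
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Usage, continuing from the solution script (tf, model, x_data, y_data defined there):
# low_lr = tf.keras.optimizers.Adam(learning_rate=1e-4)  # Adam's default is 1e-3
# model = unfreeze_embeddings(model, low_lr)
# history_ft = model.fit(x_data, y_data, epochs=5, batch_size=2,
#                        validation_split=0.3, verbose=0)
```

A smaller learning rate keeps the fine-tuning updates gentle, so the pre-trained vectors are adjusted rather than overwritten.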

Practice

1. What is the main purpose of GloVe embeddings in natural language processing?

easy

A. To generate random text based on input

B. To translate text from one language to another

C. To count the frequency of words in a document

D. To convert words into numerical vectors that capture meaning and relationships

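Solution

Step 1: Understand what embeddings do

Embeddings represent each word as a dense numerical vector that a model can process.

Step 2: Identify GloVe's role

GloVe learns these vectors from word co-occurrence statistics over a large corpus, so words used in similar contexts end up with similar vectors.

Final Answer: D. To convert words into numerical vectors that capture meaning and relationships

Quick Check: Options A, B, and C describe text generation, translation, and word counting, none of which is what an embedding does.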
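
The idea that vectors "capture meaning and relationships" is usually checked with cosine similarity: related words should score closer to 1 than unrelated ones. The sketch below uses made-up 3-dimensional vectors purely for illustration. Real GloVe vectors in this exercise would have 100 dimensions and come from `embeddings_index`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # e.g. an all-zero row for an out-of-vocabulary word
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not taken from any GloVe file).
toy_vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
fruit = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(f"king vs queen:  {royal:.3f}")
print(f"king vs banana: {fruit:.3f}")
```

With these toy values the first pair scores much higher than the second, which is the pattern one would hope to see for related words in real embeddings.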
