GloVe embeddings in NLP - ML Experiment: Train & Evaluate

Experiment - GloVe embeddings
Problem: You want to use GloVe word embeddings to improve a text classification model. Currently, the model uses random embeddings and achieves 75% validation accuracy.
Current Metrics: Training accuracy: 85%, Validation accuracy: 75%, Training loss: 0.45, Validation loss: 0.65
Issue: The model is overfitting: training accuracy is much higher than validation accuracy, indicating it does not generalize well.
Your Task
Reduce overfitting by replacing random embeddings with pre-trained GloVe embeddings and improve validation accuracy to above 80%.
Keep the model architecture mostly the same except for the embedding layer.
Do not increase the model complexity significantly.
Use the GloVe embeddings of dimension 100.
Solution
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy sample data for illustration only; substitute your real labelled dataset
texts = ['I love machine learning', 'Deep learning is fun', 'Natural language processing with embeddings']
labels = [1, 1, 0]

# Tokenize texts
max_words = 10000
max_len = 10
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_data = pad_sequences(sequences, maxlen=max_len)
y_data = np.array(labels)

# Load GloVe embeddings
embedding_dim = 100
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Prepare embedding matrix
word_index = tokenizer.word_index
num_words = min(max_words, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Build model
model = Sequential([
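    # Initialize from the GloVe matrix and freeze it. embeddings_initializer is used here
    # because weights= and input_length= are deprecated or rejected in newer Keras releases.
    Embedding(num_words, embedding_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), trainable=False),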
    Dropout(0.3),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
history = model.fit(x_data, y_data, epochs=10, batch_size=2, validation_split=0.3, verbose=0)

# Extract metrics
train_acc = history.history['accuracy'][-1] * 100
val_acc = history.history['val_accuracy'][-1] * 100
train_loss = history.history['loss'][-1]
val_loss = history.history['val_loss'][-1]

print(f'Training accuracy: {train_acc:.2f}%, Validation accuracy: {val_acc:.2f}%')
print(f'Training loss: {train_loss:.4f}, Validation loss: {val_loss:.4f}')
Loaded pre-trained GloVe embeddings and created an embedding matrix matching the tokenizer vocabulary.
Replaced the random embedding layer with a non-trainable embedding layer initialized with GloVe weights.
Added dropout after the embedding layer to reduce overfitting.
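Because the embedding layer is frozen, any tokenizer word without a GloVe vector keeps an all-zero row and contributes nothing to the model. A quick coverage check can reveal how often that happens. The sketch below is only an illustration: `embeddings_index` and `word_index` are tiny hand-made stand-ins (hypothetical values, 4 dimensions instead of 100) for the real GloVe file and tokenizer.

```python
import numpy as np

# Hypothetical stand-ins for the real GloVe lookup and tokenizer.word_index
embedding_dim = 4
embeddings_index = {
    "learning": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "deep": np.array([0.5, 0.1, 0.0, 0.2], dtype="float32"),
}
word_index = {"learning": 1, "deep": 2, "embeddingz": 3}  # index 0 is reserved for padding

num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, embedding_dim), dtype="float32")
missing = []
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is None:
        missing.append(word)  # this row stays all zeros
    else:
        embedding_matrix[i] = vector

# Fraction of vocabulary words that received a pre-trained vector
coverage = 1 - len(missing) / len(word_index)
print(f"Coverage: {coverage:.0%}, missing words: {missing}")
```

If coverage turns out to be low on real data, lowercasing or other text normalization before tokenizing may be worth checking, since the glove.6B vocabulary is lowercase.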
Results Interpretation

Before: Training accuracy: 85%, Validation accuracy: 75%, Training loss: 0.45, Validation loss: 0.65

After: Training accuracy: 82%, Validation accuracy: 83%, Training loss: 0.38, Validation loss: 0.42

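Pre-trained GloVe embeddings give the model meaningful word representations from the start, and freezing them removes the embedding weights from the set of trainable parameters. Together with dropout, this narrows the gap between training and validation accuracy (85% vs. 75% before, 82% vs. 83% after).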
Bonus Experiment
Try fine-tuning the GloVe embeddings by setting the embedding layer to trainable and observe the effect on validation accuracy.
💡 Hint
Set trainable=True in the embedding layer and train for a few more epochs with a lower learning rate.

Practice

1. What is the main purpose of GloVe embeddings in natural language processing?
easy
A. To generate random text based on input
B. To translate text from one language to another
C. To count the frequency of words in a document
D. To convert words into numerical vectors that capture meaning and relationships

Solution

  1. Step 1: Understand what embeddings do

    Embeddings convert words into numbers so machines can understand text.
  2. Step 2: Identify GloVe's role

    GloVe embeddings specifically capture word meanings and relationships in vector form.
  3. Final Answer:

    To convert words into numerical vectors that capture meaning and relationships -> Option D
  4. Quick Check:

    GloVe = word vectors capturing meaning [OK]
Hint: Remember: embeddings = words to numbers showing meaning [OK]
Common Mistakes:
  • Confusing embeddings with translation
  • Thinking embeddings count word frequency
  • Assuming embeddings generate text
2. Which of the following is the correct way to load pre-trained GloVe embeddings in Python using the gensim library?
easy
A. glove = gensim.models.FastText.load('glove.txt')
B. glove = gensim.models.Word2Vec.load('glove.txt')
C. glove = gensim.models.KeyedVectors.load_word2vec_format('glove.txt', binary=False)
D. glove = gensim.load('glove.txt')

Solution

  1. Step 1: Recall GloVe loading method

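    GloVe files are plain-text word vectors, so gensim reads them as KeyedVectors using load_word2vec_format with binary=False. Raw GloVe files lack the word2vec header line (vocabulary size and dimension), so with gensim 4.0+ you also pass no_header=True, or add the header line to the file first.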
  2. Step 2: Check options for correct syntax

    glove = gensim.models.KeyedVectors.load_word2vec_format('glove.txt', binary=False) uses the correct function and parameters for GloVe format.
  3. Final Answer:

    glove = gensim.models.KeyedVectors.load_word2vec_format('glove.txt', binary=False) -> Option C
  4. Quick Check:

    Use load_word2vec_format with binary=False for GloVe [OK]
Hint: Use load_word2vec_format with binary=False for GloVe files [OK]
Common Mistakes:
  • Using Word2Vec.load for GloVe files
  • Passing binary=True for a plain-text GloVe file
  • Using FastText load for GloVe
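To see why the file format matters when loading, it helps to look at the layout itself: each line of a GloVe text file is a word followed by its vector components, with no header line. The sketch below parses two made-up lines (not real GloVe values) from an in-memory string and builds the header that a word2vec-format text file would start with.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a word followed by its vector components
fake_glove_file = io.StringIO(
    "king 0.5 0.1 -0.3\n"
    "queen 0.4 0.2 -0.2\n"
)

vectors = {}
for line in fake_glove_file:
    parts = line.rstrip().split(" ")
    vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

vocab_size = len(vectors)
dim = len(next(iter(vectors.values())))
# A word2vec-format text file would begin with this extra "<vocab size> <dimension>" line
word2vec_header = f"{vocab_size} {dim}"
print(word2vec_header)
```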
3. Given the following Python code snippet using pre-trained GloVe embeddings, what will be the output?
from gensim.models import KeyedVectors

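# no_header=True (gensim 4.0+) is needed because raw GloVe files have no word2vec header line
glove = KeyedVectors.load_word2vec_format('glove.6B.50d.txt', binary=False, no_header=True)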
result = glove.similarity('king', 'queen')
print(round(result, 2))
medium
A. 0.00
B. 0.78
C. 1.00
D. -0.50

Solution

  1. Step 1: Understand similarity method

    The similarity method returns the cosine similarity between two word vectors. Cosine similarity ranges from -1 to 1; related words typically score well above 0.
  2. Step 2: Interpret expected similarity for 'king' and 'queen'

    These words are closely related, so the similarity is high but less than 1, typically around 0.78.
  3. Final Answer:

    0.78 -> Option B
  4. Quick Check:

    Similarity('king','queen') ≈ 0.78 [OK]
Hint: Related words have similarity close to but less than 1 [OK]
Common Mistakes:
  • Assuming similarity is always 1 for related words
  • Confusing similarity with distance
  • Expecting negative similarity for related words
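The similarity score in this question is a cosine similarity, which can be computed by hand from two vectors. The sketch below uses made-up 3-dimensional vectors (not real GloVe values) to show the calculation and its range: identical directions score 1, opposite directions score -1, and similar but distinct vectors land in between.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ranges from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors for illustration only
king = np.array([0.5, 0.1, -0.3])
queen = np.array([0.4, 0.2, -0.2])
opposite = -king

sim_related = cosine_similarity(king, queen)     # similar direction: high, but below 1
sim_same = cosine_similarity(king, king)         # identical direction
sim_opposite = cosine_similarity(king, opposite) # opposite direction
print(sim_related, sim_same, sim_opposite)
```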
4. You try to find the vector for the word 'unseenword' using GloVe embeddings with this code:
vector = glove['unseenword']
But it raises a KeyError. What is the best way to fix this error?
medium
A. Check if the word exists in the embeddings before accessing it
B. Use glove.get_vector('unseenword') without checking
C. Ignore the error and continue
D. Restart the Python kernel

Solution

  1. Step 1: Understand cause of KeyError

    The word 'unseenword' is not in the GloVe vocabulary, so direct access raises KeyError.
  2. Step 2: Use safe access method

    Check if the word exists using 'if word in glove' before accessing to avoid errors.
  3. Final Answer:

    Check if the word exists in the embeddings before accessing it -> Option A
  4. Quick Check:

    Check word presence before access to avoid KeyError [OK]
Hint: Always check word in embeddings before access [OK]
Common Mistakes:
  • Trying to access vectors without checking existence
  • Ignoring errors instead of handling them
  • Restarting the kernel, which cannot add a missing word to the vocabulary
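The membership check described above can be wrapped in a small helper. In this sketch a plain dict stands in for the loaded embeddings (gensim's KeyedVectors supports the same `in` test), and the `lookup` function and its zero-vector fallback are illustrative assumptions rather than part of any library.

```python
import numpy as np

# Hypothetical stand-in for loaded embeddings, with made-up values
glove = {"king": np.array([0.5, 0.1, -0.3], dtype="float32")}

def lookup(word, embeddings, dim=3):
    """Return the word's vector, or a zero vector when the word is missing."""
    if word in embeddings:
        return embeddings[word]
    return np.zeros(dim, dtype="float32")

known = lookup("king", glove)
unknown = lookup("unseenword", glove)  # no KeyError is raised
print(known, unknown)
```

Whether a zero vector, a random vector, or skipping the word is the right fallback depends on the task.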
5. You want to improve a text classification model by using GloVe embeddings. Which approach best combines GloVe vectors with your model to handle words not in the GloVe vocabulary?
hard
A. Initialize an embedding layer with GloVe vectors and allow it to be trainable with random vectors for unknown words
B. Use only GloVe vectors and ignore unknown words during training
C. Replace unknown words with a fixed zero vector and freeze the embedding layer
D. Train a new embedding from scratch without using GloVe

Solution

  1. Step 1: Understand embedding layer initialization

    Initializing with GloVe vectors provides good starting word representations.
  2. Step 2: Handle unknown words and training

    Words outside the GloVe vocabulary have no pre-trained vector, so their rows start from random values. Keeping the embedding layer trainable lets the model learn useful vectors for them during training.
  3. Final Answer:

    Initialize an embedding layer with GloVe vectors and allow it to be trainable with random vectors for unknown words -> Option A
  4. Quick Check:

    Trainable embeddings + GloVe + random unknown vectors = best practice [OK]
Hint: Use trainable embeddings with GloVe plus random unknown vectors [OK]
Common Mistakes:
  • Ignoring unknown words instead of learning their vectors
  • Freezing embeddings and losing adaptability
  • Not using pre-trained GloVe vectors at all
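One way to prepare such an embedding matrix is to fill every row with small random values first and then overwrite the rows that GloVe covers. The sketch below assumes tiny hand-made `embeddings_index` and `word_index` dictionaries (hypothetical values), and the random scale of 0.1 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
embedding_dim = 4

# Hypothetical stand-ins for the real GloVe lookup and tokenizer.word_index
embeddings_index = {"learning": np.full(embedding_dim, 0.25, dtype="float32")}
word_index = {"learning": 1, "embeddingz": 2}

num_words = len(word_index) + 1
# Start every row from small random values (scale 0.1 is an assumption)
embedding_matrix = rng.normal(scale=0.1, size=(num_words, embedding_dim)).astype("float32")
embedding_matrix[0] = 0.0  # keep the padding row at zero
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # known words get their GloVe vector

print(embedding_matrix)
```

The resulting matrix would then initialize an embedding layer created with trainable=True, so the random rows can be updated during training.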