NLP · ML · ~20 mins

Embedding layer usage in NLP - ML Experiment: Train & Evaluate

Experiment - Embedding layer usage
Problem: We want to classify movie reviews as positive or negative using a neural network. Currently, the model uses one-hot encoding for words, which creates very large input vectors and trains slowly.
Current Metrics: Training accuracy: 92%, Validation accuracy: 75%, Training loss: 0.25, Validation loss: 0.65
Issue: The model overfits: training accuracy is high but validation accuracy is much lower. Also, one-hot encoding wastes memory and does not capture word meaning.
Your Task
Replace one-hot encoding with an embedding layer to reduce overfitting and improve validation accuracy to at least 80%. Keep training accuracy below 90% to avoid overfitting.
Use the same dataset and model architecture except for input encoding.
Do not increase the number of training epochs beyond 10.
Keep batch size at 32.
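To see why one-hot encoding is wasteful here, it helps to compare the per-review memory footprint of the two encodings. The arithmetic below is a rough sketch using the exercise's numbers (10,000-word vocabulary, 100-token reviews, 50-dimensional embeddings); the exact figures depend on dtype and framework overhead.

```python
# Rough size comparison: one-hot input vs. embedding lookup
# (illustrative arithmetic only, using the exercise's hyperparameters).
vocab_size = 10_000   # max_features in the solution code
seq_len = 100         # maxlen in the solution code
embed_dim = 50        # embedding output_dim in the solution code

# One-hot: each review expands to a seq_len x vocab_size matrix of floats.
one_hot_values_per_review = seq_len * vocab_size      # 1,000,000 values

# Embedding: each review stays as seq_len integer ids; the dense vectors
# come from a single shared lookup table of vocab_size x embed_dim weights.
embedded_values_per_review = seq_len * embed_dim      # 5,000 values
embedding_table_params = vocab_size * embed_dim       # 500,000 shared weights

print(f"one-hot per review:   {one_hot_values_per_review:,} values")
print(f"embedded per review:  {embedded_values_per_review:,} values")
print(f"embedding table size: {embedding_table_params:,} weights (shared)")
```

The embedding table is a fixed one-time cost shared across the whole dataset, while the one-hot blow-up is paid for every single review.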
Solution
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout

# Load data
max_features = 10000  # number of words to consider
maxlen = 100  # cut texts after this number of words

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to same length
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)

# Build model with embedding layer
model = Sequential([
    # Note: the input_length argument was removed in Keras 3; the sequence
    # length is inferred from the padded input instead.
    Embedding(input_dim=max_features, output_dim=50),
    Dropout(0.3),
    Flatten(),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2, verbose=2)

# Evaluate on test data
results = model.evaluate(X_test, y_test, verbose=0)

print(f'Test loss: {results[0]:.3f}, Test accuracy: {results[1]*100:.2f}%')
Replaced one-hot encoded input with integer sequences padded to fixed length.
Added an Embedding layer as the first layer to learn word representations.
Added Dropout after embedding to reduce overfitting.
Kept the rest of the model architecture similar.
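Conceptually, the embedding layer is nothing more than a trainable lookup table: each integer word id selects one row of a weight matrix. The NumPy sketch below illustrates the lookup with toy numbers (small made-up vocabulary and dimensions, not the exercise's trained weights).

```python
import numpy as np

# Toy embedding table: one row of embed_dim floats per vocabulary word.
# In Keras, this matrix is the Embedding layer's trainable weights.
vocab_size, embed_dim = 8, 4
rng = np.random.default_rng(0)
table = rng.normal(size=(vocab_size, embed_dim))

# A padded integer sequence, as produced by pad_sequences.
token_ids = np.array([3, 1, 3, 0])

# The "embedding" is just row indexing: shape (seq_len, embed_dim).
vectors = table[token_ids]

print(vectors.shape)  # (4, 4)
```

Because repeated ids always select the same row, every occurrence of a word shares one learned vector, which is how the layer can learn word-level meaning that one-hot vectors cannot express.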
Results Interpretation

Before: Training accuracy 92%, Validation accuracy 75%, high overfitting.

After: Training accuracy 88%, Validation accuracy 82%, better generalization.

Using an embedding layer helps the model learn meaningful word features in a smaller space, reducing overfitting and improving validation accuracy.
Bonus Experiment
Try adding a recurrent layer (like LSTM) after the embedding layer to capture word order and see if validation accuracy improves further.
💡 Hint
Use tf.keras.layers.LSTM with 32 units after the embedding and dropout layers.
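Following that hint, one way the bonus model could look is sketched below: the LSTM replaces Flatten, reading the embedded tokens in order. The layer sizes are the exercise's suggestions, not tuned values, and results will vary by run.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

max_features = 10000  # same vocabulary size as the main experiment

# Bonus-experiment sketch: Embedding -> Dropout -> LSTM -> sigmoid output.
model = Sequential([
    Embedding(input_dim=max_features, output_dim=50),
    Dropout(0.3),
    LSTM(32),                       # processes tokens sequentially, so word
                                    # order now influences the prediction
    Dense(1, activation='sigmoid')  # positive/negative probability
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

Train it with the same `model.fit` call as before (10 epochs, batch size 32, 20% validation split) and compare validation accuracy against the Flatten-based model.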