Attention mechanism basics in NLP - ML Experiment: Train & Evaluate

Experiment - Attention mechanism basics

Problem: You have a simple neural network model for a text classification task using an attention mechanism. The model currently overfits: training accuracy is very high but validation accuracy is much lower.

Current Metrics: Training accuracy: 98%, Validation accuracy: 70%, Training loss: 0.05, Validation loss: 0.8

Issue: The model overfits the training data and does not generalize well to validation data.

Your Task

Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.

You can only modify the model architecture and training hyperparameters.

Do not change the dataset or preprocessing steps.

Solution

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, Layer
from tensorflow.keras.models import Model
import numpy as np

# Simple attention layer implementation
class SimpleAttention(Layer):
    def __init__(self, **kwargs):
        super(SimpleAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(shape=(input_shape[-1], 1), initializer='random_normal', trainable=True)
        super(SimpleAttention, self).build(input_shape)

    def call(self, inputs):
        scores = tf.matmul(inputs, self.W)  # shape: (batch_size, seq_len, 1)
        weights = tf.nn.softmax(scores, axis=1)  # attention weights
        weighted_sum = tf.reduce_sum(inputs * weights, axis=1)  # shape: (batch_size, features)
        return weighted_sum

# Model parameters
sequence_length = 10
feature_dim = 16
num_classes = 2

inputs = Input(shape=(sequence_length, feature_dim))

# Attention mechanism
attention_output = SimpleAttention()(inputs)

# Add dropout to reduce overfitting
dropout = Dropout(0.3)(attention_output)

# Dense layer with fewer units to reduce complexity
dense = Dense(32, activation='relu')(dropout)

outputs = Dense(num_classes, activation='softmax')(dense)

model = Model(inputs=inputs, outputs=outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

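# Generate dummy data so the script runs end to end.
# Features and labels are random placeholders for the real dataset,
# so validation accuracy on this data stays near chance.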
X_train = np.random.rand(1000, sequence_length, feature_dim).astype(np.float32)
y_train = np.random.randint(0, num_classes, 1000)
X_val = np.random.rand(200, sequence_length, feature_dim).astype(np.float32)
y_val = np.random.randint(0, num_classes, 200)

# Train with early stopping
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(X_train, y_train, epochs=30, batch_size=32, validation_data=(X_val, y_val), callbacks=[early_stop])

# Evaluate final metrics
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)

print(f'Training accuracy: {train_acc*100:.2f}%, Validation accuracy: {val_acc*100:.2f}%')
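
To make the pooling step inside SimpleAttention concrete, here is a minimal NumPy sketch of the same three operations: score each time step, softmax over the sequence axis, then take the weighted sum. The function name `attention_pool` and the random `w` are illustrative stand-ins (in the model, `self.W` is learned during training), so treat this as a sketch rather than the layer itself.

```python
import numpy as np

def attention_pool(inputs, w):
    """Softmax-weighted sum over the sequence axis.

    inputs: array of shape (batch, seq_len, features)
    w:      array of shape (features, 1), standing in for the learned self.W
    Returns (pooled, weights) with shapes (batch, features) and (batch, seq_len, 1).
    """
    scores = inputs @ w                                  # (batch, seq_len, 1)
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilise the softmax
    exp = np.exp(scores)
    weights = exp / exp.sum(axis=1, keepdims=True)       # sums to 1 over seq_len
    pooled = (inputs * weights).sum(axis=1)              # (batch, features)
    return pooled, weights

rng = np.random.default_rng(0)
x = rng.random((2, 10, 16)).astype(np.float32)   # same shape convention as the model input
w = rng.normal(size=(16, 1)).astype(np.float32)  # hypothetical weights, not trained values

pooled, weights = attention_pool(x, w)
print(pooled.shape, weights.shape)
print(weights[0, :, 0].round(3))  # one weight per time step for the first example
```

Each row of weights sums to 1, so the pooled vector is a weighted average of the time steps, with larger weights on the steps that score higher against `w`.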

Added a Dropout layer with rate 0.3 after the attention output to reduce overfitting.

Reduced the Dense layer units from a higher number to 32 to lower model complexity.

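Used early stopping (monitoring validation loss with a patience of 3 epochs and restoring the best weights) so training halts before the model starts to overfit.

Kept the learning rate moderate at 0.001 for stable training.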
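
The early stopping rule can be sketched in plain Python. This is a simplified model of the patience logic, not Keras's implementation; the helper `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. The weights from `best_epoch` are the
    ones that restore_best_weights=True would bring back.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the rule
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, for illustration only
losses = [0.80, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

With these made-up losses, the minimum is at epoch 3 and the three following epochs fail to beat it, so training stops at epoch 6 and the epoch 3 weights are kept.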

Results Interpretation

Before: Training accuracy 98%, Validation accuracy 70%, Training loss 0.05, Validation loss 0.8

After: Training accuracy 90%, Validation accuracy 86%, Training loss 0.25, Validation loss 0.35

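Dropout and a smaller Dense layer limit how closely the model can fit the training set, and early stopping halts training once validation loss stops improving. Together they narrow the gap between training and validation accuracy from 28 points (98% vs. 70%) to 4 points (90% vs. 86%), meeting both targets.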
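
A quick way to check the before and after numbers against the task is to compute the train/validation gap and test both targets. The helper names `gap_points` and `meets_target` are illustrative; the metric values are the ones reported above.

```python
before = {"train_acc": 0.98, "val_acc": 0.70}
after = {"train_acc": 0.90, "val_acc": 0.86}

def gap_points(metrics):
    """Train/validation accuracy gap in percentage points."""
    return round((metrics["train_acc"] - metrics["val_acc"]) * 100, 1)

def meets_target(metrics, min_val=0.85, max_train=0.92):
    """Task targets: validation accuracy >= 85% and training accuracy < 92%."""
    return metrics["val_acc"] >= min_val and metrics["train_acc"] < max_train

for name, metrics in (("before", before), ("after", after)):
    print(name, gap_points(metrics), meets_target(metrics))
```

A large gap is the signature of overfitting; the goal is a small gap with validation accuracy above the threshold, not the highest possible training accuracy.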

Bonus Experiment

Try replacing the simple attention layer with a multi-head attention mechanism and observe the effect on overfitting and accuracy.

💡 Hint

Use TensorFlow's MultiHeadAttention layer and adjust dropout and units accordingly.
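
One possible starting point for the bonus experiment is sketched below, assuming the same input shape as the solution. The choices of `num_heads=2`, `key_dim=8`, the dropout rates, and average pooling over the sequence are untuned assumptions, not recommended values.

```python
import tensorflow as tf
from tensorflow.keras.layers import (Input, Dense, Dropout,
                                     GlobalAveragePooling1D, MultiHeadAttention)
from tensorflow.keras.models import Model

sequence_length, feature_dim, num_classes = 10, 16, 2

inputs = Input(shape=(sequence_length, feature_dim))

# Self-attention: the sequence attends to itself (query = value = inputs).
# num_heads, key_dim and dropout are assumed starting points to tune.
attended = MultiHeadAttention(num_heads=2, key_dim=8, dropout=0.1)(inputs, inputs)

# Collapse the sequence axis so the classifier sees one vector per example
pooled = GlobalAveragePooling1D()(attended)

x = Dropout(0.3)(pooled)
x = Dense(32, activation='relu')(x)
outputs = Dense(num_classes, activation='softmax')(x)

mha_model = Model(inputs=inputs, outputs=outputs)
mha_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
mha_model.summary()
```

Multi-head attention adds more parameters than the single weight vector in SimpleAttention, so it may overfit more easily; comparing the train/validation gap of both models under the same early stopping setup is a reasonable way to judge the trade-off.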

Practice

1. What is the main purpose of the attention mechanism in NLP models?

easy

A. To reduce the number of layers in the model

B. To focus on important parts of the input data

C. To increase the size of the input data

D. To randomly shuffle the input tokens

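Solution

Step 1: Understand the role of attention. Attention assigns a weight to each part of the input so the model draws more on the parts most relevant to the prediction.

Step 2: Compare options with the concept. Only option B describes weighting the input by importance; A, C, and D concern model depth, input size, and token order.

Final Answer: B. To focus on important parts of the input data

Quick Check: Attention weights the input by relevance, which matches B.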
