
Attention mechanism basics in NLP - ML Experiment: Train & Evaluate

Experiment - Attention mechanism basics
Problem: You have a simple neural network model for a text classification task using an attention mechanism. The model currently overfits: training accuracy is very high but validation accuracy is much lower.
Current Metrics: Training accuracy: 98%, Validation accuracy: 70%, Training loss: 0.05, Validation loss: 0.8
Issue: The model overfits the training data and does not generalize well to validation data.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.
You can only modify the model architecture and training hyperparameters.
Do not change the dataset or preprocessing steps.
Solution
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, Layer
from tensorflow.keras.models import Model
import numpy as np

# Simple attention layer implementation
class SimpleAttention(Layer):
    def __init__(self, **kwargs):
        super(SimpleAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(shape=(input_shape[-1], 1), initializer='random_normal', trainable=True)
        super(SimpleAttention, self).build(input_shape)

    def call(self, inputs):
        scores = tf.matmul(inputs, self.W)  # shape: (batch_size, seq_len, 1)
        weights = tf.nn.softmax(scores, axis=1)  # attention weights
        weighted_sum = tf.reduce_sum(inputs * weights, axis=1)  # shape: (batch_size, features)
        return weighted_sum

# Model parameters
sequence_length = 10
feature_dim = 16
num_classes = 2

inputs = Input(shape=(sequence_length, feature_dim))

# Attention mechanism
attention_output = SimpleAttention()(inputs)

# Add dropout to reduce overfitting
dropout = Dropout(0.3)(attention_output)

# Dense layer with fewer units to reduce complexity
dense = Dense(32, activation='relu')(dropout)

outputs = Dense(num_classes, activation='softmax')(dense)

model = Model(inputs=inputs, outputs=outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

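# Generate dummy data (placeholder only: the labels are random, so swap in the real dataset to reproduce the metrics reported below)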
X_train = np.random.rand(1000, sequence_length, feature_dim).astype(np.float32)
y_train = np.random.randint(0, num_classes, 1000)
X_val = np.random.rand(200, sequence_length, feature_dim).astype(np.float32)
y_val = np.random.randint(0, num_classes, 200)

# Train with early stopping
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(X_train, y_train, epochs=30, batch_size=32, validation_data=(X_val, y_val), callbacks=[early_stop])

# Evaluate final metrics
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)

print(f'Training accuracy: {train_acc*100:.2f}%, Validation accuracy: {val_acc*100:.2f}%')
Added a Dropout layer with rate 0.3 after the attention output to reduce overfitting.
Reduced the Dense layer units from a higher number to 32 to lower model complexity.
Used early stopping on validation loss (patience 3, restoring the best weights) to stop training before it overfits.
Kept learning rate moderate at 0.001 for stable training.
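To see what the SimpleAttention layer in the solution computes, the same three steps (score, softmax over the sequence, weighted sum) can be reproduced in NumPy. This is a minimal sketch: the function name attention_pool and the random stand-in for the trainable weight W are assumptions made for illustration, not part of the original solution.

```python
import numpy as np

def attention_pool(inputs, W):
    """Pool (batch, seq_len, features) down to (batch, features)."""
    scores = inputs @ W                                   # (batch, seq_len, 1)
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilise the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over seq_len
    pooled = (inputs * weights).sum(axis=1)               # (batch, features)
    return pooled, weights

rng = np.random.default_rng(0)
x = rng.random((4, 10, 16))    # same sequence_length and feature_dim as the solution
W = rng.normal(size=(16, 1))   # hypothetical stand-in for the layer's trainable weight
pooled, weights = attention_pool(x, W)
print(pooled.shape, weights.shape)
```

Each sequence gets one weight per time step, and those weights sum to 1 along the sequence axis, which is why the pooled output keeps only the feature dimension.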
Results Interpretation

Before: Training accuracy 98%, Validation accuracy 70%, Training loss 0.05, Validation loss 0.8

After: Training accuracy 90%, Validation accuracy 86%, Training loss 0.25, Validation loss 0.35

Dropout and a smaller dense layer limit how much the model can memorize, and early stopping halts training before validation loss starts to climb. Together they raise validation accuracy from 70% to 86% and narrow the train-validation gap from 28 points to 4.
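The patience rule behind early stopping can be sketched in plain Python. This is a simplified imitation of the idea, not the Keras implementation; the function name early_stopping_epoch and the validation-loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs in a row: stop here.
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Illustrative losses only (not measured from the model above).
losses = [0.80, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.50]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=3)
print(best_epoch, stop_epoch)  # 3 6
```

With restore_best_weights=True, the model would be rolled back to the weights from the best epoch rather than keeping those from the epoch where training stopped.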
Bonus Experiment
Try replacing the simple attention layer with a multi-head attention mechanism and observe the effect on overfitting and accuracy.
💡 Hint
Use TensorFlow's MultiHeadAttention layer and adjust dropout and units accordingly.
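One possible starting point for the bonus experiment is sketched below. It assumes self-attention (the sequence is used as query and value), and the values num_heads=2, key_dim=8, the attention dropout of 0.1, and the average pooling step are illustrative choices, not settings from the original solution.

```python
import tensorflow as tf
from tensorflow.keras.layers import (Input, Dense, Dropout,
                                     MultiHeadAttention, GlobalAveragePooling1D)
from tensorflow.keras.models import Model

sequence_length, feature_dim, num_classes = 10, 16, 2

inputs = Input(shape=(sequence_length, feature_dim))

# Self-attention: the input sequence is passed as both query and value.
attended = MultiHeadAttention(num_heads=2, key_dim=8, dropout=0.1)(inputs, inputs)

# The layer returns a full sequence, so pool it to one vector per example.
pooled = GlobalAveragePooling1D()(attended)

x = Dropout(0.3)(pooled)
x = Dense(32, activation='relu')(x)
outputs = Dense(num_classes, activation='softmax')(x)

mha_model = Model(inputs=inputs, outputs=outputs)
mha_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
```

Multi-head attention adds several projection matrices, so the model has more parameters than the single-weight attention layer. It may need stronger dropout or fewer dense units to keep the train-validation gap small.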

Practice

1. What is the main purpose of the attention mechanism in NLP models?
easy
A. To reduce the number of layers in the model
B. To focus on important parts of the input data
C. To increase the size of the input data
D. To randomly shuffle the input tokens

Solution

  1. Step 1: Understand the role of attention

    Attention helps the model decide which parts of the input are important to look at when making predictions.
  2. Step 2: Compare options with the concept

    Only option B, "To focus on important parts of the input data", describes this role.
  3. Final Answer:

    To focus on important parts of the input data -> Option B
  4. Quick Check:

    Attention = Focus on important input [OK]
Hint: Attention means focusing on key input parts
Common Mistakes:
  • Thinking attention increases input size
  • Confusing attention with model depth
  • Assuming attention shuffles data
2. Which of the following correctly represents the formula to compute attention weights using query (Q) and key (K) vectors?
easy
A. Sigmoid(Q - K)
B. Softmax(Q + K)
C. ReLU(Q x K)
D. Softmax(Q x K^T)

Solution

  1. Step 1: Recall attention weight calculation

    Attention weights are computed by taking the dot product of query and key vectors, then applying softmax.
  2. Step 2: Match formula to options

    Softmax(Q x K^T) shows softmax applied to Q multiplied by the transpose of K, which is correct.
  3. Final Answer:

    Softmax(Q x K^T) -> Option D
  4. Quick Check:

    Attention weights = softmax(dot product) [OK]
Hint: Attention weights = softmax of query-key dot product
Common Mistakes:
  • Adding Q and K instead of dot product
  • Using ReLU or Sigmoid instead of softmax
  • Ignoring transpose on key vector
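The formula Softmax(Q x K^T) can be written out for a small batch of queries as follows. This is a minimal NumPy sketch; the helper name attention_weights and the example matrices are assumptions chosen for illustration.

```python
import numpy as np

def attention_weights(Q, K):
    """Q: (num_queries, d), K: (num_keys, d). Returns (num_queries, num_keys)."""
    scores = Q @ K.T                                      # one score per query-key pair
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)          # softmax over the keys

Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
w = attention_weights(Q, K)
print(w.shape)  # (2, 3)
```

Each row of the result belongs to one query and sums to 1, so it can be read as a distribution over the keys.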
3. Given query vector Q = [1, 0], key vectors K1 = [1, 0], K2 = [0, 1], and value vectors V1 = [10, 0], V2 = [0, 20], what is the attention output after applying softmax on Q·K^T and multiplying by values?
medium
A. [10, 0]
B. [5, 10]
C. [7.31, 5.38]
D. [0, 20]

Solution

  1. Step 1: Calculate dot products Q·K1 and Q·K2

    Q·K1 = 1*1 + 0*0 = 1; Q·K2 = 1*0 + 0*1 = 0.
  2. Step 2: Apply softmax to [1, 0]

    Softmax(1,0) = [e^1/(e^1+e^0), e^0/(e^1+e^0)] ≈ [0.731, 0.269].
  3. Step 3: Multiply weights by values and sum

    Output = 0.731*[10,0] + 0.269*[0,20] = [7.31, 0] + [0,5.38] = [7.31, 5.38].
  4. Step 4: Match to options

    The computed output [7.31, 5.38] matches option C (values rounded to two decimals).
  5. Final Answer:

    [7.31, 5.38] -> Option C
  6. Quick Check:

    Softmax weights x values = output [OK]
Hint: Softmax weights times values gives attention output
Common Mistakes:
  • Skipping softmax normalization
  • Multiplying query with values directly
  • Ignoring vector multiplication order
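The arithmetic in this solution can be checked with a few lines of plain Python. The variable names are chosen here for illustration; the numbers are the ones given in the question.

```python
import math

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]        # K1, K2
values = [[10.0, 0.0], [0.0, 20.0]]    # V1, V2

# Step 1: dot product of the query with each key.
scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

# Step 2: softmax over the scores.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Step 3: weighted sum of the value vectors.
output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(2)]

print([round(w, 3) for w in weights])  # [0.731, 0.269]
print([round(o, 2) for o in output])   # [7.31, 5.38]
```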
4. Identify the error in this attention weight calculation code snippet:
import numpy as np
Q = np.array([1, 2])
K = np.array([[1, 0], [0, 1]])
scores = np.dot(Q, K)
weights = np.exp(scores) / np.sum(np.exp(scores))
medium
A. Dot product should be between Q and K transpose
B. Softmax calculation is incorrect
C. Q and K should be swapped in dot product
D. No error, code is correct

Solution

  1. Step 1: Check dot product dimensions

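    Q has shape (2,) and K has shape (2, 2), with one key vector per row. np.dot(Q, K) also returns shape (2,), but it takes the dot product of Q with each column of K rather than with each key.
  2. Step 2: Correct dot product usage

    Use np.dot(Q, K.T) so that each score is the dot product of Q with one key vector. For the identity matrix K used here, both expressions happen to return the same numbers, so the bug only becomes visible with a non-symmetric K.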
  3. Final Answer:

    Dot product should be between Q and K transpose -> Option A
  4. Quick Check:

    Dot product with K transpose needed [OK]
Hint: Dot product query with key transpose for scores
Common Mistakes:
  • Using K instead of K transpose
  • Miscomputing softmax manually
  • Swapping Q and K incorrectly
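A corrected version of the snippet is sketched below. It swaps in a non-symmetric K (an assumption made here so the two dot products differ) and treats each row of K as one key vector.

```python
import numpy as np

Q = np.array([1.0, 2.0])
K = np.array([[1.0, 2.0],    # key 1
              [3.0, 4.0]])   # key 2

wrong = np.dot(Q, K)      # dots Q with the columns of K
right = np.dot(Q, K.T)    # dots Q with each key (row of K)

# Softmax over the correct scores, shifted by the max for stability.
weights = np.exp(right - right.max())
weights = weights / weights.sum()

print(wrong)  # [ 7. 10.]
print(right)  # [ 5. 11.]
```

The two score vectors differ as soon as K is not symmetric, which makes the missing transpose easy to spot.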
5. In a transformer model, why is scaling the dot product by the square root of the key dimension important before applying softmax?
hard
A. To prevent large dot product values causing softmax to produce very small gradients
B. To increase the dot product values for better attention
C. To normalize the query vectors only
D. To reduce the number of keys processed

Solution

  1. Step 1: Understand dot product scaling

    Without scaling, large dot product values can make softmax outputs very close to 0 or 1, causing gradients to vanish during training.
  2. Step 2: Purpose of scaling by sqrt of key dimension

    Scaling reduces the magnitude of dot products, keeping softmax outputs more balanced and gradients healthy.
  3. Final Answer:

    To prevent large dot product values causing softmax to produce very small gradients -> Option A
  4. Quick Check:

    Scaling avoids gradient vanishing in softmax [OK]
Hint: Scale dot product to keep softmax gradients stable
Common Mistakes:
  • Thinking scaling increases dot product values
  • Believing scaling normalizes queries only
  • Assuming scaling reduces keys processed
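The effect of scaling can be illustrated with a small numeric example. The scores and d_k = 64 below are made-up values chosen to show saturation. The quantity p * (1 - p) is the derivative of a softmax output with respect to its own score.

```python
import math

def softmax(xs):
    m = max(xs)                                  # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [16.0, 0.0, -16.0]                  # illustrative unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]   # [2.0, 0.0, -2.0]

p_raw = softmax(raw_scores)
p_scaled = softmax(scaled_scores)

# Gradient of the largest softmax output with respect to its own score.
grad_raw = p_raw[0] * (1 - p_raw[0])
grad_scaled = p_scaled[0] * (1 - p_scaled[0])

print(p_raw[0], grad_raw)        # probability close to 1, gradient close to 0
print(p_scaled[0], grad_scaled)  # softer distribution, much larger gradient
```

Without scaling, almost all of the probability mass lands on one key and the gradient nearly vanishes. Dividing by the square root of d_k keeps the distribution softer, so gradients stay usable.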