Experiment - Transformer architecture

Problem: We want to train a Transformer model to classify short text sentences into categories. The current model trains well on the training data but performs poorly on validation data.

Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Training loss: 0.15, Validation loss: 0.65

Issue: The model is overfitting: it fits the training data too closely and does not generalize to new data.

Your Task

Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.

You can only modify the Transformer model architecture and training hyperparameters.

Do not change the dataset or preprocessing steps.
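The target above can be written as a small check to run after each training attempt. This is only a sketch: the helper names `meets_target` and `generalization_gap` are made up for illustration, and accuracies are assumed to be fractions between 0 and 1.

```python
def meets_target(train_acc, val_acc, min_val=0.85, max_train=0.92):
    """Return True when a run satisfies the task's accuracy constraints.

    Hypothetical helper, not part of the original exercise.
    """
    return val_acc >= min_val and train_acc < max_train


def generalization_gap(train_acc, val_acc):
    """Difference between training and validation accuracy."""
    return train_acc - val_acc


baseline = (0.95, 0.70)  # values listed under "Current Metrics"
print(meets_target(*baseline))
print(round(generalization_gap(*baseline), 2))
```

A large positive gap combined with a failed check is the overfitting signature described in the problem statement.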

Solution

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, LayerNormalization, MultiHeadAttention
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

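class TransformerBlock(tf.keras.layers.Layer):
    """Encoder block: self-attention and a feed-forward network, each followed
    by dropout, a residual connection, and layer normalization."""
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        # In Keras, key_dim is the size of each head; embed_dim // num_heads
        # would give a smaller attention layer.
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)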
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation='relu'),
            Dense(embed_dim),
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# Model parameters
embed_dim = 32  # Embedding size, kept small to limit model capacity
num_heads = 2   # Number of attention heads, kept small for the same reason
ff_dim = 64     # Feed-forward network size
sequence_length = 50  # Example input length
vocab_size = 10000  # Example vocabulary size
num_classes = 5  # Number of output classes

inputs = Input(shape=(sequence_length,))
embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim, rate=0.2)(embedding_layer)
pooling = tf.keras.layers.GlobalAveragePooling1D()(transformer_block)
dropout = Dropout(0.3)(pooling)
outputs = Dense(num_classes, activation='softmax')(dropout)

model = Model(inputs=inputs, outputs=outputs)

model.compile(optimizer=Adam(learning_rate=0.0005),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Example training call (X_train, y_train, X_val, y_val must be defined)
# model.fit(X_train, y_train, batch_size=32, epochs=20, validation_data=(X_val, y_val),
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)])

Added dropout inside the Transformer block (rate 0.2) and before the output layer (rate 0.3) to reduce overfitting.

Reduced the embedding dimension to 32 and the number of attention heads to 2 to make the model smaller.

Lowered the learning rate to 0.0005 for more stable training.

Added an early stopping callback (patience 3, restoring the best weights) to stop training when validation loss stops improving.
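To make the early stopping rule concrete, here is a simplified simulation of patience-based stopping on a list of validation losses. It is an illustration only, not the Keras implementation: the function name `early_stopping_epoch` and the loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Simulate patience-based early stopping.

    Returns (stop_epoch, best_epoch), both 0-indexed. stop_epoch is the last
    epoch that runs; best_epoch is the epoch whose weights would be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # All epochs ran without triggering the stop.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.65, 0.52, 0.45, 0.40, 0.41, 0.43, 0.42, 0.44]
stop, best = early_stopping_epoch(losses, patience=3)
print(stop, best)
```

In this run training halts at epoch 6, three epochs after the best loss at epoch 3, and the weights from epoch 3 are the ones that would be kept.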

Results Interpretation

Before: Training accuracy 95%, Validation accuracy 70%, Training loss 0.15, Validation loss 0.65

After: Training accuracy 90%, Validation accuracy 87%, Training loss 0.30, Validation loss 0.40

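Adding dropout and reducing model complexity helps prevent overfitting: the gap between training and validation accuracy shrinks from 25 points to 3 points, and validation loss falls from 0.65 to 0.40. The drop in training accuracy is expected, since a regularized model memorizes less of the training set and generalizes better to new data.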
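"Model complexity" can be made concrete by counting the trainable parameters of the solution model by hand. The sketch below assumes Keras defaults (bias terms in every dense and attention projection, value size equal to key_dim) and uses the solution's settings. The function name `count_parameters` is invented for this example, and the result is worth comparing against model.summary().

```python
def count_parameters(vocab_size, embed_dim, num_heads, key_dim, ff_dim, num_classes):
    """Hand count of trainable parameters, assuming Keras default settings."""
    embedding = vocab_size * embed_dim
    # Query, key and value projections: embed_dim -> num_heads * key_dim, with bias.
    qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
    # Attention output projection: num_heads * key_dim -> embed_dim, with bias.
    attn_out = num_heads * key_dim * embed_dim + embed_dim
    # Two dense layers of the feed-forward network.
    ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
    # Two LayerNormalization layers, each with a scale and an offset vector.
    layernorm = 2 * (2 * embed_dim)
    # Final softmax classifier.
    classifier = embed_dim * num_classes + num_classes
    counts = {
        "embedding": embedding,
        "attention": qkv + attn_out,
        "ffn": ffn,
        "layernorm": layernorm,
        "classifier": classifier,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_parameters(vocab_size=10000, embed_dim=32, num_heads=2,
                          key_dim=32, ff_dim=64, num_classes=5)
for name, value in counts.items():
    print(f"{name}: {value}")
```

With these settings the embedding table holds the large majority of the parameters, so the embedding dimension is the setting with the most influence on model size.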

Bonus Experiment

Try using learning rate scheduling to gradually reduce the learning rate during training and observe its effect on validation accuracy.

💡 Hint

Use the LearningRateScheduler or ReduceLROnPlateau callback from tf.keras.callbacks to adjust the learning rate during training.
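One possible starting point is a step-decay schedule. The function below follows the (epoch, lr) signature that LearningRateScheduler passes to its schedule function. The name `step_decay`, the halving factor, and the 5-epoch interval are arbitrary choices for illustration, not tuned values. The dry run needs no TensorFlow; the callback wiring is shown in comments because it depends on the model and data defined earlier.

```python
def step_decay(epoch, lr):
    """Halve the learning rate every 5 epochs; otherwise leave it unchanged."""
    if epoch > 0 and epoch % 5 == 0:
        return lr * 0.5
    return lr


# Dry run: follow the rate over 12 epochs, starting from the solution's 0.0005.
lr = 0.0005
history = []
for epoch in range(12):
    lr = step_decay(epoch, lr)
    history.append(lr)
print(history[0], history[5], history[10])

# Wiring into training (requires TensorFlow and the model defined above):
# callbacks = [
#     tf.keras.callbacks.LearningRateScheduler(step_decay),
#     # Alternative that reacts to validation loss instead of the epoch count:
#     # tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
#     #                                      patience=2, min_lr=1e-5),
# ]
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=callbacks)
```

If ReduceLROnPlateau is combined with early stopping, its patience probably needs to be shorter than the early stopping patience, or training may stop before the rate is ever reduced.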