
Transformer architecture in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual · intermediate
What is the main purpose of the self-attention mechanism in a Transformer?

In the Transformer model, the self-attention mechanism helps the model to:

A. Focus on different parts of the input sequence to understand relationships between words.
B. Reduce the size of the input data by compressing it into a smaller vector.
C. Generate random noise to improve model robustness during training.
D. Sort the input words in alphabetical order before processing.
💡 Hint

Think about how the model learns connections between words regardless of their position.
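For intuition, the mechanism the hint alludes to can be sketched as plain scaled dot-product self-attention, where every position mixes information from every other position. This is a minimal single-head illustration, not the full multi-head layer used in practice:

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    # x: (batch, seq_len, d); here Q = K = V = x for simplicity
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ x                           # each position is a weighted mix of all positions

x = torch.rand(2, 5, 16)
out = self_attention(x)
print(out.shape)  # torch.Size([2, 5, 16])
```

Note that every output position can draw on every input position, regardless of distance; that is what lets the model capture relationships between words anywhere in the sequence.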

Predict Output · intermediate
Output shape after multi-head attention layer

Given the following code snippet using PyTorch, what is the shape of the output tensor?

import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

x = torch.rand(batch_size, seq_len, embed_dim)
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)

# PyTorch MultiheadAttention expects input shape (seq_len, batch_size, embed_dim)
x_t = x.transpose(0, 1)
out, _ = mha(x_t, x_t, x_t)

output_shape = out.shape
A. (2, 16, 5)
B. (5, 2, 16)
C. (2, 5, 16)
D. (5, 16, 2)
💡 Hint

Check the input and output shapes expected by PyTorch's MultiheadAttention.
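As a related aside: recent PyTorch versions let you pass `batch_first=True` to `nn.MultiheadAttention`, which avoids the transpose in the snippet above entirely. A sketch with the same illustrative sizes:

```python
import torch
import torch.nn as nn

x = torch.rand(2, 5, 16)  # (batch_size, seq_len, embed_dim)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
out, _ = mha(x, x, x)     # no transpose needed; batch dim stays first
print(out.shape)          # torch.Size([2, 5, 16])
```

With `batch_first=True` the output shape matches the input layout, which is why keeping track of which convention a layer uses matters for questions like this one.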

Hyperparameter · advanced
Choosing the number of attention heads

Which of the following is a valid reason to increase the number of attention heads in a Transformer model?

A. To reduce the total number of parameters in the model for faster training.
B. To convert the model from a Transformer to a convolutional neural network.
C. To ensure the model only focuses on the first few words of the input sequence.
D. To allow the model to attend to information from multiple representation subspaces at different positions.
💡 Hint

Think about how multiple heads help the model understand different aspects of the input.
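A practical constraint worth remembering when choosing this hyperparameter: PyTorch's `nn.MultiheadAttention` requires `embed_dim` to be divisible by `num_heads`, since each head operates on a subspace of size `embed_dim // num_heads`. A quick sketch with illustrative values:

```python
import torch.nn as nn

embed_dim, num_heads = 16, 4
head_dim = embed_dim // num_heads   # each head attends in a 4-dim subspace
assert embed_dim % num_heads == 0   # required by nn.MultiheadAttention

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
# embed_dim=16 with num_heads=3 would fail at construction time,
# because 16 is not divisible by 3
```

So more heads means more, smaller subspaces rather than more parameters overall; the total projection size stays `embed_dim`.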

Metrics · advanced
Interpreting training loss in Transformer models

While training a Transformer for language modeling, the training loss decreases steadily, but the validation loss starts increasing after some epochs. What does this indicate?

A. The model is overfitting the training data and not generalizing well to new data.
B. The model is underfitting and needs more training epochs.
C. The training data is corrupted and causing unstable loss values.
D. The optimizer is not updating the model weights correctly.
💡 Hint

Consider what it means when training loss improves but validation loss worsens.
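A common remedy for the pattern this question describes is early stopping on validation loss. A minimal sketch of the stopping rule (the loss values below are made up for illustration):

```python
def should_stop(val_losses, patience=3):
    """Stop when validation loss hasn't improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])
    return all(v >= best for v in val_losses[-patience:])

# Validation loss bottoms out, then starts rising: the overfitting signature
val_losses = [2.1, 1.8, 1.6, 1.55, 1.6, 1.65, 1.7]
print(should_stop(val_losses))  # True
```

Stopping at the validation-loss minimum keeps the checkpoint that generalizes best, rather than the one that fits the training set best.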

🔧 Debug · expert
Identifying error in Transformer positional encoding implementation

Consider this simplified code snippet for positional encoding in a Transformer. What error will this code raise when run?

import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pos_enc = positional_encoding(10, 7)
A. SyntaxError because of missing colon in function definition
B. TypeError because torch.arange returns a list, not a tensor
C. RuntimeError due to shape mismatch when assigning to pe[:, 1::2]
D. No error, code runs correctly and returns positional encoding tensor
💡 Hint

Check the shapes of slices pe[:, 0::2] and pe[:, 1::2] when d_model is odd.
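For reference, one way to make the snippet safe for odd `d_model` is to truncate `div_term` for the cosine columns, since `pe[:, 1::2]` has `d_model // 2` columns while `pe[:, 0::2]` has one more when `d_model` is odd. A sketch of that fix (not the only possible one):

```python
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)                   # ceil(d_model/2) columns
    pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])   # floor(d_model/2) columns
    return pe

pos_enc = positional_encoding(10, 7)  # no longer raises for odd d_model
print(pos_enc.shape)  # torch.Size([10, 7])
```

For even `d_model` the slice `div_term[: d_model // 2]` is the whole tensor, so the fix changes nothing in the common case.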