
Transformer encoder in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output shape of Transformer encoder layer
Given the following PyTorch code snippet, what is the shape of the output tensor after passing through the Transformer encoder layer?
PyTorch
import torch
import torch.nn as nn

batch_size = 4
seq_length = 10
embedding_dim = 32

x = torch.rand(batch_size, seq_length, embedding_dim)
encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=4, batch_first=True)
output = encoder_layer(x)
print(output.shape)
A. torch.Size([10, 4, 32])
B. torch.Size([4, 10, 32])
C. torch.Size([4, 32, 10])
D. torch.Size([10, 32, 4])
💡 Hint
Remember that with batch_first=True, nn.TransformerEncoderLayer expects input of shape (batch_size, seq_length, embedding_dim) and returns a tensor of the same shape.
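As the hint says, the encoder layer maps its input to an output of identical shape. A minimal sketch to confirm this, using the same dimensions as the snippet above:

```python
import torch
import torch.nn as nn

# Same dimensions as the question's snippet.
x = torch.rand(4, 10, 32)  # (batch_size, seq_length, embedding_dim)
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
out = layer(x)
print(out.shape)  # torch.Size([4, 10, 32]): the input shape is preserved
```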
Model Choice (intermediate)
Choosing the number of attention heads
You want to create a Transformer encoder layer with embedding dimension 64. Which choice of number of attention heads is valid?
A. 8
B. 9
C. 10
D. 7
💡 Hint
The embedding dimension must be divisible by the number of attention heads.
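The divisibility rule in the hint can be checked directly: PyTorch rejects a head count that does not evenly divide d_model when the layer is constructed. A small sketch trying each answer choice for d_model=64 (the exception type caught here is an assumption about PyTorch internals, so both common cases are handled):

```python
import torch.nn as nn

embedding_dim = 64
valid = {}
for nhead in (7, 8, 9, 10):
    try:
        nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=nhead)
        valid[nhead] = True
    except (AssertionError, ValueError):
        # PyTorch rejects head counts where embedding_dim % nhead != 0
        valid[nhead] = False
print(valid)  # {7: False, 8: True, 9: False, 10: False}
```

Only 8 divides 64 evenly among the four choices (64 / 8 = 8 dimensions per head).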
Hyperparameter (advanced)
Effect of increasing dropout in Transformer encoder
What is the most likely effect of increasing the dropout rate in a Transformer encoder layer during training?
A. It speeds up training by skipping computations.
B. It increases model capacity by adding more neurons.
C. It reduces overfitting by randomly zeroing some activations, improving generalization.
D. It always causes the model to underfit and perform worse.
💡 Hint
Dropout randomly disables parts of the network during training.
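The hint's point, that dropout only disables activations during training, can be observed directly: in train mode repeated forward passes give different outputs, while in eval mode the layer is deterministic. A minimal sketch with an illustrative dropout rate of 0.5:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, dropout=0.5,
                                   batch_first=True)
x = torch.rand(2, 5, 32)

layer.train()  # dropout active: repeated passes on the same input differ
train_differs = not torch.equal(layer(x), layer(x))

layer.eval()   # dropout disabled: the output is deterministic
with torch.no_grad():
    eval_same = torch.equal(layer(x), layer(x))

print(train_differs, eval_same)  # True True
```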
🔧 Debug (advanced)
Identifying error in Transformer encoder input shape
What happens when this code runs the Transformer encoder layer?
PyTorch
import torch
import torch.nn as nn

x = torch.rand(10, 4, 32)  # shape (seq_length, batch_size, embedding_dim)
encoder_layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
output = encoder_layer(x)
print(output.shape)
A. RuntimeError: Expected input of shape (batch_size, seq_length, embedding_dim)
B. No error, output shape is torch.Size([10, 4, 32])
C. TypeError: input tensor must be 2D
D. ValueError: nhead must divide d_model
💡 Hint
Check the expected input shape for nn.TransformerEncoderLayer in PyTorch.
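A subtlety worth knowing: the layer only checks that the last dimension matches d_model. With batch_first=True it has no way to tell that the leading axes were swapped, so it silently reinterprets them. A sketch of what actually happens with the question's tensor:

```python
import torch
import torch.nn as nn

x = torch.rand(10, 4, 32)  # author intended (seq_length, batch_size, embedding_dim)
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
out = layer(x)
# No exception: with batch_first=True the layer simply reads this tensor
# as batch_size=10, seq_length=4, since only the last dim must equal d_model.
print(out.shape)  # torch.Size([10, 4, 32])
```

This is a common silent bug: the code runs, but attention is computed over the wrong axis.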
🧠 Conceptual (expert)
Why use multi-head attention in Transformer encoder?
What is the main advantage of using multi-head attention instead of a single attention head in a Transformer encoder?
A. It reduces the total number of parameters in the model.
B. It prevents the model from learning positional information.
C. It makes the model faster by parallelizing computations across heads.
D. It allows the model to jointly attend to information from different representation subspaces at different positions.
💡 Hint
Think about how multiple attention heads help the model understand different aspects of the input.
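The hint's idea can be made concrete with nn.MultiheadAttention: d_model is split evenly across the heads, and each head computes its own attention map over a lower-dimensional subspace. A minimal sketch with illustrative dimensions (embed_dim=32, 4 heads, so 8 dimensions per head):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.rand(2, 5, 32)  # (batch, seq, embed)

# average_attn_weights=False keeps one attention map per head
out, weights = mha(x, x, x, average_attn_weights=False)
print(out.shape)      # torch.Size([2, 5, 32])
print(weights.shape)  # torch.Size([2, 4, 5, 5]): one 5x5 map per head
print(mha.head_dim)   # 8, i.e. 32 / 4 dimensions per head
```

Each of the four 5x5 maps can specialize, for example one head tracking adjacent tokens while another attends to distant ones, which is the "different representation subspaces at different positions" advantage.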