
Transformer encoder in PyTorch - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
Output shape of Transformer encoder layer
Given the following PyTorch code snippet, what is the shape of the output tensor after passing through the Transformer encoder layer?
PyTorch
import torch
import torch.nn as nn

batch_size = 4
seq_length = 10
embedding_dim = 32

x = torch.rand(batch_size, seq_length, embedding_dim)
encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=4, batch_first=True)
output = encoder_layer(x)
print(output.shape)
A. torch.Size([10, 4, 32])
B. torch.Size([4, 10, 32])
C. torch.Size([4, 32, 10])
D. torch.Size([10, 32, 4])
💡 Hint
Remember that with batch_first=True, nn.TransformerEncoderLayer expects input of shape (batch_size, seq_length, embedding_dim) and returns a tensor of the same shape.
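As the hint says, the encoder layer maps its input to an output of identical shape. A minimal sketch to confirm this, using the same dimensions as the snippet above:

```python
import torch
import torch.nn as nn

# Same dimensions as the question's snippet.
x = torch.rand(4, 10, 32)  # (batch_size, seq_length, embedding_dim)
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
out = layer(x)
print(out.shape)  # torch.Size([4, 10, 32]): the input shape is preserved
```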
Model Choice (intermediate)
Choosing the number of attention heads
You want to create a Transformer encoder layer with embedding dimension 64. Which choice of number of attention heads is valid?
A. 8
B. 9
C. 10
D. 7
💡 Hint
The embedding dimension must be divisible by the number of attention heads.
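The divisibility rule in the hint can be checked directly: PyTorch rejects a head count that does not evenly divide d_model when the layer is constructed. A small sketch trying each answer choice for d_model=64 (the exception type caught here is an assumption about PyTorch internals, so both common cases are handled):

```python
import torch.nn as nn

embedding_dim = 64
valid = {}
for nhead in (7, 8, 9, 10):
    try:
        nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=nhead)
        valid[nhead] = True
    except (AssertionError, ValueError):
        # PyTorch rejects head counts where embedding_dim % nhead != 0
        valid[nhead] = False
print(valid)  # {7: False, 8: True, 9: False, 10: False}
```

Only 8 divides 64 evenly among the four choices (64 / 8 = 8 dimensions per head).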
Hyperparameter (advanced)
Effect of increasing dropout in Transformer encoder
What is the most likely effect of increasing the dropout rate in a Transformer encoder layer during training?
A. It speeds up training by skipping computations.
B. It increases model capacity by adding more neurons.
C. It reduces overfitting by randomly zeroing some activations, improving generalization.
D. It always causes the model to underfit and perform worse.
💡 Hint
Dropout randomly disables parts of the network during training.
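The hint's point, that dropout only disables activations during training, can be observed directly: in train mode repeated forward passes give different outputs, while in eval mode the layer is deterministic. A minimal sketch with an illustrative dropout rate of 0.5:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, dropout=0.5,
                                   batch_first=True)
x = torch.rand(2, 5, 32)

layer.train()  # dropout active: repeated passes on the same input differ
train_differs = not torch.equal(layer(x), layer(x))

layer.eval()   # dropout disabled: the output is deterministic
with torch.no_grad():
    eval_same = torch.equal(layer(x), layer(x))

print(train_differs, eval_same)  # True True
```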
🔧 Debug (advanced)
Identifying error in Transformer encoder input shape
What happens when this code runs the Transformer encoder layer?
PyTorch
import torch
import torch.nn as nn

x = torch.rand(10, 4, 32)  # shape (seq_length, batch_size, embedding_dim)
encoder_layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
output = encoder_layer(x)
print(output.shape)
A. RuntimeError: Expected input of shape (batch_size, seq_length, embedding_dim)
B. No error, output shape is torch.Size([10, 4, 32])
C. TypeError: input tensor must be 2D
D. ValueError: nhead must divide d_model
💡 Hint
Check the expected input shape for nn.TransformerEncoderLayer in PyTorch.
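A subtlety worth knowing: the layer only checks that the last dimension matches d_model. With batch_first=True it has no way to tell that the leading axes were swapped, so it silently reinterprets them. A sketch of what actually happens with the question's tensor:

```python
import torch
import torch.nn as nn

x = torch.rand(10, 4, 32)  # author intended (seq_length, batch_size, embedding_dim)
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
out = layer(x)
# No exception: with batch_first=True the layer simply reads this tensor
# as batch_size=10, seq_length=4, since only the last dim must equal d_model.
print(out.shape)  # torch.Size([10, 4, 32])
```

This is a common silent bug: the code runs, but attention is computed over the wrong axis.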
🧠 Conceptual (expert)
Why use multi-head attention in Transformer encoder?
What is the main advantage of using multi-head attention instead of a single attention head in a Transformer encoder?
A. It reduces the total number of parameters in the model.
B. It prevents the model from learning positional information.
C. It makes the model faster by parallelizing computations across heads.
D. It allows the model to jointly attend to information from different representation subspaces at different positions.
💡 Hint
Think about how multiple attention heads help the model understand different aspects of the input.
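The hint's idea can be made concrete with nn.MultiheadAttention: d_model is split evenly across the heads, and each head computes its own attention map over a lower-dimensional subspace. A minimal sketch with illustrative dimensions (embed_dim=32, 4 heads, so 8 dimensions per head):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.rand(2, 5, 32)  # (batch, seq, embed)

# average_attn_weights=False keeps one attention map per head
out, weights = mha(x, x, x, average_attn_weights=False)
print(out.shape)      # torch.Size([2, 5, 32])
print(weights.shape)  # torch.Size([2, 4, 5, 5]): one 5x5 map per head
print(mha.head_dim)   # 8, i.e. 32 / 4 dimensions per head
```

Each of the four 5x5 maps can specialize, for example one head tracking adjacent tokens while another attends to distant ones, which is the "different representation subspaces at different positions" advantage.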