Which statement best describes the role of the attention mechanism in a Transformer model?
Think about how the model decides which words to pay attention to when translating a sentence.
The attention mechanism helps the Transformer model weigh the importance of different input tokens dynamically for each output token, enabling better context understanding.
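To make the answer concrete, here is a minimal sketch of scaled dot-product attention in NumPy; the shapes and values are illustrative only, not tied to any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Score every query against every key, scaled by sqrt(d_k)
    # to keep the softmax inputs in a stable range.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into per-token weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of all value vectors —
    # this is the "dynamic weighting" of input tokens.
    return weights @ V, weights

# Toy example: 3 tokens with embedding dim 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape)        # (3, 4)
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```

Each row of `w` shows how strongly one token attends to every other token, which is exactly the per-output-token weighting the answer describes.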
Given an input tensor of shape (batch_size=4, sequence_length=10, embedding_dim=64) passed through a Transformer encoder layer with the same embedding dimension, what will be the shape of the output tensor?
input_shape = (4, 10, 64)
# Transformer encoder layer with embedding_dim=64
output_shape = (4, 10, 64)
The Transformer encoder preserves the sequence length and embedding dimension in its output.
The Transformer encoder layer outputs a tensor with the same batch size, sequence length, and embedding dimension as the input.
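A quick way to see why the shape is preserved is to trace a minimal encoder layer by hand. The sketch below uses identity attention projections and a small feed-forward block in NumPy (a simplification of a real encoder layer, with assumed hidden size 256) purely to verify that every step maps (4, 10, 64) back to (4, 10, 64).

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 4, 10, 64

def encoder_layer(x, W_ff1, W_ff2):
    # Self-attention scores: (batch, seq_len, seq_len)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    x = x + weights @ x                        # residual around attention
    x = x + np.maximum(x @ W_ff1, 0) @ W_ff2   # residual around feed-forward
    return x

x = rng.normal(size=(batch, seq_len, d_model))
W_ff1 = rng.normal(size=(d_model, 256)) * 0.01  # expand to hidden size
W_ff2 = rng.normal(size=(256, d_model)) * 0.01  # project back to d_model
y = encoder_layer(x, W_ff1, W_ff2)
print(y.shape)  # (4, 10, 64) — identical to the input shape
```

The feed-forward block temporarily expands the last dimension, but projects back to `d_model`, so the output tensor always matches the input shape.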
In a Transformer model, if the embedding dimension is 128, which choice of number of attention heads is valid and why?
Each attention head processes an equal-sized slice of the embedding dimension.
The embedding dimension must be divisible by the number of heads so each head gets an equal-sized slice. 128 divided by 8 is 16, which is valid.
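The divisibility rule can be checked directly; this small snippet tries a few candidate head counts against an embedding dimension of 128 (the candidate values are arbitrary examples).

```python
embedding_dim = 128

# A head count is valid only if it divides the embedding dimension evenly;
# head_dim is the slice of the embedding each head receives.
for num_heads in (6, 8, 12, 16):
    valid = embedding_dim % num_heads == 0
    head_dim = embedding_dim // num_heads if valid else None
    print(f"heads={num_heads:2d} valid={valid} head_dim={head_dim}")
# 8 and 16 divide 128 evenly (head_dim 16 and 8); 6 and 12 do not.
```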
Which metric is most appropriate to evaluate a Transformer model trained for a multi-class text classification task?
Think about a task where the model picks one class label from many possible classes.
Accuracy is appropriate for multi-class classification: it measures the fraction of predictions that exactly match the true class labels.
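As a minimal sketch, accuracy for a multi-class task reduces to comparing predicted and true label arrays; the labels below are made up for illustration.

```python
import numpy as np

# Hypothetical labels for a 4-class task (illustrative values only).
y_true = np.array([0, 2, 1, 3, 2, 0])
y_pred = np.array([0, 2, 2, 3, 2, 1])

# Accuracy = fraction of predictions matching the true labels.
accuracy = (y_pred == y_true).mean()
print(accuracy)  # 4 of 6 correct -> 0.666...
```

In practice `y_pred` would come from an argmax over the model's per-class logits, but the metric itself is just this element-wise comparison.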
A Transformer model training suddenly diverges with loss becoming NaN after a few epochs. Which of the following is the most likely cause?
Consider what causes gradients to become unstable during training.
A very high learning rate can cause gradients to explode: parameter updates overshoot, values overflow, and the loss becomes NaN, destabilizing training.
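The divergence mechanism can be shown on a toy problem. This sketch runs plain gradient descent on f(w) = w² with two learning rates (both values chosen only for illustration): a small one converges, while a too-large one makes each update overshoot so the parameter's magnitude grows without bound — and once values overflow in a real network, operations like inf - inf produce NaN losses.

```python
def train(lr, steps=200):
    # Gradient descent on f(w) = w^2, whose gradient is 2w.
    # Each update is w <- w * (1 - 2*lr), so lr > 1 flips the sign
    # and grows |w| every step instead of shrinking it.
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(train(0.1))   # shrinks toward 0: training converges
print(train(1.5))   # |w| doubles every step: training diverges
```

The same overshoot-and-grow dynamic in a deep network drives activations and gradients to overflow, which is why the loss turns NaN rather than merely plateauing.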