Which of the following best explains why attention mechanisms improved deep learning models?
Think about how humans pay attention to important details when processing information.
Attention helps models weigh different parts of the input differently, allowing better understanding of context and relationships.
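A minimal sketch of the idea above (a toy example, not from the source): instead of averaging input features uniformly, attention combines them with learned weights, so important parts contribute more. The inputs and weights below are made-up illustrative numbers.

```python
# Toy illustration: attention re-weights inputs instead of averaging uniformly.
inputs = [0.9, 0.1, 0.8]          # hypothetical feature scores for three tokens
attn_weights = [0.7, 0.1, 0.2]    # hypothetical attention weights (sum to 1)

uniform_avg = sum(inputs) / len(inputs)                     # plain mean ≈ 0.6
attended = sum(w * x for w, x in zip(attn_weights, inputs)) # weighted sum ≈ 0.8

print(uniform_avg, attended)
```

The attended result leans toward the highly weighted first token, which is exactly the "weigh different parts differently" behavior described in the answer.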
What is the output tensor after computing scaled dot-product attention scores for the given query and key tensors?
import torch
import torch.nn.functional as F

query = torch.tensor([[1., 0., 1.]])                # shape (1, 3)
key = torch.tensor([[1., 0., 0.], [0., 1., 1.]])    # shape (2, 3)

# Scale the dot products by sqrt(d_k) = sqrt(3)
scores = torch.matmul(query, key.T) / (3 ** 0.5)
output = F.softmax(scores, dim=1)
print(output)
Recall that softmax converts scores into probabilities summing to 1.
Each query-key dot product equals 1, so both scaled scores are 1/sqrt(3); softmax of identical scores gives equal probabilities, so the output is tensor([[0.5, 0.5]]).
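The result of the question's snippet can be checked by hand in pure Python (no torch needed), following the same scaled dot-product steps:

```python
import math

# Manual check of the question above: q = [1, 0, 1], k1 = [1, 0, 0], k2 = [0, 1, 1].
q = [1.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

d_k = len(q)
# Both dot products equal 1, so both scaled scores equal 1/sqrt(3).
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]

# Softmax: identical scores yield equal probabilities.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]
print(weights)  # [0.5, 0.5]
```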
Which model architecture best uses attention to handle long-range dependencies in sequences?
Think about which model can directly relate all parts of a sequence to each other.
Transformers use self-attention to connect all sequence positions, capturing long-range dependencies effectively.
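A brief sketch of this using PyTorch's built-in layer: in self-attention, query, key, and value are all the same sequence, so the attention weight matrix relates every position to every other in one step. The dimensions (embed_dim=8, num_heads=2, seq_len=5) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 5, 8)  # (batch, seq_len, embed_dim)

# Self-attention: the same tensor serves as query, key, and value.
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([1, 5, 8])
print(weights.shape)  # torch.Size([1, 5, 5]): each position attends to all 5
```

The (5, 5) weight matrix is what lets position 0 attend directly to position 4 with no intervening recurrence, which is why long-range dependencies are easy to capture.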
What is the typical effect of increasing the number of attention heads in a multi-head attention layer?
Consider how multiple heads can look at different parts or aspects of the input.
Multiple attention heads let the model capture diverse features and relationships in the input data.
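A hedged sketch of the point above: in PyTorch's implementation the embedding is split across heads (head_dim = embed_dim // num_heads), so each head computes its own attention map over a smaller subspace while the output size stays fixed. The sizes below are illustrative.

```python
import torch
import torch.nn as nn

embed_dim = 16
x = torch.randn(1, 4, embed_dim)  # (batch, seq_len, embed_dim)

for num_heads in (1, 2, 4):
    attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
    # average_attn_weights=False returns one attention map per head
    out, w = attn(x, x, x, average_attn_weights=False)
    print(num_heads, out.shape, w.shape)  # w: (batch, num_heads, seq, seq)
```

More heads means more independent attention maps over the same sequence, which is how the heads can specialize on different relationships.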
Given a trained attention model, which metric best helps quantify how focused the attention distribution is on a few key inputs?
Think about a measure that shows how spread out or concentrated a probability distribution is.
Entropy measures uncertainty; lower entropy means attention is focused on fewer inputs, aiding explainability.
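The entropy measure above can be computed directly from an attention row. This is a small sketch with made-up weight vectors: a near-one-hot (focused) distribution has low entropy, while a uniform (diffuse) distribution attains the maximum, ln(n).

```python
import torch

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy of a probability vector; eps guards against log(0)."""
    return -(weights * (weights + eps).log()).sum(dim=-1)

focused = torch.tensor([0.97, 0.01, 0.01, 0.01])  # nearly one-hot
diffuse = torch.tensor([0.25, 0.25, 0.25, 0.25])  # uniform over 4 inputs

print(attention_entropy(focused))  # low: mass concentrated on one input
print(attention_entropy(diffuse))  # maximum for 4 inputs: ln(4) ≈ 1.386
```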