NLP · ~20 mins

Attention mechanism basics in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
What is the main purpose of the attention mechanism in neural networks?
Choose the best explanation for why attention mechanisms are used in models like transformers.
A. To allow the model to focus on different parts of the input sequence when producing each output element.
B. To reduce the size of the input data by compressing it into a fixed vector.
C. To increase the number of layers in the neural network for deeper learning.
D. To randomly drop some neurons during training to prevent overfitting.
💡 Hint
Think about how the model decides which words or tokens are important when generating output.
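The idea in the hint can be sketched directly: attention turns query/key similarities into a per-position probability distribution over the input, then uses it to weight the values. A minimal sketch with toy sizes (the dimensions here are assumptions for illustration, not part of the problem):

```python
import torch
import torch.nn.functional as F

# Toy scaled dot-product attention: one batch, 4 positions, dimension 8.
torch.manual_seed(0)
seq_len, d_k = 4, 8
Q = torch.randn(1, seq_len, d_k)
K = torch.randn(1, seq_len, d_k)
V = torch.randn(1, seq_len, d_k)

scores = torch.bmm(Q, K.transpose(1, 2)) / (d_k ** 0.5)  # (1, 4, 4) similarities
weights = F.softmax(scores, dim=-1)                      # distribution over input positions
output = torch.bmm(weights, V)                           # weighted mix of values

print(weights.sum(dim=-1))  # each row of weights sums to 1
```

Each output position gets its own weighting over the input, which is exactly the "focus on different parts of the input" behavior the question asks about.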
โ“ Predict Output
intermediate
What is the output shape of the attention scores matrix?
Given a query matrix Q of shape (batch_size, seq_len_q, d_k) and a key matrix K of shape (batch_size, seq_len_k, d_k), what is the shape of the attention scores computed as Q × Kᵀ?
import torch

batch_size = 2
seq_len_q = 3   # number of query positions
seq_len_k = 4   # number of key positions
d_k = 5         # shared query/key dimension

Q = torch.randn(batch_size, seq_len_q, d_k)
K = torch.randn(batch_size, seq_len_k, d_k)

# Batched matrix multiply over the last two dimensions
attention_scores = torch.bmm(Q, K.transpose(1, 2))
print(attention_scores.shape)
A. (2, 3, 4)
B. (2, 5, 5)
C. (3, 4, 5)
D. (2, 4, 3)
💡 Hint
Remember that matrix multiplication between Q and Kᵀ involves the last two dimensions.
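The shape rule extends to the full attention computation: the scores matrix has one entry per query/key pair, so after a softmax over the key axis and a multiply by V, the result has one row per query and the value dimension as its last axis. A sketch with toy sizes (all dimensions here are assumptions, chosen so that d_v differs from d_k):

```python
import torch
import torch.nn.functional as F

batch_size, seq_len_q, seq_len_k, d_k, d_v = 2, 3, 4, 5, 6
Q = torch.randn(batch_size, seq_len_q, d_k)
K = torch.randn(batch_size, seq_len_k, d_k)
V = torch.randn(batch_size, seq_len_k, d_v)

scores = torch.bmm(Q, K.transpose(1, 2)) / (d_k ** 0.5)
weights = F.softmax(scores, dim=-1)   # normalize over the key axis
context = torch.bmm(weights, V)       # (batch_size, seq_len_q, d_v)

print(context.shape)  # torch.Size([2, 3, 6])
```

Tracking shapes this way, (b, q, d_k) × (b, d_k, k) → (b, q, k) for the scores, then (b, q, k) × (b, k, d_v) → (b, q, d_v) for the context, is a reliable way to sanity-check any attention implementation.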
โ“ Model Choice
advanced
Which model architecture introduced the scaled dot-product attention?
Identify the model that first used scaled dot-product attention as a core component.
A. Autoencoder
B. Recurrent Neural Network (RNN)
C. Convolutional Neural Network (CNN)
D. Transformer
💡 Hint
This model replaced recurrence with attention mechanisms.
โ“ Hyperparameter
advanced
What is the effect of the scaling factor 1/√d_k in scaled dot-product attention?
Why do we multiply the dot products by 1 divided by the square root of the key dimension (d_k) in attention calculations?
A. To increase the dot product values so the model focuses more on important tokens.
B. To reduce the sequence length by scaling down the keys.
C. To prevent the dot products from growing too large and causing softmax to have extremely small gradients.
D. To normalize the input embeddings before attention calculation.
💡 Hint
Think about how large dot products affect the softmax function.
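The hint's point can be demonstrated numerically: when logits are moderate, softmax spreads probability mass across entries, but when they grow large (as unscaled dot products do when d_k is large), softmax saturates toward a one-hot vector, and the gradient for every non-max entry collapses toward zero. A toy illustration (the logit values are assumptions chosen for contrast):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 3.0])

p_scaled = F.softmax(logits, dim=0)          # moderate logits: soft distribution
p_unscaled = F.softmax(logits * 100, dim=0)  # mimics large unscaled dot products

print(p_scaled)    # mass spread across all entries
print(p_unscaled)  # essentially one-hot: near-zero gradients elsewhere
```

Dividing by √d_k keeps the dot products in the moderate regime, so the softmax stays differentiable in a useful sense throughout training.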
โ“ Metrics
expert
How does the attention mechanism affect model interpretability?
Which statement best describes the relationship between attention weights and model interpretability?
A. Attention weights reduce model accuracy but improve interpretability.
B. Attention weights can offer insights but do not guarantee faithful explanations of model behavior.
C. Attention weights are unrelated to interpretability and only affect training speed.
D. Attention weights always provide a perfect explanation of model decisions.
💡 Hint
Consider recent research findings on attention as explanation.