
Attention mechanism in depth in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of scaled dot-product attention calculation
What is the output of the following scaled dot-product attention calculation code snippet?
import torch
import torch.nn.functional as F

query = torch.tensor([[1.0, 0.0, 1.0]])  # shape (1, 3)
key = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])  # shape (2, 3)
value = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # shape (2, 2)

# Calculate attention scores
scores = torch.matmul(query, key.T) / (key.shape[1] ** 0.5)  # scale by sqrt(d_k)
weights = F.softmax(scores, dim=1)

# Weighted sum of values
output = torch.matmul(weights, value)
print(output)
A. [[1.4793, 2.4793]]
B. [[3.0, 4.0]]
C. [[1.5, 2.5]]
D. [[2.0, 3.0]]
💡 Hint
Recall that scaled dot-product attention uses softmax on scaled scores to get weights.
🧠 Conceptual
intermediate
Purpose of the 'key' in attention mechanism
In the attention mechanism, what is the main role of the 'key' vectors?
A. They normalize the query vectors before similarity calculation.
B. They are used to generate the final output directly without matching.
C. They represent the information to be retrieved by matching with the query.
D. They store the positional encoding of the input sequence.
💡 Hint
Think about how queries find relevant information in keys.
Hyperparameter
advanced
Effect of increasing number of attention heads
What is the main effect of increasing the number of attention heads in a multi-head attention model?
A. It eliminates the need for positional encoding.
B. It reduces the total number of parameters in the model.
C. It increases the sequence length the model can process without changing computation.
D. It allows the model to jointly attend to information from different representation subspaces at different positions.
💡 Hint
Think about how multiple heads help the model see different aspects of the input.
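For reference, the mechanics of splitting attention into heads can be sketched as follows. This is a minimal NumPy illustration, not a full transformer layer: the function name `multi_head_self_attention` is hypothetical, the input is random, and the learned projection matrices that a real model applies before and after the split are deliberately omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads):
    """Hypothetical helper: split features into heads, attend per head, concatenate.

    Learned projections (W_Q, W_K, W_V, W_O) are omitted, so each head
    simply attends over its own slice of the feature dimension.
    """
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    heads = X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # One (seq_len, seq_len) score matrix per head, scaled by sqrt(d_head)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    out = weights @ heads                      # (num_heads, seq_len, d_head)
    # Concatenate the heads back to (seq_len, d_model)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model), weights

rng = np.random.default_rng(0)                 # illustrative random input
X = rng.standard_normal((4, 8))                # 4 tokens, d_model = 8
out, weights = multi_head_self_attention(X, num_heads=2)
print(out.shape, weights.shape)                # (4, 8) (2, 4, 4)
```

Each head produces its own attention pattern over the same four positions, which is why `weights` has a leading dimension equal to the number of heads.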
Metrics
advanced
Interpreting attention weights distribution
If an attention layer outputs very uniform attention weights across all positions, what does this typically indicate about the model's focus?
A. The model is confidently focusing on a single important position.
B. The model is uncertain and is distributing focus evenly, possibly due to lack of strong signals.
C. The model has overfitted and memorized the training data.
D. The model is ignoring the input sequence entirely.
💡 Hint
Uniform weights mean no position stands out more than others.
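One common way to quantify how uniform a set of attention weights is, offered here as an assumption rather than something this page prescribes, is the Shannon entropy of each weight row. The helper name `attention_entropy` and the two example rows are made up for illustration.

```python
import numpy as np

def attention_entropy(weights):
    """Shannon entropy (in nats) of each row of attention weights."""
    w = np.clip(weights, 1e-12, 1.0)           # avoid log(0)
    return -np.sum(w * np.log(w), axis=-1)

# Illustrative weight rows over 4 positions (not taken from a real model)
uniform = np.full((1, 4), 0.25)
peaked = np.array([[0.97, 0.01, 0.01, 0.01]])

print(attention_entropy(uniform))  # equals log(4), the maximum for 4 positions
print(attention_entropy(peaked))   # much lower: most weight sits on one position
```

Entropy is highest when every position receives the same weight and drops toward zero as the weights concentrate on a single position.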
🔧 Debug
expert
Identifying error in custom attention implementation
Consider this simplified custom attention code snippet. What error will it raise when run?
```python
import torch
import torch.nn.functional as F

def custom_attention(query, key, value):
    scores = torch.matmul(query, key)  # shape mismatch possible
    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, value)
    return output

q = torch.randn(1, 3)
k = torch.randn(2, 3)
v = torch.randn(2, 4)
result = custom_attention(q, k, v)
print(result)
```
A. RuntimeError due to shape mismatch in torch.matmul(query, key)
B. RuntimeError due to shape mismatch in torch.matmul(weights, value)
C. No error; outputs a tensor of shape (1, 4)
D. SyntaxError due to missing colon in function definition
💡 Hint
Check the shapes of query and key tensors before multiplication.

Practice

1. What is the main purpose of the attention mechanism in NLP models?
easy
A. To increase the size of the input data
B. To reduce the number of layers in the model
C. To help the model focus on important parts of the input data
D. To randomly shuffle the input tokens

Solution

  1. Step 1: Understand attention's role

    Attention helps models decide which parts of the input are most important for the task.
  2. Step 2: Compare options

    Only option C ("To help the model focus on important parts of the input data") describes this focusing behavior; the other options describe unrelated actions.
  3. Final Answer:

    To help the model focus on important parts of the input data -> Option C
  4. Quick Check:

    Attention = Focus on important input [OK]
Hint: Remember: attention means focusing on key input parts [OK]
Common Mistakes:
  • Thinking attention changes input size
  • Confusing attention with model depth
  • Assuming attention shuffles data
2. Which of the following correctly represents the formula for attention weights using queries (Q), keys (K), and softmax?
easy
A. softmax(Q x K^T)
B. Q + K
C. softmax(Q - K)
D. Q x K

Solution

  1. Step 1: Recall attention weight calculation

    Attention weights are computed by multiplying queries with keys transposed, then applying softmax.
  2. Step 2: Evaluate options

    Only softmax(Q x K^T) multiplies queries by transposed keys and then normalizes with softmax. Q + K and softmax(Q - K) use the wrong operation, and Q x K omits both the transpose and the softmax.
  3. Final Answer:

    softmax(Q x K^T) -> Option A
  4. Quick Check:

    Attention weights = softmax(Q x K^T) [OK]
Hint: Attention weights = softmax of query-key dot product [OK]
Common Mistakes:
  • Using addition instead of multiplication
  • Forgetting to transpose keys
  • Skipping softmax normalization
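The formula above can be sketched in NumPy as follows. This is a minimal illustration: the helper names `softmax` and `attention_weights` and the input matrices are assumptions chosen for the example, and the sqrt(d_k) scaling used in transformers is left out to match the unscaled formula in this question.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention_weights(Q, K):
    """Return softmax(Q K^T); one row of weights per query."""
    scores = Q @ K.T                 # shape (num_queries, num_keys)
    return softmax(scores, axis=-1)

# Illustrative inputs: 2 queries, 3 keys, dimension 2
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

W = attention_weights(Q, K)
print(W.shape)         # (2, 3)
print(W.sum(axis=1))   # each row sums to 1
```

Each row of `W` is a probability distribution over the keys for one query.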
3. Given queries Q = [[1, 0]], keys K = [[1, 0], [-10, 1]], and values V = [[10, 20], [30, 40]], what is the output of the attention mechanism (using dot product and softmax)?
medium
A. [[10, 20]]
B. [[20, 30]]
C. [[20, 40]]
D. [[30, 40]]

Solution

  1. Step 1: Calculate dot products Q x K^T

    Q = [1,0], K = [[1,0],[-10,1]]; dot products: [1*1+0*0=1, 1*(-10)+0*1=-10]
  2. Step 2: Apply softmax to scores

    softmax([1,-10]) ≈ [1, 0] (e^{-10} negligible)
  3. Step 3: Compute weighted sum of values

    Output ≈ 1*[10,20] + 0*[30,40] = [[10, 20]]
  4. Step 4: Match option

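    [[10, 20]] matches option A up to rounding, since the weight on the second value is about e^{-11} and contributes almost nothing.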
  5. Final Answer:

    [[10, 20]] -> Option A
  6. Quick Check:

    Weighted sum of values = [[10, 20]] [OK]
Hint: Calculate dot, softmax, then weighted sum of values [OK]
Common Mistakes:
  • Skipping softmax normalization
  • Using keys instead of values for output
  • Incorrect dot product calculation
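The steps in this solution can be checked numerically with a short NumPy sketch. It uses the Q, K, and V from the question and unscaled dot-product attention, as the question specifies.

```python
import numpy as np

Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [-10.0, 1.0]])
V = np.array([[10.0, 20.0], [30.0, 40.0]])

# Step 1: dot products between the query and each key
scores = Q @ K.T                                   # scores are 1 and -10

# Step 2: softmax over the keys (shifted by the max for numerical stability)
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)    # nearly all weight on key 0

# Step 3: weighted sum of the values
output = weights @ V
print(np.round(output, 2))                         # [[10. 20.]]
```

The unrounded output differs from [[10, 20]] only by a few ten-thousandths, because the second weight is not exactly zero.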
4. Identify the error in this attention weight calculation code snippet:
import numpy as np
Q = np.array([[1, 0]])
K = np.array([[1, 0], [-10, 1]])
scores = np.dot(Q, K)
weights = np.exp(scores) / np.sum(np.exp(scores))
medium
A. Values are missing in the calculation
B. Softmax is applied incorrectly
C. Queries and keys have incompatible shapes
D. Keys should be transposed before dot product

Solution

  1. Step 1: Check dot product operation

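    The dot product should be taken between Q and K transposed, so that each query is scored against each key row.
  2. Step 2: Analyze code

    The code calls np.dot(Q, K) without transposing K. Because K here is square (2 x 2), the call runs without a shape error, but it computes Q x K and yields scores [[1, 0]] instead of the intended Q x K^T = [[1, -10]].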
  3. Final Answer:

    Keys should be transposed before dot product -> Option D
  4. Quick Check:

    Transpose keys before dot product [OK]
Hint: Always transpose keys before dot product with queries [OK]
Common Mistakes:
  • Forgetting to transpose keys
  • Misapplying softmax formula
  • Ignoring shape compatibility
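A corrected version of the snippet might look like the sketch below. It keeps the question's Q and K, computes both the buggy and the fixed scores for comparison, and uses a max-shifted softmax, which is an addition for numerical stability and not part of the original snippet.

```python
import numpy as np

Q = np.array([[1, 0]])
K = np.array([[1, 0], [-10, 1]])

wrong_scores = np.dot(Q, K)      # Q x K: runs, but does not compare Q with key rows
right_scores = np.dot(Q, K.T)    # Q x K^T: one score per key

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

weights = softmax(right_scores)

print(wrong_scores)   # scores 1 and 0
print(right_scores)   # scores 1 and -10
print(weights)        # almost all weight on the first key
```

With a non-square K, for example 3 keys of dimension 2, the untransposed call would raise a shape error instead of silently returning wrong scores.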
5. In a transformer model, why is scaling the dot product by the square root of the key dimension important before applying softmax?
hard
A. To increase the dot product values for better attention
B. To prevent large dot product values causing very small gradients
C. To normalize the values between 0 and 1
D. To reduce the number of keys used in attention

Solution

  1. Step 1: Understand dot product scaling

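    As the key dimension d_k grows, dot products tend to grow in magnitude. Large scores push softmax into saturated regions where its gradients are very small, which slows learning.
  2. Step 2: Role of scaling by sqrt of key dimension

    Dividing by sqrt(d_k) keeps the score magnitudes in a moderate range regardless of d_k, which stabilizes gradients and improves training.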
  3. Final Answer:

    To prevent large dot product values causing very small gradients -> Option B
  4. Quick Check:

    Scaling avoids tiny gradients in softmax [OK]
Hint: Scale dot product to keep gradients healthy [OK]
Common Mistakes:
  • Thinking scaling increases dot product
  • Confusing scaling with normalization to [0,1]
  • Assuming scaling reduces keys count
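The effect of scaling can be illustrated with a small NumPy experiment. This is a sketch under stated assumptions: the query and keys are random standard-normal vectors, d_k = 512 and the seed are arbitrary choices, and concentration of the softmax output is used as a proxy for saturation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)        # fixed seed, illustrative only
d_k = 512                             # assumed key dimension
q = rng.standard_normal(d_k)          # one query
K = rng.standard_normal((8, d_k))     # eight keys

raw = K @ q                           # unscaled scores, spread grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)           # scaled scores, spread stays near 1

w_raw, w_scaled = softmax(raw), softmax(scaled)
print("max weight, unscaled:", w_raw.max())
print("max weight, scaled:  ", w_scaled.max())
```

The unscaled weights are never less concentrated than the scaled ones, because dividing the scores by a constant greater than 1 flattens the softmax. A distribution with nearly all its weight on one key is the saturated regime in which softmax gradients become very small.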