Practice

1. What is the main purpose of the attention mechanism in NLP models?

easy

A. To reduce the number of layers in the model

B. To focus on important parts of the input data

C. To increase the size of the input data

D. To randomly shuffle the input tokens

Solution

Step 1: Understand the role of attention
Attention helps the model decide which parts of the input are important to look at when making predictions.
Step 2: Compare options with the concept
Only option B, "To focus on important parts of the input data", describes this behavior. The other options concern model depth, input size, or token order, none of which is what attention does.
Final Answer:
To focus on important parts of the input data -> Option B
Quick Check:
Attention = Focus on important input [OK]

Hint: Attention means focusing on key input parts [OK]

Common Mistakes:

Thinking attention increases input size
Confusing attention with model depth
Assuming attention shuffles data

2. Which of the following correctly represents the formula to compute attention weights using query (Q) and key (K) vectors?

easy

A. Sigmoid(Q - K)

B. Softmax(Q + K)

C. ReLU(Q x K)

D. Softmax(Q x K^T)

Solution

Step 1: Recall attention weight calculation
Attention weights are computed by taking the dot product of query and key vectors, then applying softmax.
Step 2: Match formula to options
Softmax(Q x K^T) shows softmax applied to Q multiplied by the transpose of K, which is correct.
Final Answer:
Softmax(Q x K^T) -> Option D
Quick Check:
Attention weights = softmax(dot product) [OK]

Hint: Attention weights = softmax of query-key dot product [OK]

Common Mistakes:

Adding Q and K instead of dot product
Using ReLU or Sigmoid instead of softmax
Ignoring transpose on key vector
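
The formula in option D can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names `softmax` and `attention_weights`, the example matrices, and the row-per-vector layout of Q and K are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max does not change the result but avoids overflow.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention_weights(Q, K):
    # Assumed layout: Q is (num_queries, d), K is (num_keys, d), one vector per row.
    scores = Q @ K.T                 # (num_queries, num_keys) dot products
    return softmax(scores, axis=-1)  # normalize over the keys for each query

# Hypothetical inputs: 2 queries, 3 keys, dimension 2.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

W = attention_weights(Q, K)
print(W.shape)        # (2, 3)
print(W.sum(axis=1))  # each row sums to 1
```

Each row of `W` is a probability distribution over the keys for one query, which is why softmax is used rather than ReLU or sigmoid.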

3. Given query vector Q = [1, 0], key vectors K1 = [1, 0], K2 = [0, 1], and value vectors V1 = [10, 0], V2 = [0, 20], what is the attention output after applying softmax on Q·K^T and multiplying by values?

medium

A. [10, 0]

B. [5, 10]

C. [7.31, 5.38]

D. [0, 20]

Solution

Step 1: Calculate dot products Q·K1 and Q·K2
Q·K1 = 1*1 + 0*0 = 1; Q·K2 = 1*0 + 0*1 = 0.
Step 2: Apply softmax to [1, 0]
Softmax(1,0) = [e^1/(e^1+e^0), e^0/(e^1+e^0)] ≈ [0.731, 0.269].
Step 3: Multiply weights by values and sum
Output = 0.731*[10,0] + 0.269*[0,20] = [7.31, 0] + [0,5.38] = [7.31, 5.38].
Step 4: Match to options
The computed output, approximately [7.31, 5.38], matches option C.
Final Answer:
[7.31, 5.38] -> Option C
Quick Check:
Softmax weights x values = output [OK]

Hint: Softmax weights times values gives attention output [OK]

Common Mistakes:

Skipping softmax normalization
Multiplying query with values directly
Ignoring vector multiplication order
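
The arithmetic in the solution above can be checked with a few lines of NumPy. This is only a verification sketch; stacking K1, K2 and V1, V2 as matrix rows is a layout chosen for the example.

```python
import numpy as np

Q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])    # rows are K1, K2
V = np.array([[10.0, 0.0], [0.0, 20.0]])  # rows are V1, V2

scores = K @ Q                                   # [Q·K1, Q·K2] = [1, 0]
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the two keys
output = weights @ V                             # weighted sum of V1 and V2

print(np.round(weights, 3))  # [0.731 0.269]
print(np.round(output, 2))   # [7.31 5.38]
```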

4. Identify the error in this attention weight calculation code snippet:

import numpy as np
Q = np.array([1, 2])
K = np.array([[1, 0], [0, 1]])
scores = np.dot(Q, K)
weights = np.exp(scores) / np.sum(np.exp(scores))

medium

A. Dot product should be between Q and K transpose

B. Softmax calculation is incorrect

C. Q and K should be swapped in dot product

D. No error, code is correct

Solution

Step 1: Check dot product dimensions
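Q has shape (2,) and K has shape (2, 2), with each row of K holding one key vector. np.dot(Q, K) returns shape (2,), but it combines Q with the columns of K rather than with its rows. Because K here is the identity matrix, the numbers happen to coincide with the correct scores; for a general K they would not.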
Step 2: Correct dot product usage
Dot product should be np.dot(Q, K.T) to get scores for each key vector.
Final Answer:
Dot product should be between Q and K transpose -> Option A
Quick Check:
Dot product with K transpose needed [OK]

Hint: Dot product query with key transpose for scores [OK]

Common Mistakes:

Using K instead of K transpose
Miscomputing softmax manually
Swapping Q and K incorrectly
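
In the snippet from the question, K is the identity matrix, so K and its transpose are equal and the mistake is invisible in the output. The sketch below swaps in a non-symmetric K (hypothetical values chosen for illustration) so the difference between the two calls shows up; it assumes each row of K is one key vector.

```python
import numpy as np

Q = np.array([1.0, 2.0])
K = np.array([[1.0, 0.0],   # key 1
              [3.0, 1.0]])  # key 2 (hypothetical, makes K non-symmetric)

wrong = np.dot(Q, K)    # pairs Q with the columns of K
right = np.dot(Q, K.T)  # pairs Q with each key (row of K)

print(wrong)  # [7. 2.]
print(right)  # [1. 5.]

# Softmax over the correct scores, as in the original snippet.
weights = np.exp(right) / np.sum(np.exp(right))
print(weights.sum())
```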

5. In a transformer model, why is scaling the dot product by the square root of the key dimension important before applying softmax?

hard

A. To prevent large dot product values causing softmax to produce very small gradients

B. To increase the dot product values for better attention

C. To normalize the query vectors only

D. To reduce the number of keys processed

Solution

Step 1: Understand dot product scaling
Without scaling, large dot product values can make softmax outputs very close to 0 or 1, causing gradients to vanish during training.
Step 2: Purpose of scaling by sqrt of key dimension
Scaling reduces the magnitude of dot products, keeping softmax outputs more balanced and gradients healthy.
Final Answer:
To prevent large dot product values causing softmax to produce very small gradients -> Option A
Quick Check:
Scaling avoids gradient vanishing in softmax [OK]

Hint: Scale dot product to keep softmax gradients stable [OK]

Common Mistakes:

Thinking scaling increases dot product values
Believing scaling normalizes queries only
Assuming scaling reduces keys processed
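
The effect of scaling can be illustrated with random vectors. This is a rough sketch under stated assumptions: the dimension `d_k = 512`, the 8 keys, the standard normal entries, and the fixed seed are all arbitrary choices for the demonstration, not values from any particular model.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax for a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512                             # assumed key dimension
q = rng.standard_normal(d_k)          # one query
keys = rng.standard_normal((8, d_k))  # eight keys, one per row

raw = keys @ q               # unscaled scores; spread grows roughly like sqrt(d_k)
scaled = raw / np.sqrt(d_k)  # scaled scores; spread stays roughly around 1

# The largest weight shows how saturated the distribution is.
print("largest weight, unscaled:", softmax(raw).max())
print("largest weight, scaled:  ", softmax(scaled).max())
```

Dividing the scores by a constant greater than 1 always flattens the softmax output, so the scaled distribution is less peaked. A less peaked distribution keeps the softmax away from the saturated region where its gradients are close to zero.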

Epoch	Loss ↓	Accuracy ↑	Observation
1	1.2	0.45	Model starts learning, loss high, accuracy low
2	0.9	0.60	Loss decreases, accuracy improves as attention helps
3	0.7	0.72	Model better focuses on important words
4	0.5	0.80	Attention weights refine, improving predictions
5	0.4	0.85	Training converges with good attention learning

Attention mechanism basics in NLP - Model Pipeline Trace
