Practice

1. What is the main purpose of the attention mechanism in NLP models?

easy

A. To increase the size of the input data

B. To reduce the number of layers in the model

C. To help the model focus on important parts of the input data

D. To randomly shuffle the input tokens

Solution

Step 1: Understand attention's role
Attention helps models decide which parts of the input are most important for the task.
Step 2: Compare options
Only option C, "To help the model focus on important parts of the input data", describes this focusing behavior; the other options describe unrelated actions.
Final Answer:
To help the model focus on important parts of the input data -> Option C
Quick Check:
Attention = Focus on important input [OK]

Hint: Attention means focusing on the most relevant parts of the input [OK]

Common Mistakes:

Thinking attention changes input size
Confusing attention with model depth
Assuming attention shuffles data

2. Which of the following correctly represents the formula for attention weights using queries (Q), keys (K), and softmax?

easy

A. softmax(Q x K^T)

B. Q + K

C. softmax(Q - K)

D. Q x K

Solution

Step 1: Recall attention weight calculation
Attention weights are computed by multiplying queries with keys transposed, then applying softmax.
Step 2: Evaluate options
Only option A, softmax(Q x K^T), multiplies queries by transposed keys and then normalizes with softmax. The other options use the wrong operation or omit the softmax.
Final Answer:
softmax(Q x K^T) -> Option A
Quick Check:
Attention weights = softmax(Q x K^T) [OK]

Hint: Attention weights = softmax of query-key dot product [OK]

Common Mistakes:

Using addition instead of multiplication
Forgetting to transpose keys
Skipping softmax normalization
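
As a rough illustration of the formula in question 2, the sketch below computes unscaled attention weights with NumPy. The function name attention_weights and the example matrices are made up for this illustration; the max-subtraction is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def attention_weights(Q, K):
    """Row-wise softmax of Q @ K.T (unscaled dot-product scores)."""
    scores = Q @ K.T                                       # shape: (num_queries, num_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stability shift, softmax is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Hypothetical inputs: 2 queries and 3 keys, each of dimension 2
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

weights = attention_weights(Q, K)
print(weights.shape)         # one row per query, one column per key
print(weights.sum(axis=-1))  # each row sums to 1 (up to floating-point error)
```

Each row of the result is a probability distribution over the keys for one query, which is why the transpose and the softmax both matter.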

3. Given queries Q = [[1, 0]], keys K = [[1, 0], [-10, 1]], and values V = [[10, 20], [30, 40]], what is the approximate output of the attention mechanism (using unscaled dot-product scores and softmax)?

medium

A. [[10, 20]]

B. [[20, 30]]

C. [[20, 40]]

D. [[30, 40]]

Solution

Step 1: Calculate dot products Q x K^T
Q = [1,0], K = [[1,0],[-10,1]]; dot products: [1*1+0*0=1, 1*(-10)+0*1=-10]
Step 2: Apply softmax to scores
softmax([1,-10]) ≈ [0.99998, 0.00002] ≈ [1, 0] (the second weight is e^{-11} / (1 + e^{-11}), which is negligible)
Step 3: Compute weighted sum of values
Output ≈ 1*[10,20] + 0*[30,40] = [[10, 20]]
Step 4: Match option
[[10, 20]] matches after rounding; the exact output is slightly above it, by about 0.0003 per component.
Final Answer:
[[10, 20]] -> Option A
Quick Check:
Weighted sum of values = [[10, 20]] [OK]

Hint: Calculate dot, softmax, then weighted sum of values [OK]

Common Mistakes:

Skipping softmax normalization
Using keys instead of values for output
Incorrect dot product calculation
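
The arithmetic in question 3 can be checked numerically. This is a minimal sketch using the Q, K, and V from the question; the variable names are chosen here for illustration.

```python
import numpy as np

Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [-10.0, 1.0]])
V = np.array([[10.0, 20.0], [30.0, 40.0]])

# Step 1: dot products of the query with each key
scores = Q @ K.T                                   # values 1 and -10

# Step 2: softmax over the keys (shifted by the max for numerical stability)
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Step 3: weighted sum of the value rows
output = weights @ V

print(np.round(output, 2))  # rounds to [[10. 20.]]
```

Without rounding, the output is slightly larger than [[10, 20]] because the second key still receives a tiny nonzero weight.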

4. Identify the error in this attention weight calculation code snippet:

import numpy as np
Q = np.array([[1, 0]])
K = np.array([[1, 0], [-10, 1]])
scores = np.dot(Q, K)
weights = np.exp(scores) / np.sum(np.exp(scores))

medium

A. Values are missing in the calculation

B. Softmax is applied incorrectly

C. Queries and keys have incompatible shapes

D. Keys should be transposed before dot product

Solution

Step 1: Check dot product operation
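Scores come from the dot product of each query with each key, i.e. Q x K^T, so that every query row is paired with every key row.
Step 2: Analyze code
The code calls np.dot(Q, K) without transposing K. Because K is square (2x2) here, the call runs without a shape error, but it multiplies Q against the columns of K and yields scores [[1, 0]] instead of the correct [[1, -10]].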
Final Answer:
Keys should be transposed before dot product -> Option D
Quick Check:
Transpose keys before dot product [OK]

Hint: Always transpose keys before dot product with queries [OK]

Common Mistakes:

Forgetting to transpose keys
Misapplying softmax formula
Ignoring shape compatibility
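
To see why the missing transpose in question 4 is easy to overlook, the sketch below compares both calls on the same Q and K. The names wrong_scores and right_scores are illustrative only.

```python
import numpy as np

Q = np.array([[1, 0]])
K = np.array([[1, 0], [-10, 1]])

wrong_scores = np.dot(Q, K)    # pairs the query with the COLUMNS of K
right_scores = np.dot(Q, K.T)  # pairs the query with each key (the ROWS of K)

print(wrong_scores.tolist())   # [[1, 0]]
print(right_scores.tolist())   # [[1, -10]]

# Both results have the same shape because K happens to be square,
# so NumPy raises no error and the bug is silent.
print(wrong_scores.shape == right_scores.shape)  # True
```

With a non-square K (for example 3 keys of dimension 2), np.dot(Q, K) would raise a shape error instead, which makes the mistake easier to catch.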

5. In a transformer model, why is scaling the dot product by the square root of the key dimension important before applying softmax?

hard

A. To increase the dot product values for better attention

B. To prevent large dot product values causing very small gradients

C. To normalize the values between 0 and 1

D. To reduce the number of keys used in attention

Solution

Step 1: Understand dot product scaling
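As the key dimension d_k grows, dot products tend to grow in magnitude. Large scores push softmax into saturated regions where its gradients are very small, which slows learning.
Step 2: Role of scaling by sqrt of key dimension
Dividing the scores by sqrt(d_k) keeps their magnitude roughly independent of d_k, which keeps softmax out of saturation and stabilizes gradients during training.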
Final Answer:
To prevent large dot product values causing very small gradients -> Option B
Quick Check:
Scaling avoids tiny gradients in softmax [OK]

Hint: Scale dot product to keep gradients healthy [OK]

Common Mistakes:

Thinking scaling increases dot product
Confusing scaling with normalization to [0,1]
Assuming scaling reduces keys count
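
The effect described in question 5 can be observed with random vectors. This is a sketch under assumptions: the dimension d_k = 512, the 8 keys, the random seed, and the helper softmax are all chosen for illustration, and the printed numbers depend on the seed.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with a max shift for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)       # fixed seed for reproducibility
d_k = 512                            # assumed key dimension
q = rng.standard_normal((1, d_k))    # one query
K = rng.standard_normal((8, d_k))    # eight keys

raw = q @ K.T                        # unscaled scores, spread grows with d_k
scaled = raw / np.sqrt(d_k)          # scaled scores, spread stays near 1

w_raw = softmax(raw)
w_scaled = softmax(scaled)

# Diagonal of the softmax Jacobian, w * (1 - w): close to zero when a
# weight is near 0 or 1, i.e. when softmax is saturated.
grad_raw = w_raw * (1 - w_raw)
grad_scaled = w_scaled * (1 - w_scaled)

print("score std (raw, scaled):", raw.std(), scaled.std())
print("largest weight (raw, scaled):", w_raw.max(), w_scaled.max())
print("sum of w*(1-w) (raw, scaled):", grad_raw.sum(), grad_scaled.sum())
```

The unscaled weights are typically close to one-hot, so the w * (1 - w) terms are tiny, while the scaled weights stay spread out and keep usable gradients.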
