Recall & Review
beginner
What is the main purpose of the attention mechanism in neural networks?
The attention mechanism helps the model focus on the most important parts of the input data when making predictions, similar to how humans pay attention to relevant information.
intermediate
Explain the difference between 'soft' and 'hard' attention.
Soft attention assigns weights to all input parts and computes a weighted sum, allowing smooth focus. Hard attention selects one part of the input, making it discrete and non-differentiable, often requiring special training methods.
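The soft/hard contrast can be sketched in a few lines of NumPy (the scores and value vectors below are illustrative, not from the card):

```python
import numpy as np

scores = np.array([2.0, 0.5, 1.0])            # raw attention scores
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])               # one value vector per input part

# Soft attention: softmax weights over ALL parts, then a weighted sum.
# Smooth and differentiable, so it trains with plain backprop.
weights = np.exp(scores) / np.exp(scores).sum()
soft_out = weights @ values

# Hard attention: pick exactly ONE part (here, the argmax). The choice is
# discrete and non-differentiable, hence the special training methods.
hard_out = values[np.argmax(scores)]
```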
beginner
What are the three main components of the scaled dot-product attention?
The three components are Query (Q), Key (K), and Value (V). The attention score is computed by comparing Q with K, then used to weight V for the output.
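Putting the three components together, scaled dot-product attention can be sketched as follows (a minimal NumPy sketch; the shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # compare each query with each key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                        # use the weights to mix the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))   # 2 queries,  d_k = 4
K = rng.standard_normal((3, 4))   # 3 keys,     d_k = 4
V = rng.standard_normal((3, 5))   # 3 values,   d_v = 5
out = scaled_dot_product_attention(Q, K, V)   # shape (2, 5)
```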
intermediate
Why do we scale the dot product by the square root of the key dimension in scaled dot-product attention?
Scaling by the square root of the key dimension keeps the dot products from growing with d_k. Without it, large scores push the softmax into a saturated regime where gradients are tiny and learning slows.
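This effect is easy to see numerically: for random unit-variance vectors, the standard deviation of the dot product grows like sqrt(d_k), and dividing by sqrt(d_k) brings it back to about 1. A small sketch (sample sizes and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
stds = {}
for d_k in (4, 64, 1024):
    q = rng.standard_normal((2000, d_k))
    k = rng.standard_normal((2000, d_k))
    raw = (q * k).sum(axis=1)                 # dot products: std grows ~ sqrt(d_k)
    stds[d_k] = (raw.std(), (raw / np.sqrt(d_k)).std())  # scaled: std stays ~ 1
```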
intermediate
How does multi-head attention improve the model's ability to focus on different parts of the input?
Multi-head attention runs several attention mechanisms in parallel, each focusing on different parts or aspects of the input, allowing the model to capture diverse information.
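A minimal sketch of the idea, assuming the common setup where the model dimension is split evenly across heads (all names, shapes, and weight matrices here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Run n_heads attentions in parallel, each on its own slice (subspace)."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):                  # each head attends in its own subspace
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo   # mix the heads back together

rng = np.random.default_rng(1)
d_model, T = 8, 5
W = lambda: rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
out = multi_head_attention(rng.standard_normal((T, d_model)),
                           W(), W(), W(), W(), n_heads=2)   # shape (5, 8)
```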
What does the 'Query' represent in the attention mechanism?
The Query is compared against the Keys to compute the attention scores.
Why is softmax used in attention mechanisms?
Softmax converts raw attention scores into probabilities that sum to 1.
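That conversion is a one-liner (scores below are illustrative):

```python
import numpy as np

scores = np.array([3.0, 1.0, 0.2])            # raw attention scores
weights = np.exp(scores) / np.exp(scores).sum()
# weights are all positive and sum to 1, so they act as a probability
# distribution over the input parts; the largest score gets the most weight
```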
Which of these is NOT a benefit of multi-head attention?
Reducing the parameter count is not a benefit: multi-head attention adds projection matrices for each head, so it does not shrink the model.
What problem does the attention mechanism help solve in sequence models?
Attention helps models remember and focus on important parts of long sequences.
In scaled dot-product attention, what happens after computing the dot product between Query and Key?
The dot product is scaled and softmaxed to produce attention weights.
Describe how the attention mechanism works step-by-step in a neural network.
Hint: think about how the model decides what to focus on.
Explain why multi-head attention is more powerful than single-head attention.
Hint: imagine looking at a picture from different angles.