NLP · ML · ~5 mins

Attention mechanism in depth in NLP - Cheat Sheet & Quick Revision

Recall & Review
[beginner]
Q: What is the main purpose of the attention mechanism in neural networks?
A: The attention mechanism lets the model focus on the most relevant parts of the input when making predictions, much as humans focus on the information that matters.
[intermediate]
Q: Explain the difference between 'soft' and 'hard' attention.
A: Soft attention assigns weights to all input parts and computes a weighted sum, giving a smooth, fully differentiable focus. Hard attention selects a single part of the input; this discrete choice is non-differentiable and typically requires special training methods such as reinforcement learning.
[beginner]
Q: What are the three main components of scaled dot-product attention?
A: The Query (Q), Key (K), and Value (V). Attention scores are computed by comparing Q against K, and those scores are then used to weight V to produce the output.
[intermediate]
Q: Why do we scale the dot product by the square root of the key dimension in scaled dot-product attention?
A: Scaling by the square root of the key dimension keeps the dot products from growing too large as the dimension increases; very large scores push the softmax into saturated regions with tiny gradients, which slows learning.
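The Q/K/V recipe and the square-root scaling above can be sketched in a few lines of NumPy (a minimal illustration, not a library implementation; the shapes and function names here are my own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = K.shape[-1]
    # Compare every query with every key, then scale by sqrt(d_k)
    # so the logits don't grow with the key dimension.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the values
```

When all queries and keys are identical, every row of `weights` is uniform, so the output is simply the mean of the value rows, which is a quick sanity check.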
[intermediate]
Q: How does multi-head attention improve the model's ability to focus on different parts of the input?
A: Multi-head attention runs several attention mechanisms in parallel, each attending to different positions or representation subspaces of the input, so the model can capture diverse kinds of information at once.
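Running several such heads in parallel, then concatenating and projecting their outputs, can be sketched as below (a toy sketch: the sizes, weight names, and random projections are hypothetical, not taken from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    # Scaled dot-product attention; softmax rows sum to 1.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, n_heads, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q/W_k/W_v: per-head (d_model, d_head) matrices."""
    # Each head projects X into its own subspace and attends there.
    heads = [attention(X @ W_q[h], X @ W_k[h], X @ W_v[h])
             for h in range(n_heads)]
    # Concatenate the per-head outputs and mix them with a final projection.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy setup: d_model = 8, two heads of size 4.
d_model, n_heads, d_head = 8, 2, 4
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
X = rng.normal(size=(5, d_model))
out = multi_head_attention(X, n_heads, W_q, W_k, W_v, W_o)  # (5, 8)
```

Each head sees the same input but through different learned projections, which is what lets the heads specialize on different relationships.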
Q: What does the 'Query' represent in the attention mechanism?
A. The information used to compare with keys
B. The part of the input we want to focus on
C. The output of the attention layer
D. The weights assigned to input tokens
Q: Why is softmax used in attention mechanisms?
A. To select the maximum value only
B. To increase the size of the input
C. To reduce the number of parameters
D. To normalize attention scores into probabilities
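The normalizing role of softmax is easy to check numerically (a minimal sketch with made-up scores):

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])            # raw attention logits
weights = np.exp(scores) / np.exp(scores).sum()
# The weights form a valid probability distribution over the inputs,
# and higher scores get proportionally more attention.
assert np.isclose(weights.sum(), 1.0)
```

Note that softmax is smooth: every input keeps a nonzero weight, unlike an argmax that would pick only the top-scoring position.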
Which of these is NOT a benefit of multi-head attention?
ACaptures information from different representation subspaces
BAllows the model to attend to multiple positions simultaneously
CReduces the total number of parameters drastically
DImproves the model's ability to understand complex relationships
Q: What problem does the attention mechanism help solve in sequence models?
A. Vanishing gradients in deep networks
B. Difficulty in remembering long-range dependencies
C. Overfitting on small datasets
D. Reducing training time by skipping layers
Q: In scaled dot-product attention, what happens after computing the dot product between Query and Key?
A. The result is scaled and passed through softmax to get weights
B. The result is multiplied by the Value directly
C. The result is ignored and only the Value is used
D. The result is passed through a ReLU activation
Q: Describe how the attention mechanism works step-by-step in a neural network.
Hint: Think about how the model decides what to focus on.
Q: Explain why multi-head attention is more powerful than single-head attention.
Hint: Imagine looking at the same picture from different angles.