PyTorch · ML · ~5 mins

Self-attention mechanism in PyTorch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main purpose of the self-attention mechanism in neural networks?
Self-attention helps the model focus on different parts of the input sequence to understand relationships and context better, improving tasks like language understanding.
intermediate
In self-attention, what are the Query, Key, and Value vectors?
They are vectors derived from the input that help compute attention scores: Query asks what to focus on, Key represents the content to compare against, and Value holds the actual information to be combined.
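A minimal sketch of how Q, K, and V are typically derived: each is a learned linear projection of the same input. The sizes (4 tokens, 8-dim embeddings) and the names W_q/W_k/W_v are illustrative, not from the source.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: a sequence of 4 tokens, each an 8-dim embedding.
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)

# One learned linear projection per role.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(x)  # "what am I looking for?"
K = W_k(x)  # "what do I contain?"
V = W_v(x)  # "what information do I contribute?"

print(Q.shape, K.shape, V.shape)  # each torch.Size([4, 8])
```

All three projections see the same input x; only their learned weights differ.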
intermediate
How is the attention score calculated in self-attention?
Attention scores are calculated by taking the dot product of the Query and Key vectors, then scaling and applying a softmax to get weights that sum to 1.
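The score computation above can be sketched in a few lines; Q and K are stand-ins for the projected vectors from the previous card, with illustrative shapes.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)  # illustrative query vectors
K = torch.randn(seq_len, d_k)  # illustrative key vectors

scores = Q @ K.T / math.sqrt(d_k)   # (4, 4) scaled dot products
weights = F.softmax(scores, dim=-1) # each row of weights sums to 1

print(weights.sum(dim=-1))  # every entry is 1.0
```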
advanced
Why do we scale the dot product by the square root of the key dimension in self-attention?
Scaling prevents the dot product values from becoming too large, which helps keep the softmax function stable and gradients well-behaved during training.
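A small demo of the saturation problem the scaling avoids: the same relative scores, fed to softmax at a 10x larger magnitude, collapse to a near-one-hot distribution, which starves the other positions of gradient. The numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 3.0])

soft_small = F.softmax(logits, dim=-1)       # spread-out weights
soft_large = F.softmax(logits * 10, dim=-1)  # nearly one-hot

print(soft_small)  # roughly [0.09, 0.24, 0.67]
print(soft_large)  # roughly [0.00, 0.00, 1.00]
```

Dividing by sqrt(d_k) keeps the dot products in the small-magnitude regime as the key dimension grows.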
beginner
What is the output of the self-attention mechanism?
The output is a weighted sum of the Value vectors, where weights come from the attention scores, representing a context-aware combination of input elements.
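Putting the pieces together, the output is one matrix multiply away from the attention weights. Shapes are illustrative, matching the earlier sketches.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

weights = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)  # (4, 4)
output = weights @ V                                   # (4, 8)

# Each output row is a weighted sum of all rows of V,
# i.e. a context-aware mix of the value vectors.
print(output.shape)  # torch.Size([4, 8])
```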
What does the Query vector represent in self-attention?
A. The final prediction
B. The part of the input asking what to focus on
C. The weights for the output
D. The actual information to be combined
Answer: B
Why is softmax applied to the dot product of Query and Key vectors?
A. To normalize scores into probabilities
B. To increase the dot product values
C. To reduce the size of vectors
D. To create new vectors
Answer: A
What is the role of the Value vector in self-attention?
A. It is used to calculate the dot product
B. It asks what to focus on
C. It holds the actual information to be combined
D. It normalizes the output
Answer: C
What problem does scaling the dot product by sqrt(d_k) solve?
A. Speeds up training
B. Increases the size of the output
C. Reduces the number of parameters
D. Prevents large values that make softmax unstable
Answer: D
Which of these best describes self-attention?
A. A way to relate different parts of the same input sequence
B. A method to increase dataset size
C. A technique to reduce model size
D. A way to generate random noise
Answer: A
Explain how the self-attention mechanism computes its output from input vectors.
Think about how the model decides what parts of the input to focus on.
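The full pipeline the prompt above asks you to explain can be sketched as one small module: project to Q/K/V, score, scale, softmax, then mix the values. This is a minimal single-head sketch with illustrative sizes, not an optimized or masked implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal single-head self-attention (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_model)
        weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
        return weights @ V                   # context-aware mix of values

torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # batch of 2 sequences, 5 tokens each
attn = SelfAttention(d_model=16)
print(attn(x).shape)  # torch.Size([2, 5, 16])
```

Note the output has the same shape as the input, which is what lets Transformer blocks stack attention layers.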
Why is self-attention important in models like Transformers for language tasks?
Consider how words in a sentence relate to each other.