PyTorch · ML · ~5 mins

Self-attention mechanism in PyTorch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main purpose of the self-attention mechanism in neural networks?
Self-attention helps the model focus on different parts of the input sequence to understand relationships and context better, improving tasks like language understanding.
intermediate
In self-attention, what are the Query, Key, and Value vectors?
They are vectors derived from the input that help compute attention scores: Query asks what to focus on, Key represents the content to compare against, and Value holds the actual information to be combined.
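A minimal sketch of how Q, K, and V are typically derived: each is a learned linear projection of the same input. The sizes (4 tokens, 8-dim embeddings) and the names W_q/W_k/W_v are illustrative, not from the source.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: a sequence of 4 tokens, each an 8-dim embedding.
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)

# One learned linear projection per role.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(x)  # "what am I looking for?"
K = W_k(x)  # "what do I contain?"
V = W_v(x)  # "what information do I contribute?"

print(Q.shape, K.shape, V.shape)  # each torch.Size([4, 8])
```

All three projections see the same input x; only their learned weights differ.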
intermediate
How is the attention score calculated in self-attention?
Attention scores are calculated by taking the dot product of the Query and Key vectors, then scaling and applying a softmax to get weights that sum to 1.
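The score computation above can be sketched in a few lines; Q and K are stand-ins for the projected vectors from the previous card, with illustrative shapes.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)  # illustrative query vectors
K = torch.randn(seq_len, d_k)  # illustrative key vectors

scores = Q @ K.T / math.sqrt(d_k)   # (4, 4) scaled dot products
weights = F.softmax(scores, dim=-1) # each row of weights sums to 1

print(weights.sum(dim=-1))  # every entry is 1.0
```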
advanced
Why do we scale the dot product by the square root of the key dimension in self-attention?
Scaling prevents the dot product values from becoming too large, which helps keep the softmax function stable and gradients well-behaved during training.
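A small demo of the saturation problem the scaling avoids: the same relative scores, fed to softmax at a 10x larger magnitude, collapse to a near-one-hot distribution, which starves the other positions of gradient. The numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 3.0])

soft_small = F.softmax(logits, dim=-1)       # spread-out weights
soft_large = F.softmax(logits * 10, dim=-1)  # nearly one-hot

print(soft_small)  # roughly [0.09, 0.24, 0.67]
print(soft_large)  # roughly [0.00, 0.00, 1.00]
```

Dividing by sqrt(d_k) keeps the dot products in the small-magnitude regime as the key dimension grows.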
beginner
What is the output of the self-attention mechanism?
The output is a weighted sum of the Value vectors, where weights come from the attention scores, representing a context-aware combination of input elements.
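Putting the pieces together, the output is one matrix multiply away from the attention weights. Shapes are illustrative, matching the earlier sketches.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

weights = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)  # (4, 4)
output = weights @ V                                   # (4, 8)

# Each output row is a weighted sum of all rows of V,
# i.e. a context-aware mix of the value vectors.
print(output.shape)  # torch.Size([4, 8])
```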
What does the Query vector represent in self-attention?
A. The final prediction
B. The part of the input asking what to focus on
C. The weights for the output
D. The actual information to be combined
Answer: B
Why is softmax applied to the dot product of Query and Key vectors?
A. To normalize scores into probabilities
B. To increase the dot product values
C. To reduce the size of vectors
D. To create new vectors
Answer: A
What is the role of the Value vector in self-attention?
A. It is used to calculate the dot product
B. It asks what to focus on
C. It holds the actual information to be combined
D. It normalizes the output
Answer: C
What problem does scaling the dot product by sqrt(d_k) solve?
A. Speeds up training
B. Increases the size of the output
C. Reduces the number of parameters
D. Prevents large values that make softmax unstable
Answer: D
Which of these best describes self-attention?
A. A way to relate different parts of the same input sequence
B. A method to increase dataset size
C. A technique to reduce model size
D. A way to generate random noise
Answer: A
Explain how the self-attention mechanism computes its output from input vectors.
Think about how the model decides what parts of the input to focus on.
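The full pipeline the prompt above asks you to explain can be sketched as one small module: project to Q/K/V, score, scale, softmax, then mix the values. This is a minimal single-head sketch with illustrative sizes, not an optimized or masked implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Minimal single-head self-attention (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_model)
        weights = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
        return weights @ V                   # context-aware mix of values

torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # batch of 2 sequences, 5 tokens each
attn = SelfAttention(d_model=16)
print(attn(x).shape)  # torch.Size([2, 5, 16])
```

Note the output has the same shape as the input, which is what lets Transformer blocks stack attention layers.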
Why is self-attention important in models like Transformers for language tasks?
Consider how words in a sentence relate to each other.