Hard · Application · Q9 of 15
NLP - Sequence Models for NLP
Why is the dot product in scaled dot-product attention divided by the square root of the key dimension (d_k) before applying softmax?
A. To prevent the dot product values from becoming too large, which can cause gradients to vanish during training.
B. To increase the magnitude of dot products, making softmax outputs more confident.
C. To normalize the key vectors to unit length before computing attention.
D. To reduce the computational complexity of the attention mechanism.
Step-by-Step Solution
  1. Step 1: Understand dot product scale

    If the components of q and k are independent with zero mean and unit variance, the dot product q·k has variance d_k, so its typical magnitude grows as sqrt(d_k).
  2. Step 2: Effect on softmax

    Large logits push softmax into its saturated region, where outputs are nearly one-hot and the gradients flowing back are vanishingly small.
  3. Step 3: Scaling purpose

    Dividing by sqrt(d_k) restores roughly unit variance in the logits, keeping softmax in a regime with useful gradients.
  4. Final Answer:

    To prevent the dot product values from becoming too large, which can cause gradients to vanish during training. -> Option A
  5. Quick Check:

    Scaling prevents large logits and the resulting vanishing-gradient problem. ✓
Quick Trick: Scale the dot product by 1/sqrt(d_k) to avoid softmax saturation. ✓
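The effect described in the steps above can be demonstrated numerically. The sketch below (using NumPy; the key count of 8 and d_k = 512 are arbitrary illustration choices, not values from the question) compares softmax over raw dot products with softmax over dot products scaled by 1/sqrt(d_k):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512                              # key/query dimension (illustrative)
q = rng.standard_normal(d_k)           # one query vector
K = rng.standard_normal((8, d_k))      # 8 key vectors

raw = K @ q                            # logits with variance ~ d_k
scaled = raw / np.sqrt(d_k)            # logits with variance ~ 1

p_raw = softmax(raw)
p_scaled = softmax(scaled)

# Unscaled logits are large, so softmax concentrates almost all
# attention on a single key; the scaled version stays softer.
print("std of raw logits:   ", raw.std())
print("std of scaled logits:", scaled.std())
print("max attention weight, unscaled:", p_raw.max())
print("max attention weight, scaled:  ", p_scaled.max())
```

Because dividing all logits by sqrt(d_k) > 1 acts like raising the softmax temperature, the scaled distribution is always less peaked than the unscaled one, which is exactly the regime where softmax gradients do not vanish.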
Common Mistakes:
  • Thinking scaling increases confidence
  • Confusing scaling with normalization of vectors
  • Assuming scaling reduces computation
