Hard · Application · Q9 of 15
NLP - Sequence Models for NLP
Why is the dot product in scaled dot-product attention divided by the square root of the key dimension (d_k) before applying softmax?
A. To prevent the dot product values from becoming too large, which can cause gradients to vanish during training.
B. To increase the magnitude of dot products, making softmax outputs more confident.
C. To normalize the key vectors to unit length before computing attention.
D. To reduce the computational complexity of the attention mechanism.
Step-by-Step Solution
  1. Step 1: Understand dot product scale

    If the components of q and k are independent with zero mean and unit variance, the dot product q·k has variance d_k, so its typical magnitude grows as sqrt(d_k).
  2. Step 2: Effect on softmax

    Large logits push softmax into its saturated region, where outputs are nearly one-hot and the gradients flowing back are vanishingly small.
  3. Step 3: Scaling purpose

    Dividing by sqrt(d_k) restores roughly unit variance in the logits, keeping softmax in a regime with useful gradients.
  4. Final Answer:

    To prevent the dot product values from becoming too large, which can cause gradients to vanish during training. -> Option A
  5. Quick Check:

    Scaling prevents large logits and the resulting vanishing-gradient problem. ✓
Quick Trick: Scale the dot product by 1/sqrt(d_k) to avoid softmax saturation. ✓
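The effect described in the steps above can be demonstrated numerically. The sketch below (using NumPy; the key count of 8 and d_k = 512 are arbitrary illustration choices, not values from the question) compares softmax over raw dot products with softmax over dot products scaled by 1/sqrt(d_k):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512                              # key/query dimension (illustrative)
q = rng.standard_normal(d_k)           # one query vector
K = rng.standard_normal((8, d_k))      # 8 key vectors

raw = K @ q                            # logits with variance ~ d_k
scaled = raw / np.sqrt(d_k)            # logits with variance ~ 1

p_raw = softmax(raw)
p_scaled = softmax(scaled)

# Unscaled logits are large, so softmax concentrates almost all
# attention on a single key; the scaled version stays softer.
print("std of raw logits:   ", raw.std())
print("std of scaled logits:", scaled.std())
print("max attention weight, unscaled:", p_raw.max())
print("max attention weight, scaled:  ", p_scaled.max())
```

Because dividing all logits by sqrt(d_k) > 1 acts like raising the softmax temperature, the scaled distribution is always less peaked than the unscaled one, which is exactly the regime where softmax gradients do not vanish.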
Common Mistakes:
  • Thinking scaling increases confidence
  • Confusing scaling with normalization of vectors
  • Assuming scaling reduces computation
