NLP - Sequence Models for NLP
Why is the dot product in scaled dot-product attention divided by the square root of the key dimension (d_k) before applying softmax?
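To see the intuition behind this question: for query and key vectors with i.i.d. zero-mean, unit-variance components, the dot product has variance d_k, so large dot products push softmax into a saturated, vanishing-gradient regime; dividing by sqrt(d_k) restores unit variance. A minimal NumPy sketch (the dimensions and sample count here are illustrative choices, not from the quiz):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512      # key dimension (illustrative)
n = 10_000     # number of sampled query/key pairs

# Query and key components drawn i.i.d. with zero mean, unit variance
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

dots = np.sum(q * k, axis=1)       # raw dot products: variance grows like d_k
scaled = dots / np.sqrt(d_k)       # scaled dot products: variance stays near 1

print(dots.var())    # close to d_k (512) for this sample
print(scaled.var())  # close to 1 for this sample
```

Without the 1/sqrt(d_k) factor, softmax over such large-magnitude logits concentrates nearly all probability on one position, and its gradients shrink toward zero.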
15+ quiz questions · All difficulty levels · Free