Self-attention and multi-head attention in NLP - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is self-attention in simple terms?
Self-attention lets a model look at every word in a sentence at once and decide, for each word, which other words matter most for understanding it.
intermediate
Why do we use multi-head attention instead of a single attention head?
Multi-head attention lets the model look at the sentence from different views or angles at the same time, helping it understand more details and relationships.
intermediate
In self-attention, what are queries, keys, and values?
Queries, keys, and values are three sets of vectors computed from the input words. The model compares each query with all keys to score how relevant every word is, then uses those scores to take a weighted combination of the values.
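The query/key/value comparison described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names `softmax` and `self_attention` and the toy input `x` are assumptions made for this example, and the division by sqrt(d_k) follows the standard scaled dot-product formulation, which this cheat sheet does not spell out.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compare this query with every key (dot product, scaled by sqrt(d_k)).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy 3-word input (hypothetical). In a real model, queries, keys, and
# values come from separate learned projections of the word embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = self_attention(x, x, x)
print(result)
```

Each output row is a weighted mix of all value vectors, so every word's new representation carries information from the words it attended to.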
beginner
How does self-attention help in understanding the meaning of a word in a sentence?
Self-attention helps by giving more focus to words that matter for understanding a word’s meaning, like paying attention to related words nearby or far away in the sentence.
intermediate
What is the main benefit of using multi-head attention in models like Transformers?
It allows the model to capture different types of relationships and features in the data simultaneously, which helps on tasks like translation or text understanding.
What does self-attention allow a model to do?
A. Ignore the order of words completely
B. Look at all words in a sentence to find important ones
C. Only focus on the first word in a sentence
D. Translate sentences without any context
Why is multi-head attention better than single-head attention?
A. It looks at the input from multiple perspectives at once
B. It uses less memory
C. It ignores irrelevant words
D. It only focuses on one word at a time
In self-attention, what is the role of the 'keys'?
A. They are ignored during attention
B. They store the final output
C. They represent the sentence length
D. They are compared with queries to find important words
Which of these is NOT a benefit of self-attention?
A. Capturing relationships between distant words
B. Understanding word importance in context
C. Reducing the size of the input data
D. Allowing parallel processing of words
What does each 'head' in multi-head attention do?
A. Focuses on different parts or features of the input
B. Processes the entire sentence identically
C. Removes irrelevant words
D. Generates random outputs
Explain how self-attention works using a simple example of a sentence.
Think about how a word in a sentence can 'look' at other words to understand its meaning better.
Describe why multi-head attention improves model understanding compared to single-head attention.
Imagine looking at a problem from different angles to get a fuller picture.

      Practice

      1. What is the main purpose of self-attention in natural language processing?
      easy
      A. To reduce the size of the input data by removing words
      B. To generate random sentences without context
      C. To translate text from one language to another
      D. To let the model focus on important words by comparing all words to each other

      Solution

      1. Step 1: Understand self-attention's role

        Self-attention helps the model look at all words in a sentence and decide which ones are important by comparing them to each other.
      2. Step 2: Match purpose with options

        Option D ("To let the model focus on important words by comparing all words to each other") describes this mechanism; the other options describe unrelated tasks.
      3. Final Answer:

        To let the model focus on important words by comparing all words to each other -> Option D
      4. Quick Check:

        Self-attention = focus on important words [OK]
      Hint: Self-attention means comparing words to find importance
      Common Mistakes:
      • Confusing self-attention with translation
      • Thinking self-attention removes words
      • Assuming it generates random text
      2. Which of the following is the correct way to describe multi-head attention?
      easy
      A. Running several self-attention processes in parallel to get richer understanding
      B. Applying self-attention only once on the input
      C. Using attention only on the first word of a sentence
      D. Ignoring word relationships and focusing on word order only

      Solution

      1. Step 1: Recall multi-head attention definition

        Multi-head attention means running multiple self-attention operations at the same time to capture different aspects of word relationships.
      2. Step 2: Compare options to definition

        Option A ("Running several self-attention processes in parallel to get richer understanding") matches this definition; the other options describe incomplete or incorrect ideas.
      3. Final Answer:

        Running several self-attention processes in parallel to get richer understanding -> Option A
      4. Quick Check:

        Multi-head attention = multiple self-attentions [OK]
      Hint: Multi-head means many self-attentions at once
      Common Mistakes:
      • Thinking multi-head means single attention
      • Believing it focuses only on first word
      • Ignoring word relationships
      3. Given the following simplified self-attention scores matrix for a 3-word sentence:
      Scores = [[1, 0.5, 0], [0.5, 1, 0.2], [0, 0.2, 1]]
      What is the attention weight for the second word attending to the third word after applying softmax on its row?
      medium
      A. Approximately 0.21
      B. Approximately 0.50
      C. Approximately 0.29
      D. Approximately 0.70

      Solution

      1. Step 1: Extract the second row scores

        The second word's scores are [0.5, 1, 0.2].
      2. Step 2: Apply softmax to these scores

        Softmax formula: exp(score) / sum(exp(all scores in the row)). Calculate exp(0.5) ≈ 1.65, exp(1) ≈ 2.72, exp(0.2) ≈ 1.22. Sum ≈ 1.65 + 2.72 + 1.22 = 5.59. Attention weight for the third word ≈ 1.22 / 5.59 ≈ 0.219, which is closest to option A (approximately 0.21).
      3. Final Answer:

        Approximately 0.21 -> Option A
      4. Quick Check:

        Softmax normalizes scores to probabilities [OK]
      Hint: Softmax turns scores into probabilities summing to 1
      Common Mistakes:
      • Forgetting to exponentiate scores
      • Dividing by wrong sum
      • Mixing row and column values
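The softmax calculation in question 3 can be checked with a few lines of plain Python. This is only a verification sketch; the helper name `softmax` is chosen for this example, and the scores matrix is the one given in the question.

```python
import math

# Scores matrix from the question (3-word sentence).
scores = [[1, 0.5, 0], [0.5, 1, 0.2], [0, 0.2, 1]]

def softmax(row):
    # Exponentiate every score, then divide by the sum over the same row.
    exps = [math.exp(s) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores[1])   # row for the second word
third = weights[2]             # weight the second word puts on the third word
print(round(third, 3))         # 0.219
```

Note that the row index is 1 and the column index is 2 because Python counts from zero; mixing these up is the "row and column" mistake listed above.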
      4. Consider this Python code snippet for multi-head attention weights calculation:
      import numpy as np
      
      def multi_head_attention(scores_list):
          heads = []
          for scores in scores_list:
              weights = np.exp(scores) / np.sum(np.exp(scores))
              heads.append(weights)
          return np.mean(heads, axis=0)
      
      scores_list = [np.array([1, 0, 2]), np.array([0, 1, 1])]
      print(multi_head_attention(scores_list))

      What is the main bug in this code?
      medium
      A. Softmax is applied incorrectly; denominator should sum over exp(scores) per head
      B. The function returns mean of weights instead of concatenating heads
      C. The code uses np.exp twice causing overflow
      D. scores_list should be a 2D array, not a list of arrays

      Solution

      1. Step 1: Analyze softmax calculation

        Softmax is correctly applied per head by dividing exp(scores) by sum of exp(scores).
      2. Step 2: Check output aggregation

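        The function averages the weights from each head. Multi-head attention instead applies each head's weights to that head's own values and concatenates the per-head outputs (usually followed by a linear projection); it does not average attention weights element-wise.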
      3. Final Answer:

        The function returns mean of weights instead of concatenating heads -> Option B
      4. Quick Check:

        Multi-head attention concatenates head outputs rather than averaging weights [OK]
      Hint: Multi-head attention concatenates head outputs; it does not average weights
      Common Mistakes:
      • Thinking averaging weights is correct
      • Confusing softmax denominator
      • Assuming input format is wrong
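One way to picture the fix for question 4 is sketched below in plain Python for a single query position. Everything beyond the two score vectors is an assumption made for illustration: the function name `multi_head_output` and the per-head value vectors in `values_list` are hypothetical, and the final linear projection that real models apply after concatenation is left out.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_output(scores_list, values_list):
    """For one query position: weight each head's values, then concatenate."""
    concatenated = []
    for scores, values in zip(scores_list, values_list):
        weights = softmax(scores)          # softmax per head, as in the snippet
        dim = len(values[0])
        head_out = [sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim)]
        concatenated.extend(head_out)      # concatenate, do not average
    return concatenated

scores_list = [[1, 0, 2], [0, 1, 1]]       # same scores as the snippet above
values_list = [                            # hypothetical per-head value vectors
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
]
out = multi_head_output(scores_list, values_list)
print(len(out))  # 4: two heads, two dimensions each
```

Concatenation keeps each head's result in its own slice of the output, so information found by one head is not blurred by another.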
      5. You want to improve a Transformer model's ability to understand complex sentences by increasing the number of attention heads from 4 to 8. What is the most likely effect of this change?
      hard
      A. The model will ignore word order completely
      B. The model will run faster but lose accuracy
      C. The model can capture more diverse word relationships but may require more computation
      D. The model will only focus on the first half of the sentence

      Solution

      1. Step 1: Understand effect of increasing attention heads

        More heads mean the model can look at different parts of the sentence simultaneously, capturing richer relationships.
      2. Step 2: Consider computational cost and accuracy

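        Each extra head produces its own attention matrix, so memory and computation tend to grow. Accuracy can improve but is not guaranteed, especially when the model dimension is fixed and each head becomes narrower.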
      3. Final Answer:

        The model can capture more diverse word relationships but may require more computation -> Option C
      4. Quick Check:

        More heads = richer focus + more compute [OK]
      Hint: More heads = more varied focus, often at higher compute cost
      Common Mistakes:
      • Assuming more heads always make model faster
      • Thinking word order is ignored
      • Believing model focuses only on part of sentence
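The trade-off in question 5 can be made concrete with some simple arithmetic. The sketch below assumes the common design in which a fixed model dimension is split evenly across heads; the function name `head_stats` and the sizes `d_model=512` and `seq_len=128` are illustrative choices, not values from this cheat sheet.

```python
def head_stats(d_model, num_heads, seq_len):
    """Return (per-head dimension, total attention-weight entries)."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    head_dim = d_model // num_heads
    # Each head stores one seq_len x seq_len matrix of attention weights.
    attn_entries = num_heads * seq_len * seq_len
    return head_dim, attn_entries

for heads in (4, 8):
    print(heads, head_stats(d_model=512, num_heads=heads, seq_len=128))
# 4 (128, 65536)
# 8 (64, 131072)
```

Under this assumption, going from 4 to 8 heads doubles the number of attention weights to store while halving the width of each head, which is why more heads can mean more varied focus but not automatically better accuracy.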