Self-attention and multi-head attention in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems

🧠 Conceptual

intermediate

What does self-attention compute in a transformer?

In a transformer model, self-attention helps the model focus on different parts of the input sequence. What exactly does self-attention compute?

A. It applies a convolution operation over the input sequence to extract local features.

B. It computes a weighted sum of the input elements, where the weights depend on the similarity between elements.

C. It computes the average of all input tokens without considering their relationships.

D. It sorts the input tokens based on their frequency in the sequence.
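To make the mechanics behind this question concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention. It assumes the simplest possible setup, where queries, keys, and values are all the raw token vectors (a real transformer layer applies learned linear projections first). The function names `softmax` and `self_attention` and the toy `tokens` are illustrative, not part of the original exercise.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Simplified self-attention with Q = K = V = tokens (assumption:
    no learned projections).

    tokens: list of equal-length float vectors, one per position.
    Returns (outputs, weights); weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Dot-product similarity of this token with every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        # Output for this position: weighted sum of all token vectors.
        out = [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 2-d embeddings
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1, and each entry of `outputs` mixes information from every position in the sequence rather than from a fixed local window.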

❓ Predict Output

intermediate

Output shape of multi-head attention layer

Given the following PyTorch code snippet for a multi-head attention layer, what is the shape of the output tensor?

import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
query = torch.rand(seq_len, batch_size, embed_dim)
key = torch.rand(seq_len, batch_size, embed_dim)
value = torch.rand(seq_len, batch_size, embed_dim)

output, _ = mha(query, key, value)
print(output.shape)

A. (5, 2, 16)

B. (2, 16, 5)

C. (5, 16, 2)

D. (2, 5, 16)

❓ Hyperparameter

advanced

Effect of increasing number of heads in multi-head attention

What is the main effect of increasing the number of heads in a multi-head attention mechanism while keeping the total embedding dimension fixed?

A. The attention weights become uniform across all tokens.

B. The total embedding dimension increases, making the model slower but more accurate.

C. The model ignores positional information and treats all tokens equally.

D. Each head has a smaller dimension, allowing the model to focus on different representation subspaces.
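The dimension bookkeeping behind this question can be sketched in a few lines of plain Python. The helper `split_heads` below is hypothetical and shows only how a fixed-size vector is divided among heads. In an actual multi-head attention layer the split is applied to learned query, key, and value projections, not to the raw embedding.

```python
def split_heads(vector, num_heads):
    """Split one embedding vector into num_heads equal-sized chunks."""
    embed_dim = len(vector)
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

embedding = list(range(16))  # stand-in for a 16-dimensional embedding
for num_heads in (1, 2, 4, 8):
    heads = split_heads(embedding, num_heads)
    # Total width is unchanged; only the per-head slice changes.
    print(num_heads, "heads ->", len(heads[0]), "dims per head")
```

With `embed_dim` held at 16, the per-head width is `embed_dim // num_heads`, and concatenating the chunks always gives back a 16-dimensional vector.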

❓ Metrics

advanced

Interpreting attention weights in self-attention

In a trained transformer model, you extract the attention weights from a self-attention layer. What does a high attention weight between two tokens indicate?

A. The tokens appear consecutively in the input sequence.

B. The two tokens are identical in the input sequence.

C. The model considers the two tokens highly related or important to each other for the current task.

D. The tokens have the same part of speech tag.
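As a rough illustration of where an attention weight comes from, the sketch below computes one row of weights from a query vector and a set of key vectors. The vectors are made-up stand-ins for learned projections, and `attention_weights` is a hypothetical helper, not an API of any library.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical projected vectors for a 4-token sequence.
query = [2.0, 0.0]
keys = [[0.1, 1.0], [0.0, -1.0], [1.5, 0.2], [0.2, 0.3]]

weights = attention_weights(query, keys)
# Index of the key position that receives the largest weight.
top = max(range(len(weights)), key=weights.__getitem__)
```

The weight for each position is driven by how well its key vector aligns with the query vector. Nothing in the computation looks at token identity, adjacency, or linguistic tags directly.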

🔧 Debug

expert

Identifying error in custom multi-head attention implementation

Consider this simplified custom multi-head attention code snippet. What error will it raise when run?

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_linear = nn.Linear(embed_dim, embed_dim)
        self.k_linear = nn.Linear(embed_dim, embed_dim)
        self.v_linear = nn.Linear(embed_dim, embed_dim)
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.size()
        Q = self.q_linear(x)
        K = self.k_linear(x)
        V = self.v_linear(x)

        # reshape for multi-head
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim)

        # transpose to get dimensions batch_size, num_heads, seq_len, head_dim
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)

        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)

        # concatenate heads
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)
        out = self.out_linear(out)
        return out

x = torch.rand(2, 5, 16)
model = SimpleMultiHeadAttention(embed_dim=16, num_heads=4)
output = model(x)

A. No error; the code runs successfully and the output shape is (2, 5, 16)

B. AttributeError because the 'SimpleMultiHeadAttention' object has no attribute 'out_linear'

C. RuntimeError due to a shape mismatch in torch.matmul(Q, K.transpose(-2, -1))

D. RuntimeError due to an incorrect batch size in the input tensor

Practice

(1/5)

1. What is the main purpose of self-attention in natural language processing?

easy

A. To reduce the size of the input data by removing words

B. To generate random sentences without context

C. To translate text from one language to another

D. To let the model focus on important words by comparing all words to each other

Solution

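Step 1: Understand self-attention's role

Self-attention compares every word in the input with every other word and uses those comparisons to decide how much each word contributes to another word's representation.

Step 2: Match purpose with options

Options A, B, and C describe data reduction, random generation, and translation, none of which is the purpose of the mechanism itself. Option D describes comparing all words to each other so the model can focus on the important ones.

Final Answer:

D. To let the model focus on important words by comparing all words to each other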

Quick Check:

Solution

Step 1: Recall multi-head attention definition

Step 2: Compare options to definition

Final Answer:

Quick Check:

Solution

Step 1: Extract the second row scores

Step 2: Apply softmax to these scores

Final Answer:

Quick Check:

Solution

Step 1: Analyze softmax calculation

Step 2: Check output aggregation

Final Answer:

Quick Check:

Solution

Step 1: Understand effect of increasing attention heads

Step 2: Consider computational cost and accuracy

Final Answer:

Quick Check: