
Why transformers revolutionized NLP - Challenge Your Understanding

Challenge - 5 Problems
🧠 Conceptual
intermediate
Key innovation of transformers in NLP
Which feature of transformers most directly allows them to handle long-range dependencies in text better than previous models?
A. Convolutional layers that capture local patterns in fixed windows
B. Use of recurrent connections to process sequences step-by-step
C. Predefined fixed-length context windows for input sequences
D. Self-attention mechanism that weighs all words in a sentence simultaneously
💡 Hint
Think about how the model can look at all words at once instead of one by one.
Model Choice
intermediate
Choosing a model architecture for NLP tasks
You want to build a model that understands context in long documents for summarization. Which model architecture is best suited?
A. Recurrent Neural Network (RNN) with LSTM cells
B. Transformer with self-attention layers
C. Convolutional Neural Network (CNN) with small kernels
D. Simple feedforward neural network
💡 Hint
Consider which model can capture relationships across long text spans.
Metrics
advanced
Evaluating transformer model performance
After training a transformer for language translation, which metric best measures how well the model's output matches human translations?
A. BLEU score comparing generated and reference sentences
B. Accuracy of predicted next word
C. Mean squared error between word embeddings
D. Confusion matrix of predicted classes
💡 Hint
Think about a metric designed for comparing sentences in translation tasks.
🔧 Debug
advanced
Identifying a common transformer training issue
You trained a transformer model but notice the training loss does not decrease and stays very high. Which issue is most likely causing this?
A. Using a batch size that is too large, causing slow convergence
B. Not using dropout layers, causing overfitting
C. Using a learning rate that is too high, causing unstable updates
D. Applying layer normalization after the output layer
💡 Hint
Consider what happens if the model weights update too aggressively.
Predict Output
expert
Output shape of transformer attention scores
Given the following PyTorch code snippet for a transformer attention layer, what is the shape of the 'attention_scores' tensor?
import torch
batch_size = 2
seq_len = 5
embed_dim = 16
num_heads = 4

# Query, Key tensors
Q = torch.rand(batch_size, num_heads, seq_len, embed_dim // num_heads)
K = torch.rand(batch_size, num_heads, seq_len, embed_dim // num_heads)

# Compute attention scores
attention_scores = torch.matmul(Q, K.transpose(-2, -1))
A. torch.Size([2, 4, 5, 5])
B. torch.Size([2, 4, 16, 16])
C. torch.Size([2, 16, 5, 5])
D. torch.Size([2, 5, 4, 4])
💡 Hint
Recall that attention scores come from multiplying the queries by the transposed keys, which contracts the per-head dimension (embed_dim // num_heads) and leaves one score per pair of positions.

Practice

1. Why did transformers change the way machines understand language in NLP?
easy
A. Because they use simple rules without learning
B. Because they consider the whole sentence context at once
C. Because they only look at one word at a time
D. Because they ignore word order completely

Solution

  1. Step 1: Understand traditional NLP limits

    Older models processed words one by one or in small groups, missing full sentence meaning.
  2. Step 2: Recognize transformer's key feature

    Transformers look at all words together, capturing context better.
  3. Final Answer:

    Because they consider the whole sentence context at once -> Option B
  4. Quick Check:

    Whole-sentence context = Option B [OK]
Hint: Transformers see all words together, not one by one [OK]
Common Mistakes:
  • Thinking transformers process words one at a time
  • Believing transformers ignore word order
  • Confusing transformers with rule-based systems
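To make "all words together" concrete, here is a minimal sketch of single-head self-attention in PyTorch. The sizes and the random projection matrices (W_q, W_k, W_v) are assumptions for illustration only; a trained transformer learns these projections and adds multiple heads, positional information, and further layers.

```python
import math
import torch

torch.manual_seed(0)

seq_len, embed_dim = 6, 8            # illustrative sizes, not taken from the quiz
x = torch.rand(seq_len, embed_dim)   # one sentence of 6 token embeddings

# Hypothetical projection matrices; a real model learns these during training.
W_q = torch.rand(embed_dim, embed_dim)
W_k = torch.rand(embed_dim, embed_dim)
W_v = torch.rand(embed_dim, embed_dim)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# One matrix product scores every token against every other token.
scores = Q @ K.T / math.sqrt(embed_dim)   # shape: (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)   # each row sums to 1
context = weights @ V                     # shape: (seq_len, embed_dim)

print(weights.shape)   # torch.Size([6, 6])
print(context.shape)   # torch.Size([6, 8])
```

The (seq_len, seq_len) weight matrix is the point: row i holds the weights token i places on every token in the sentence, including ones far away, and all rows are computed in the same step.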
2. Which of the following is the correct way to describe the transformer's attention mechanism?
easy
A. It randomly selects words to ignore
B. It translates words without looking at context
C. It focuses on important words by assigning weights to them
D. It removes all punctuation before processing

Solution

  1. Step 1: Recall attention purpose

    Attention helps the model decide which words matter more in a sentence.
  2. Step 2: Match description to attention

    Assigning weights to words matches how attention works.
  3. Final Answer:

    It focuses on important words by assigning weights to them -> Option C
  4. Quick Check:

    Attention = weighted focus [OK]
Hint: Attention means weighting important words higher [OK]
Common Mistakes:
  • Thinking attention ignores words randomly
  • Believing attention removes punctuation
  • Confusing attention with translation
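The "assigning weights" step can be sketched for a single query word as follows. The scores and value vectors are made-up numbers chosen for illustration; in a real model they come from learned query, key, and value projections.

```python
import torch

# Hypothetical relevance scores of one query word against three words.
scores = torch.tensor([2.0, 0.5, 0.1])

# Softmax turns raw scores into non-negative weights that sum to 1.
weights = torch.softmax(scores, dim=0)

# Hypothetical 2-dimensional value vectors for the same three words.
values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])

# The output is a weighted average of the value vectors.
output = weights @ values

print(weights)
print(output)
```

The word with the highest score receives the largest weight, so its value vector contributes most to the output. No word is dropped outright; low-scoring words simply contribute less.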
3. Given this simplified transformer attention code snippet, what will be the output shape if input has shape (batch_size=2, seq_len=3, embed_dim=4)?
import torch
from torch.nn import MultiheadAttention

input_tensor = torch.rand(3, 2, 4)  # seq_len, batch_size, embed_dim
attention = MultiheadAttention(embed_dim=4, num_heads=2)
output, _ = attention(input_tensor, input_tensor, input_tensor)
print(output.shape)
medium
A. torch.Size([3, 2, 4])
B. torch.Size([2, 3, 4])
C. torch.Size([3, 4, 2])
D. torch.Size([2, 4, 3])

Solution

  1. Step 1: Understand input shape format

    Input shape is (seq_len=3, batch_size=2, embed_dim=4), the default layout for PyTorch MultiheadAttention (batch_first=False).
  2. Step 2: Check output shape from attention

    Output shape matches input shape: (seq_len, batch_size, embed_dim) = (3, 2, 4).
  3. Final Answer:

    torch.Size([3, 2, 4]) -> Option A
  4. Quick Check:

    Output shape = input shape [OK]
Hint: Output shape matches input shape in PyTorch attention [OK]
Common Mistakes:
  • Mixing batch and sequence dimensions
  • Assuming output shape changes embed dimension
  • Confusing PyTorch's default sequence-first layout with batch-first
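The layout question can be checked directly. The sketch below runs the same module in its default sequence-first layout and with batch_first=True, using the sizes from the question; the variable names are illustrative.

```python
import torch
from torch.nn import MultiheadAttention

seq_len, batch_size, embed_dim = 3, 2, 4

# Default layout: (seq_len, batch_size, embed_dim)
seq_first = MultiheadAttention(embed_dim=embed_dim, num_heads=2)
x = torch.rand(seq_len, batch_size, embed_dim)
out, attn = seq_first(x, x, x)
print(out.shape)    # torch.Size([3, 2, 4])
print(attn.shape)   # torch.Size([2, 3, 3]), weights averaged over heads

# batch_first=True layout: (batch_size, seq_len, embed_dim)
batch_first = MultiheadAttention(embed_dim=embed_dim, num_heads=2, batch_first=True)
y = torch.rand(batch_size, seq_len, embed_dim)
out_bf, _ = batch_first(y, y, y)
print(out_bf.shape)  # torch.Size([2, 3, 4])
```

In both cases the output has the same shape as the query. Only the order of the batch and sequence axes differs, which is why mixing the two layouts is an easy mistake to make.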
4. This code tries to create a transformer model but throws an error. What is the mistake?
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
output = model("Hello world")
medium
A. The string input should be a list, not a string
B. BertModel cannot be imported from transformers
C. The model must be trained before use
D. BertModel requires tokenized input, not raw text

Solution

  1. Step 1: Check input type for BertModel

    BertModel expects token IDs (numbers), not raw text strings.
  2. Step 2: Identify correct input preparation

    Text must be tokenized using a tokenizer before passing to the model.
  3. Final Answer:

    BertModel requires tokenized input, not raw text -> Option D
  4. Quick Check:

    Tokenize text before model input [OK]
Hint: Always tokenize text before feeding to transformer models [OK]
Common Mistakes:
  • Passing raw strings directly to model
  • Assuming model auto-tokenizes input
  • Ignoring need for attention masks
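To show why a model needs token IDs rather than a string, here is a toy sketch in plain PyTorch. The vocabulary and the toy_tokenize helper are hypothetical stand-ins; a real BERT pipeline uses the pretrained tokenizer that ships with the model, which splits text into subword pieces and also produces an attention mask.

```python
import torch
from torch import nn

# Hypothetical toy vocabulary; real tokenizers use learned subword vocabularies.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

def toy_tokenize(text: str) -> torch.Tensor:
    """Map whitespace-separated words to integer IDs and add a batch dimension."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]
    return torch.tensor([ids])  # shape: (batch_size=1, seq_len)

# An embedding layer looks up vectors by integer index, so it cannot take a string.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

input_ids = toy_tokenize("Hello world")
print(input_ids)                    # tensor([[1, 2]])
print(embedding(input_ids).shape)   # torch.Size([1, 2, 4])
```

The first layer of a transformer is an index lookup like this one, so text has to be converted to integer IDs before the model can process it.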
5. You want to build a chatbot using transformers that can understand long conversations. Which feature of transformers helps handle long context better than older models?
hard
A. Self-attention mechanism that relates all words in the input
B. Using fixed-size windows to read text piece by piece
C. Ignoring previous sentences to focus on current input
D. Replacing words with fixed dictionaries without learning

Solution

  1. Step 1: Understand chatbot context needs

    Chatbots must remember and relate words across long conversations.
  2. Step 2: Identify transformer feature for long context

    Self-attention lets the model connect all words, even far apart, in one pass.
  3. Final Answer:

    Self-attention mechanism that relates all words in the input -> Option A
  4. Quick Check:

    Self-attention = long context handling [OK]
Hint: Self-attention links all words for long context [OK]
Common Mistakes:
  • Thinking transformers read text in small fixed windows
  • Believing transformers ignore previous sentences
  • Confusing dictionary lookup with learning