Prompt Engineering / GenAI · ~20 mins

Transformer architecture overview in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Transformer Mastery Badge
Answer all five challenges correctly to earn this badge.
🧠 Conceptual
intermediate
What is the main purpose of the self-attention mechanism in a Transformer?

The Transformer model uses a self-attention mechanism. What does this mechanism mainly do?

A. It helps the model focus on different parts of the input sequence to understand context.
B. It reduces the size of the input data by compressing it into a smaller vector.
C. It generates random noise to improve model robustness.
D. It sorts the input tokens in order of importance before processing.
💡 Hint

Think about how the model understands relationships between words in a sentence.
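The hint above can be made concrete with a minimal NumPy sketch of scaled dot-product self-attention. The token count and embedding size here are illustrative, and real Transformers add learned query/key/value projections; this only shows how every token's output mixes in context from every other token.

```python
import numpy as np

def self_attention(X):
    """Each row (token) of X attends to every row of X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X, weights                      # context-mixed outputs

X = np.random.rand(5, 8)          # 5 tokens, 8-dim embeddings (illustrative)
out, w = self_attention(X)
print(out.shape)                  # (5, 8): one context-aware vector per token
print(w.sum(axis=-1))             # each token's attention weights sum to 1
```

Each output row is a weighted blend of all token embeddings, which is exactly the "focus on different parts of the input" behaviour the question asks about.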

Predict Output
intermediate
Output shape after multi-head attention layer

Given an input tensor of shape (batch_size=2, seq_len=5, embedding_dim=64) passed through a multi-head attention layer with 8 heads and output dimension 64, what is the shape of the output tensor?

import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embedding_dim = 64
num_heads = 8

x = torch.rand(batch_size, seq_len, embedding_dim)
mha = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads, batch_first=True)
output, _ = mha(x, x, x)
print(output.shape)
A. (2, 8, 8)
B. (5, 2, 64)
C. (2, 5, 512)
D. (2, 5, 64)
💡 Hint

Remember the output shape matches the input sequence length and embedding dimension.
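As a shape sanity check, here is a hedged NumPy sketch of the split-into-heads and merge-back bookkeeping that multi-head attention performs (the per-head attention computation itself is elided); the dimensions mirror the torch snippet above:

```python
import numpy as np

def multi_head_shapes(batch, seq, embed, heads):
    """Trace tensor shapes through head splitting and merging."""
    assert embed % heads == 0
    head_dim = embed // heads                 # 64 / 8 = 8 dims per head
    x = np.random.rand(batch, seq, embed)
    # split the embedding into heads: (batch, heads, seq, head_dim)
    split = x.reshape(batch, seq, heads, head_dim).transpose(0, 2, 1, 3)
    # ... attention runs independently per head on (seq, head_dim) ...
    # concatenate heads back: (batch, seq, embed)
    merged = split.transpose(0, 2, 1, 3).reshape(batch, seq, embed)
    return merged.shape

print(multi_head_shapes(2, 5, 64, 8))  # (2, 5, 64)
```

Splitting and re-concatenating the heads leaves the sequence length and embedding dimension unchanged, which is why the layer's output shape matches its input shape.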

Hyperparameter
advanced
Choosing the number of attention heads in a Transformer

Why might increasing the number of attention heads in a Transformer model improve performance?

A. Because more heads increase the embedding dimension automatically without extra computation.
B. Because more heads reduce the total number of parameters, making training faster.
C. Because more heads allow the model to attend to information from different representation subspaces at different positions.
D. Because more heads guarantee the model will not overfit on training data.
💡 Hint

Think about how multiple heads help the model see different aspects of the input.

Metrics
advanced
Interpreting Transformer training loss curves

During training of a Transformer model, the training loss decreases steadily but the validation loss starts increasing after some epochs. What does this indicate?

A. The model is underfitting and needs more training epochs.
B. The model is overfitting the training data and not generalizing well to new data.
C. The learning rate is too low and should be increased.
D. The batch size is too large causing unstable training.
💡 Hint

Think about what it means when validation loss worsens but training loss improves.
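A common response to the pattern in this question is early stopping: halt training once validation loss stops improving. A minimal sketch in plain Python, with made-up loss values for illustration:

```python
def early_stop(val_losses, patience=2):
    """Return the epoch at which to stop: validation loss has not
    improved for `patience` consecutive epochs (an overfitting signal)."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0       # new best: reset the counter
        else:
            waited += 1                  # no improvement this epoch
            if waited >= patience:
                return epoch
    return None                          # still improving: keep training

# Training loss keeps falling, but validation loss turns upward:
val = [2.1, 1.7, 1.5, 1.6, 1.8, 2.0]
print(early_stop(val))  # 4
```

The turning point in the validation curve, not the training curve, is what marks the onset of overfitting.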

🔧 Debug
expert
Identifying cause of NaN values in Transformer training

While training a Transformer model, the loss suddenly becomes NaN after a few epochs. Which of the following is the most likely cause?

A. The learning rate is too high, causing unstable gradients and exploding values.
B. The batch size is too small, causing insufficient gradient updates.
C. The model has too few layers, limiting its capacity.
D. The input data is normalized, which causes NaN values.
💡 Hint

Consider what can cause gradients or loss to become infinite or undefined.
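Beyond lowering the learning rate, typical defenses are to clip large gradients and refuse non-finite ones. A scalar toy sketch of those two guards (not real training code; in practice one would use framework utilities such as gradient-norm clipping):

```python
import math

def clipped_update(param, grad, lr, max_norm=1.0):
    """Skip NaN/inf gradients and clip large ones before applying."""
    if not math.isfinite(grad):
        return param                              # refuse the poisoned update
    grad = max(-max_norm, min(max_norm, grad))    # clip to [-max_norm, max_norm]
    return param - lr * grad

p = 0.5
p = clipped_update(p, float("nan"), lr=0.1)   # NaN gradient: param unchanged
print(p)                                      # 0.5
p = clipped_update(p, 50.0, lr=0.1)           # exploding gradient clipped to 1.0
print(p)                                      # 0.4
```

Exploding gradients from a too-high learning rate propagate through updates until the loss itself becomes NaN, which is why guarding the gradients (or reducing the learning rate) addresses the root cause.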