The Transformer model uses a self-attention mechanism. What does this mechanism mainly do?
Think about how the model understands relationships between words in a sentence.
Self-attention allows the model to weigh the importance of each word relative to others, helping it understand context and meaning.
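The weighting step can be sketched with scaled dot-product attention, the mechanism used in the original Transformer. This is a minimal illustration; the tensor sizes and the identity projections for queries, keys, and values are simplifications for clarity:

```python
import torch
import torch.nn.functional as F

# Toy embeddings for a 4-word sentence: (seq_len=4, d=8)
torch.manual_seed(0)
x = torch.rand(4, 8)

# In self-attention, queries, keys, and values all come from the same input
# (identity projections here for simplicity; real models use learned ones).
q, k, v = x, x, x

# Each word's query is scored against every word's key, scaled by sqrt(d).
scores = q @ k.T / (k.shape[-1] ** 0.5)   # (4, 4) pairwise scores
weights = F.softmax(scores, dim=-1)       # each row sums to 1
context = weights @ v                     # (4, 8) context-aware vectors

print(weights.shape, context.shape)       # torch.Size([4, 4]) torch.Size([4, 8])
```

Row i of `weights` tells you how much word i attends to every other word, which is exactly the "importance relative to others" described above.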
Given an input tensor of shape (batch_size=2, seq_len=5, embedding_dim=64) passed through a multi-head attention layer with 8 heads and output dimension 64, what is the shape of the output tensor?
import torch
import torch.nn as nn

batch_size = 2
seq_len = 5
embedding_dim = 64
num_heads = 8

x = torch.rand(batch_size, seq_len, embedding_dim)
mha = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads, batch_first=True)
output, _ = mha(x, x, x)
output.shape  # torch.Size([2, 5, 64])
Remember the output shape matches the input sequence length and embedding dimension.
The multi-head attention layer outputs a tensor with the same batch size, sequence length, and embedding dimension as the input.
Why might increasing the number of attention heads in a Transformer model improve performance?
Think about how multiple heads help the model see different aspects of the input.
Multiple attention heads let the model focus on different parts or features of the input simultaneously, improving learning capacity.
During training of a Transformer model, the training loss decreases steadily but the validation loss starts increasing after some epochs. What does this indicate?
Think about what it means when validation loss worsens but training loss improves.
When validation loss increases while training loss keeps decreasing, the model is fitting the training data too closely and failing to generalize to unseen data; this is the classic sign of overfitting.
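A common response is early stopping: monitor validation loss each epoch and stop (keeping the best checkpoint) once it stops improving. A minimal sketch; the loss values below are simulated for illustration, not real training output:

```python
# Early stopping: halt when validation loss has not improved for
# `patience` consecutive epochs. Simulated losses that start rising,
# mirroring the overfitting pattern described in the answer above.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]

patience = 2
best_loss = float("inf")
best_epoch = 0
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch   # would also save a checkpoint here
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping at epoch {epoch}; best was epoch {best_epoch} (loss {best_loss})")
            break
```

In a real training loop, the checkpoint saved at `best_epoch` is the model you keep, since later epochs only memorize the training set.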
While training a Transformer model, the loss suddenly becomes NaN after a few epochs. Which of the following is the most likely cause?
Consider what can cause gradients or loss to become infinite or undefined.
A very high learning rate can cause gradients to explode, leading to NaN loss values during training.
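Two common safeguards are lowering the learning rate and clipping the gradient norm before each optimizer step. A sketch using PyTorch's built-in clipping utility; the linear model and random input are toy placeholders, not a real Transformer:

```python
import torch
import torch.nn as nn

# Toy stand-in for a real model and batch.
model = nn.Linear(64, 64)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.rand(2, 64)

loss = model(x).pow(2).mean()
loss.backward()

# Cap the total gradient norm at 1.0 so one bad batch cannot blow up
# the update and drive the loss to NaN.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()

# After clipping, the overall gradient norm is at most max_norm.
total_norm = sum(p.grad.norm() ** 2 for p in model.parameters()) ** 0.5
print(float(total_norm) <= 1.0 + 1e-6)  # True
```

Clipping is applied between `backward()` and `step()`; combined with a smaller learning rate (or warmup), it usually eliminates NaN losses caused by exploding gradients.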