Complete the code to create the input embedding layer for a Transformer model.
embedding_layer = nn.Embedding(num_tokens, [1])

The embedding layer converts token indices into vectors of size embedding_dim, which is the input size for the Transformer.
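A minimal sketch of the filled-in answer, assuming the blank [1] is embedding_dim and using hypothetical sizes (num_tokens=1000, embedding_dim=512) for illustration:

```python
import torch
import torch.nn as nn

num_tokens = 1000      # hypothetical vocabulary size
embedding_dim = 512    # hypothetical model dimension

# Blank [1] is filled with embedding_dim
embedding_layer = nn.Embedding(num_tokens, embedding_dim)

token_ids = torch.tensor([[1, 5, 42]])      # shape: (batch, seq_len)
embedded = embedding_layer(token_ids)       # shape: (batch, seq_len, embedding_dim)
```

Each integer token index is mapped to a learned vector of length embedding_dim, so the output gains one trailing dimension.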
Complete the code to apply multi-head attention in the Transformer encoder block.
attention_output, _ = multihead_attn(query, key, value, [1]=key_padding_mask)

The key_padding_mask tells the attention layer which tokens to ignore (padding tokens) during computation.
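A runnable sketch with the blank [1] filled as the keyword argument name key_padding_mask; the dimensions and the self-attention setup (query = key = value = x) are illustrative assumptions:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

batch, seq_len = 2, 4
x = torch.randn(batch, seq_len, embed_dim)

# True marks positions to ignore; here the trailing tokens are padding
key_padding_mask = torch.tensor([[False, False, False, True],
                                 [False, False, True,  True]])

# Blank [1] is the keyword argument name: key_padding_mask
attention_output, _ = multihead_attn(x, x, x, key_padding_mask=key_padding_mask)
```

Note that nn.MultiheadAttention returns a (output, attention_weights) tuple, which is why the exercise unpacks the second element into `_`.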
Fix the error in the Transformer feed-forward network layer by filling in the missing activation function.
ffn_output = linear2([1](linear1(x)))

The feed-forward network uses ReLU activation to add non-linearity between two linear layers.
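A sketch of the completed feed-forward block, assuming blank [1] is F.relu and the usual Transformer sizes (inner dimension 2048 for a 512-dimensional model) purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, ffn_dim = 512, 2048          # hypothetical dimensions
linear1 = nn.Linear(embed_dim, ffn_dim)  # expand
linear2 = nn.Linear(ffn_dim, embed_dim)  # project back

x = torch.randn(2, 4, embed_dim)

# Blank [1] is the activation: F.relu (torch.relu also works)
ffn_output = linear2(F.relu(linear1(x)))
```

Without the ReLU between the two linear layers, the composition would collapse into a single linear map and add no representational power.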
Fill both blanks to create a positional encoding function that adds position info to token embeddings.
positional_encoding = torch.zeros(seq_len, [1])
for pos in range(seq_len):
    for i in range(0, [2], 2):
        positional_encoding[pos, i] = math.sin(pos / (10000 ** (i / [2])))
The positional encoding matrix has shape (sequence length, embedding dimension). The formula uses embedding_dim to scale positions.
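A runnable sketch with both blanks filled as embedding_dim, using small illustrative sizes. The exercise shows only the sine term for even indices; the standard formulation (not shown in the exercise) fills the odd indices with the companion cosine, which is included here as an assumption:

```python
import math
import torch

seq_len, embedding_dim = 10, 16   # hypothetical sizes

# Blanks [1] and [2] are both embedding_dim
positional_encoding = torch.zeros(seq_len, embedding_dim)
for pos in range(seq_len):
    for i in range(0, embedding_dim, 2):
        angle = pos / (10000 ** (i / embedding_dim))
        positional_encoding[pos, i] = math.sin(angle)
        # Companion cosine for odd indices (standard formulation; an
        # assumption beyond the exercise, which shows only the sine term)
        positional_encoding[pos, i + 1] = math.cos(angle)
```

At pos = 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check on the matrix.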
Fill all three blanks to complete the Transformer encoder layer with normalization and residual connections.
x = x + [1](multihead_attn(x, x, x))
x = [2](x)
residual = x
x = x + [3](feed_forward(x))
Dropout is applied to the attention and feed-forward outputs for regularization. Layer normalization is then applied after each residual addition (the post-norm arrangement) to stabilize training.
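A sketch of the completed encoder layer, assuming blanks [1] and [3] are dropout and blank [2] is a layer norm. The module sizes and dropout rate are illustrative, and a second norm after the feed-forward residual is included to mirror the standard post-norm layer:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, ffn_dim = 512, 8, 2048   # hypothetical sizes
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
feed_forward = nn.Sequential(
    nn.Linear(embed_dim, ffn_dim),
    nn.ReLU(),
    nn.Linear(ffn_dim, embed_dim),
)
dropout = nn.Dropout(0.1)
norm1 = nn.LayerNorm(embed_dim)
norm2 = nn.LayerNorm(embed_dim)

x = torch.randn(2, 4, embed_dim)

# Blanks [1] and [3] are dropout; blank [2] is layer normalization.
# nn.MultiheadAttention returns (output, weights), so unpack the output first.
attn_out, _ = multihead_attn(x, x, x)
x = x + dropout(attn_out)   # residual connection around attention
x = norm1(x)
x = x + dropout(feed_forward(x))   # residual connection around the FFN
x = norm2(x)
```

Each sublayer's output is dropped out, added back to its input, and then normalized, so the tensor shape is unchanged end to end.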