PyTorch · ML · ~5 mins

Transformer encoder in PyTorch - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main purpose of a Transformer encoder in machine learning?
A Transformer encoder processes input data by capturing relationships between all parts of the input simultaneously, helping models understand context and meaning without relying on sequence order alone.
beginner
What is 'self-attention' in the context of a Transformer encoder?
Self-attention is a mechanism where the model looks at all parts of the input to decide which parts are important to focus on when encoding each word or token.
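A minimal sketch of the idea above: scaled dot-product self-attention computed by hand on a toy sequence. (In a real encoder, Q, K, and V come from learned linear projections; here we reuse the input directly to keep the example small.)

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)  # one sequence: 4 tokens, embedding size 8

# Assumption for brevity: Q = K = V = x (a real layer uses learned projections).
q, k, v = x, x, x

d_k = q.size(-1)
scores = q @ k.T / d_k ** 0.5        # (4, 4) pairwise token similarities
weights = F.softmax(scores, dim=-1)  # each row sums to 1: "how much to attend"
out = weights @ v                    # each token becomes a weighted mix of all tokens

print(out.shape)  # torch.Size([4, 8])
```

Each output row mixes information from every input token, which is why self-attention captures context regardless of distance in the sequence.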
intermediate
Name the two main sub-layers inside a Transformer encoder block.
The two main sub-layers are: 1) Multi-head self-attention layer, 2) Position-wise feed-forward neural network.
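The two sub-layers can be sketched as a hand-rolled encoder block (a hypothetical `MiniEncoderBlock`, not PyTorch's built-in one), with the residual connections and layer norms that wrap each sub-layer:

```python
import torch
import torch.nn as nn

class MiniEncoderBlock(nn.Module):
    """Illustrative encoder block: attention sub-layer + feed-forward sub-layer."""
    def __init__(self, d_model=16, nhead=2, dim_ff=32):
        super().__init__()
        # Sub-layer 1: multi-head self-attention
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Sub-layer 2: position-wise feed-forward network
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)  # tokens attend to each other
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ff(x))    # residual connection + layer norm
        return x

x = torch.randn(2, 5, 16)  # (batch, seq_len, d_model)
y = MiniEncoderBlock()(x)
print(y.shape)  # torch.Size([2, 5, 16])
```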
intermediate
Why do Transformer encoders use 'positional encoding'?
Because Transformers process all input tokens at once, positional encoding adds information about the order of tokens so the model knows the sequence position of each token.
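A sketch of the classic sinusoidal positional encoding (sine on even dimensions, cosine on odd ones), which is added to the token embeddings before the first encoder layer:

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine/cosine positional encoding; each position gets a unique pattern."""
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()            # even dimension indices
    angle = pos / torch.pow(10000.0, i / d_model)      # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                     # even dims: sine
    pe[:, 1::2] = torch.cos(angle)                     # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(10, 16)
print(pe.shape)  # torch.Size([10, 16])
# Usage: x = token_embeddings + pe  (broadcast over the batch dimension)
```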
beginner
In PyTorch, which class can be used to create a Transformer encoder layer?
You can use torch.nn.TransformerEncoderLayer to create a single encoder layer, and torch.nn.TransformerEncoder to stack multiple layers.
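Putting the two classes together, a minimal sketch of building a stacked encoder in PyTorch (hyperparameters here are arbitrary example values):

```python
import torch
import torch.nn as nn

# One encoder layer; batch_first=True means inputs are (batch, seq, d_model).
layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, batch_first=True
)
# Stack 3 identical layers into a full encoder.
encoder = nn.TransformerEncoder(layer, num_layers=3)

x = torch.randn(8, 10, 32)  # (batch, seq_len, d_model)
out = encoder(x)
print(out.shape)  # torch.Size([8, 10, 32])
```

Note that the encoder preserves the input shape: it re-represents each token in context rather than changing the sequence length.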
What does the self-attention mechanism in a Transformer encoder help the model do?
A) Reduce the size of the input data
B) Focus on important parts of the input sequence
C) Sort the input tokens alphabetically
D) Generate output tokens directly
Answer: B
Which component adds information about token order in a Transformer encoder?
A) Positional encoding
B) Layer normalization
C) Feed-forward network
D) Multi-head attention
Answer: A
What is the role of the feed-forward network inside a Transformer encoder layer?
A) To apply a simple neural network to each position independently
B) To add positional information
C) To combine outputs from multiple attention heads
D) To normalize the input data
Answer: A
In PyTorch, which class stacks multiple Transformer encoder layers?
A) torch.nn.TransformerEncoderLayer
B) torch.nn.MultiheadAttention
C) torch.nn.TransformerEncoder
D) torch.nn.Linear
Answer: C
Why do Transformer encoders process all tokens simultaneously instead of one by one?
A) Because they cannot handle sequences
B) To generate output faster
C) To sort tokens before processing
D) To reduce training time and capture global context
Answer: D
Explain how self-attention works inside a Transformer encoder and why it is important.
Hint: Think about how the model decides what parts of the input to focus on.
Describe the main components of a Transformer encoder layer and their roles.
Hint: Consider the flow of data through the encoder block.