Transformers are used in many tasks like translation, text classification, and question answering. Why do we need different transformer models for these tasks?
Think about how the output of a model changes depending on the task.
Different tasks require different outputs and different ways of interpreting the input. For example, translation must generate sentences, while classification must assign a label. So transformer models are adapted with task-specific layers or heads to fit the task.
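A minimal sketch of this idea, using hypothetical dimensions: the same encoder output can feed different task-specific heads, and each head produces a differently shaped output.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only.
batch_size, seq_len, hidden = 2, 10, 16
num_labels, vocab_size = 2, 100

# Stand-in for a shared transformer encoder output.
encoder_output = torch.randn(batch_size, seq_len, hidden)

# Classification head: one label score per sequence (uses the first token's state).
cls_head = nn.Linear(hidden, num_labels)
cls_logits = cls_head(encoder_output[:, 0, :])   # shape (2, 2)

# Generation head: one vocabulary distribution per token position.
lm_head = nn.Linear(hidden, vocab_size)
lm_logits = lm_head(encoder_output)              # shape (2, 10, 100)

print(cls_logits.shape, lm_logits.shape)
```

The encoder is shared; only the head changes with the task, which is why the same pretrained model can be adapted to classification, generation, or span prediction.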
You want to build a model to classify movie reviews as positive or negative. Which transformer model is best suited for this task?
Think about which model is designed to understand sentence meaning and output labels.
BERT is designed to understand sentence meaning and can be fine-tuned with a classification head to output sentiment labels. GPT-3 without fine-tuning is a text generator, and translation models focus on converting between languages, not on classification.
Consider a transformer model fine-tuned for question answering. The input is a batch of 2 sequences, each with 10 tokens. The model outputs start and end logits for answer spans. What is the shape of the output logits?
import torch

batch_size = 2
seq_len = 10
start_logits = torch.randn(batch_size, seq_len)
end_logits = torch.randn(batch_size, seq_len)
print(start_logits.shape, end_logits.shape)
Think about how logits correspond to tokens in each sequence for each batch item.
For question answering, the model outputs two logits per token position: one scoring that position as the start of the answer span and one scoring it as the end. Since the batch size is 2 and the sequence length is 10, both the start and end logits have shape (2, 10).
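To turn those logits into a predicted span, one common approach is to take the argmax over token positions independently for the start and end logits, giving one index of each per batch item. A sketch:

```python
import torch

torch.manual_seed(0)
batch_size, seq_len = 2, 10
start_logits = torch.randn(batch_size, seq_len)
end_logits = torch.randn(batch_size, seq_len)

# Highest-scoring start and end positions, one per batch item.
start_idx = start_logits.argmax(dim=-1)  # shape (2,)
end_idx = end_logits.argmax(dim=-1)      # shape (2,)
print(start_idx, end_idx)
```

Real pipelines add constraints (e.g. requiring start <= end), but the shape logic is the same.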
What is the main effect of increasing the number of attention heads in a transformer model?
Think about what multiple attention heads do in the transformer.
Multiple attention heads let the model attend to different parts of the input in parallel, capturing different kinds of relationships and improving understanding. Increasing the head count does not reduce model size or change the output type.
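This can be checked directly with PyTorch's built-in multi-head attention: the head count changes how attention is computed internally (the embedding is split across heads), but the output shape stays the same. A small sketch with toy dimensions:

```python
import torch
import torch.nn as nn

# embed_dim must be divisible by num_heads.
seq_len, batch_size, embed_dim = 10, 2, 16
x = torch.randn(seq_len, batch_size, embed_dim)

for num_heads in (1, 4, 8):
    attn = nn.MultiheadAttention(embed_dim, num_heads)
    out, weights = attn(x, x, x)  # self-attention: query = key = value = x
    print(num_heads, out.shape)   # same output shape for every head count
```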
You fine-tuned a transformer for text classification, but the model always outputs zeros for predictions. What is the most likely cause?
import torch
from transformers import BertForSequenceClassification, BertTokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer('Hello world', return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits)
Think about what happens if you use a pretrained model without fine-tuning for classification.
When BertForSequenceClassification is loaded from the pretrained base model, its classification head is randomly initialized and has never been trained, so its logits are small, near-zero values that carry no signal and produce essentially arbitrary predictions. The tokenizer and architecture are correct, and the input length does not cause zero outputs.
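The failure mode can be illustrated without downloading a model, using a stand-in for BERT's pooled output (the hidden size 768 and two labels are hypothetical here): a freshly initialized linear head yields small, meaningless logits until it is fine-tuned.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for BERT's pooled [CLS] state and an untrained classification head.
hidden, num_labels = 768, 2
pooled_output = torch.randn(1, hidden)
untrained_head = nn.Linear(hidden, num_labels)  # random, never fine-tuned

logits = untrained_head(pooled_output)
print(logits)  # small values near zero; the argmax is essentially arbitrary
```

Fine-tuning on labeled sentiment data is what gives the head weights that separate the classes.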