Which of the following best describes the main difference in the pretraining objectives between RoBERTa and DistilBERT?
Think about how RoBERTa improved BERT's training and how DistilBERT reduces model size.
RoBERTa improves on BERT by using dynamic masking and removing the next-sentence-prediction objective. DistilBERT is trained by knowledge distillation from BERT, using a loss that combines masked language modeling with terms that teach the student to mimic BERT's output distributions.
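A minimal sketch of the knowledge-distillation idea behind DistilBERT, with softmax and the distillation loss written out by hand. The logits and the 3-word vocabulary are hypothetical, and the real training objective also includes MLM and cosine-embedding terms:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution, softened by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the student's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))

# Hypothetical logits over a 3-word vocabulary
teacher = [4.0, 1.0, 0.5]
student_close = [3.8, 1.2, 0.4]   # roughly matches the teacher's distribution
student_far = [0.5, 4.0, 1.0]     # puts its mass on the wrong word

# The loss is lower when the student mimics the teacher
print(distillation_loss(teacher, student_close) < distillation_loss(teacher, student_far))  # True
```

The temperature softens both distributions so that the student also learns from the teacher's relative confidence across wrong answers, not just its top prediction.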
Given the following code snippet using Hugging Face Transformers, what is the shape of the last_hidden_state tensor?
from transformers import RobertaModel, RobertaTokenizer
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
inputs = tokenizer('Hello world!', return_tensors='pt')
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)
Count the tokens after tokenization including special tokens.
The tokenizer wraps the input in special tokens, so 'Hello world!' becomes 5 tokens: <s>, Hello, Ġworld, !, </s>. The hidden size of roberta-base is 768, so the shape is (1, 5, 768): batch size 1, sequence length 5, hidden size 768.
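The shape can be derived by hand. This sketch assumes the standard roberta-base BPE split of 'Hello world!' (the split should be confirmed with tokenizer.tokenize in practice):

```python
# Assumed roberta-base tokenization of 'Hello world!' (BPE pieces plus special tokens)
tokens = ['<s>', 'Hello', 'Ġworld', '!', '</s>']

batch_size = 1          # a single sentence was passed to the tokenizer
seq_len = len(tokens)   # 5 tokens, including <s> and </s>
hidden_size = 768       # roberta-base hidden dimension

shape = (batch_size, seq_len, hidden_size)
print(shape)  # (1, 5, 768)
```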
You want to deploy a transformer model for real-time text classification on a mobile device with limited memory and CPU. Which model is the best choice?
Consider model size and speed for mobile deployment.
DistilBERT is a smaller, faster version of BERT designed for resource-constrained environments, making it suitable for mobile devices.
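A rough size comparison using approximate parameter counts for these checkpoints (figures are approximations from the respective model cards, not exact counts):

```python
# Approximate parameter counts, in millions
params_millions = {
    'roberta-base': 125,
    'bert-base': 110,
    'distilbert-base': 66,
}

# DistilBERT's reduction relative to its BERT teacher
reduction = 1 - params_millions['distilbert-base'] / params_millions['bert-base']
print(f"DistilBERT has ~{reduction:.0%} fewer parameters than BERT-base")  # ~40%
```

The smaller parameter count translates directly into a smaller memory footprint and fewer FLOPs per inference, which is what matters on a CPU-bound mobile device.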
When fine-tuning RoBERTa on a text classification task, increasing the maximum sequence length from 128 to 512 will most likely:
Think about how sequence length affects computation in transformers.
Longer sequences require more memory and computation, since self-attention cost grows quadratically with sequence length, so training time increases; however, longer sequences can capture more context, which may improve accuracy on longer inputs.
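Rough arithmetic on the cost of raising the maximum length, using the fact that self-attention builds an n × n score matrix per head (a simplified model that ignores the linear layers, whose cost grows only linearly in n):

```python
old_len, new_len = 128, 512

# Self-attention computes an n x n attention matrix, so its cost is quadratic in n
attention_cost_ratio = (new_len / old_len) ** 2
print(attention_cost_ratio)  # 16.0 -- the attention matrices become 16x larger
```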
You fine-tune both RoBERTa-base and DistilBERT-base on the same sentiment analysis dataset. After evaluation, you get these results:
- RoBERTa-base: Accuracy=0.92, F1-score=0.91, Inference time=120ms
- DistilBERT-base: Accuracy=0.89, F1-score=0.88, Inference time=70ms
Which statement best summarizes the trade-off between these models?
Look at both accuracy and inference time values.
RoBERTa-base achieves higher accuracy and F1 but has longer inference time. DistilBERT trades some accuracy for faster inference.
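The trade-off can be quantified directly from the reported evaluation numbers:

```python
# Evaluation results from the question
roberta = {'accuracy': 0.92, 'f1': 0.91, 'latency_ms': 120}
distilbert = {'accuracy': 0.89, 'f1': 0.88, 'latency_ms': 70}

speedup = roberta['latency_ms'] / distilbert['latency_ms']
accuracy_drop = roberta['accuracy'] - distilbert['accuracy']

print(f"DistilBERT is {speedup:.2f}x faster")               # 1.71x faster
print(f"at the cost of {accuracy_drop:.2f} accuracy points") # 0.03 accuracy
```

Whether a ~1.7x latency improvement justifies a 3-point accuracy drop depends on the deployment constraints: for latency-sensitive or resource-constrained settings, DistilBERT's trade is usually worthwhile.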