You want to fine-tune a Hugging Face transformer model for a text classification task with limited labeled data. Which model is best suited to start with?
Think about models designed for classification tasks and pretrained on language data.
BERT models are pretrained with masked language modeling and are commonly fine-tuned for classification by adding a classification head on top of the encoder. GPT models are decoder-only and geared toward text generation, CNNs are mainly used for vision tasks, and word2vec produces static word embeddings rather than a transformer encoder.
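As a minimal sketch of "adding a classification head": the snippet below builds a deliberately tiny, randomly initialized BERT classifier so it runs quickly offline; all the config sizes are illustrative choices, and in practice you would load pretrained weights with from_pretrained instead.

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny, illustrative config (real fine-tuning would instead use
# BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# to get the pretrained encoder weights).
config = BertConfig(
    hidden_size=64,
    num_attention_heads=4,
    intermediate_size=128,
    num_hidden_layers=2,
    num_labels=2,  # size of the classification head
)
model = BertForSequenceClassification(config)

# The classification head is a linear layer mapping the pooled [CLS]
# representation to num_labels logits.
print(model.classifier.out_features)  # 2
```

Because num_labels is set at load time, the same call pattern adapts BERT to any number of classes.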
Consider this snippet for fine-tuning a Hugging Face model using the Trainer API:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
train_result = trainer.train()
print(train_result.metrics)
What is the expected type and content of train_result.metrics after training?
Think about what the Trainer API returns after training completes.
The Trainer's train() method returns a TrainOutput named tuple whose metrics field is a dictionary with keys such as 'train_loss', 'train_runtime', and 'train_samples_per_second'.
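To make the shape concrete, here is an illustrative stand-in for what train_result.metrics looks like; the numeric values below are invented, since the real numbers depend on your run and hardware.

```python
# Illustrative shape of train_result.metrics after trainer.train()
# (values are made up for demonstration).
metrics = {
    "train_loss": 0.6931,
    "train_runtime": 12.4,             # seconds
    "train_samples_per_second": 8.06,
}

# It is a plain dict, so it can be inspected or logged like any other.
assert isinstance(metrics, dict)
print(sorted(metrics))  # ['train_loss', 'train_runtime', 'train_samples_per_second']
```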
You are fine-tuning a pretrained transformer model on a small dataset. Which learning rate is most appropriate to avoid overfitting and unstable training?
Fine-tuning usually requires smaller learning rates than training from scratch.
Small learning rates such as 5e-5 (typically in the 1e-5 to 5e-5 range) are standard for fine-tuning transformers: they keep updates close to the pretrained weights, which stabilizes training and reduces overfitting. Very high learning rates can destroy the pretrained representations and cause training to diverge.
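The divergence effect can be seen on a toy problem. The sketch below runs gradient descent on a 1-D quadratic loss f(w) = w^2; the learning rates here are chosen for the toy loss, not for transformers, but the qualitative behavior (small steps converge, oversized steps blow up) is the same.

```python
def descend(lr, steps=20, w=1.0):
    """Gradient descent on f(w) = w**2; gradient is 2*w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return abs(w)

# A small learning rate shrinks |w| toward the minimum at 0.
print(descend(0.01))  # ~0.67 after 20 steps

# An oversized learning rate overshoots and diverges: each step
# multiplies w by (1 - 2*lr) = -2, so |w| doubles every step.
print(descend(1.5))   # ~1.0e6 after 20 steps
```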
You try to fine-tune a Hugging Face model but get this error:
RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1
What is the most likely cause?
Think about tensor size mismatches related to classification output dimensions.
This error usually means the model's final classification layer outputs a different number of classes than the labels provide, for example a head configured with num_labels=10 while the dataset has only 5 classes. Setting num_labels to match the dataset when loading the model fixes it.
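Assuming PyTorch, the mismatch can be reproduced minimally: combining a (batch, 10) logits tensor with (batch, 5) labels fails at dimension 1 with exactly this message. The tensor shapes below are illustrative.

```python
import torch

logits = torch.zeros(3, 10)  # model head configured for 10 classes
labels = torch.zeros(3, 5)   # dataset provides 5-class (one-hot) labels

try:
    logits + labels  # shapes (3, 10) and (3, 5) cannot broadcast
except RuntimeError as err:
    # "The size of tensor a (10) must match the size of tensor b (5)
    #  at non-singleton dimension 1"
    print(err)
```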
When fine-tuning a large pretrained transformer, freezing some layers can help training. Which statement best explains why freezing layers is useful?
Consider the effect of freezing on training stability and resource use.
Freezing layers keeps their weights fixed during training: no gradients or optimizer state are needed for them, which cuts memory and computation, and the smaller number of trainable parameters reduces the risk of overfitting when labeled data is limited.
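In PyTorch this is done by setting requires_grad = False on the parameters to freeze. The sketch below uses a toy two-layer stand-in for a pretrained encoder; with a real Hugging Face model you would iterate over, e.g., the encoder's lower layers instead.

```python
import torch.nn as nn

# Toy stand-in for a pretrained model: freeze the first layer,
# train only the second (the "head").
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

for param in model[0].parameters():
    param.requires_grad = False  # frozen: excluded from gradient updates

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(trainable, total)  # 18 90 — only the head's 8*2+2 params remain trainable
```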