Practice - 5 Tasks
Answer the questions below
1. Fill in the blank (easy)
Complete the code to load a pre-trained model for distillation.
NLP
from transformers import DistilBertForSequenceClassification
model = DistilBertForSequenceClassification.from_pretrained([1])
Common Mistakes:
- Using the full BERT model name instead of the distilled version.
- Choosing a model from a different architecture, such as GPT-2.
Explanation: DistilBERT is a smaller, distilled version of BERT, so the correct model name is "distilbert-base-uncased".
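For reference, a minimal sketch of the completed snippet with the blank filled in. The `num_labels=2` argument is an assumption for illustration (a binary classification task); the checkpoint is downloaded from the Hugging Face Hub on first use.

```python
from transformers import DistilBertForSequenceClassification

# The distilled checkpoint -- not "bert-base-uncased", and not a GPT-2 model.
checkpoint = "distilbert-base-uncased"

# num_labels=2 is an illustrative assumption for a binary classification head.
model = DistilBertForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.eval()
```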
2. Fill in the blank (medium)
Complete the code to apply dynamic quantization to a PyTorch model.
NLP
import torch
model = torch.quantization.quantize_dynamic(model, [1], dtype=torch.qint8)
Common Mistakes:
- Trying to quantize activation functions like ReLU.
- Using convolution layers, which are less common in NLP models.
Explanation: Dynamic quantization is commonly applied to linear layers to reduce model size and speed up inference.
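A runnable sketch with the blank filled in as `{nn.Linear}`, the set of layer types to quantize. The stack of linear layers standing in for a real NLP model is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Toy model standing in for an NLP classifier (hypothetical sizes).
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))

# Quantize only the nn.Linear layers; activations such as ReLU are left alone.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# The quantized model still runs a normal forward pass.
out = quantized(torch.randn(4, 64))
```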
3. Fill in the blank (hard)
Fix the error in the code to correctly perform knowledge distillation training.
NLP
teacher_outputs = teacher_model(input_ids)
student_outputs = student_model(input_ids)
loss = distillation_loss(student_outputs, teacher_outputs[1])
Common Mistakes:
- Using hidden states or attention outputs instead of logits.
- Trying to access labels from model outputs.
Explanation: The logits are the raw predictions needed to compute the distillation loss between teacher and student.
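A minimal sketch of the corrected pattern. Toy linear layers stand in for real teacher and student transformers, and the sizes and temperature value are assumptions for illustration; the point is that the loss is computed on logits, with the teacher frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_model = nn.Linear(16, 4)   # stand-in for a large teacher
student_model = nn.Linear(16, 4)   # stand-in for a small student

inputs = torch.randn(8, 16)
with torch.no_grad():              # the teacher is frozen during distillation
    teacher_logits = teacher_model(inputs)
student_logits = student_model(inputs)

# The loss compares logits (raw predictions), not hidden states or attentions.
T = 2.0
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)
loss.backward()                    # gradients flow only into the student
```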
4. Fill in the blanks (hard)
Fill both blanks to create a quantized model and prepare it for inference.
NLP
import torch.quantization
quantized_model = torch.quantization.quantize_dynamic(model, [1], dtype=[2])
Common Mistakes:
- Using float16, which is not a dynamic quantization dtype.
- Trying to quantize convolution layers in NLP models.
Explanation: Dynamic quantization targets Linear layers and uses the qint8 data type for quantization.
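To make the size reduction concrete, a sketch with both blanks filled ([1] as `{nn.Linear}`, [2] as `torch.qint8`) that compares serialized sizes before and after. The toy model and the `serialized_size` helper are illustrative assumptions.

```python
import io
import torch
import torch.nn as nn

# Toy model standing in for an NLP classifier (hypothetical sizes).
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2))

# Blank [1] -> {nn.Linear}; blank [2] -> torch.qint8.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(m):
    """Size in bytes of the model's saved state_dict (illustrative helper)."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

# qint8 weights take roughly a quarter of the float32 footprint.
print(serialized_size(model), serialized_size(quantized_model))
```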
5. Fill in the blanks (hard)
Fill all three blanks to define a distillation loss combining student and teacher outputs.
NLP
import torch.nn.functional as F
alpha = 0.5
T = 2.0
loss = alpha * F.kl_div(F.log_softmax(student_outputs[1] / T, dim=1),
                        F.softmax(teacher_outputs[2] / T, dim=1),
                        reduction='batchmean') * (T * T) \
       + (1 - alpha) * F.cross_entropy(student_outputs[3], labels)
Common Mistakes:
- Using hidden states instead of logits.
- Mixing up student and teacher outputs.
Explanation: The distillation loss uses logits from both student and teacher models for the KL divergence term, and student logits for the cross-entropy term with the true labels.
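The completed loss, exercised end to end with toy logits in place of real model outputs. The batch size, class count, and label tensor are assumptions for illustration; blanks [1] and [3] are student logits, blank [2] is teacher logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
student_logits = torch.randn(8, 4, requires_grad=True)  # blanks [1] and [3]
teacher_logits = torch.randn(8, 4)                      # blank [2]
labels = torch.randint(0, 4, (8,))

alpha = 0.5   # weight between soft (teacher) and hard (label) targets
T = 2.0       # temperature softens both distributions

soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits / T, dim=1),
    reduction="batchmean",
) * (T * T)                                   # T^2 restores the gradient scale
hard_loss = F.cross_entropy(student_logits, labels)
loss = alpha * soft_loss + (1 - alpha) * hard_loss
loss.backward()
```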