What is the main goal of model distillation in NLP?
Think about how a big model can help a smaller one learn.
Model distillation trains a smaller student model to imitate a larger, well-trained teacher model, retaining most of the teacher's performance while reducing model size and inference cost.
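As a minimal sketch in plain PyTorch (the logits and hyperparameter values here are illustrative, not from any particular paper's setup), the standard distillation loss blends a temperature-softened KL term against the teacher's outputs with the usual hard-label cross-entropy:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: teacher probabilities softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL term is scaled by T^2 so its gradient magnitude stays comparable
    # to the cross-entropy term as T changes.
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy example: random logits for a batch of 4 examples over 10 classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.tensor([0, 3, 7, 2])
loss = distillation_loss(student, teacher, labels)
print(loss.item())
```

Both terms are non-negative, so the combined loss is as well; alpha trades off mimicking the teacher against fitting the true labels.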
What is the output shape of the quantized model's embedding layer weights after applying 8-bit quantization?
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 64)

model = SimpleModel()

# PyTorch's dynamic quantization of embeddings is weight-only and
# requires dtype=torch.quint8 (still 8-bit).
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Embedding}, dtype=torch.quint8
)

# Quantized modules expose the weight via a method, not an attribute.
weight_shape = quantized_model.embedding.weight().shape
print(weight_shape)  # torch.Size([1000, 64])
Quantization changes data type but not tensor shape.
The quantized embedding weights keep the original shape (1000, 64). Quantization changes precision, not dimensions.
You want to distill a large BERT model into a smaller one for mobile deployment. Which student model architecture is best suited?
Student model should be similar but smaller than the teacher.
Distillation works best when the student comes from the same architecture family as the teacher but is smaller, e.g. a BERT variant with fewer layers and a smaller hidden size (as in DistilBERT). This lets the student absorb the teacher's knowledge efficiently while fitting mobile memory and latency budgets.
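For intuition, here is a sketch in plain PyTorch comparing parameter counts of a BERT-base-like encoder and a shallower, narrower student of the same family (the student's layer sizes are illustrative, not DistilBERT's actual configuration):

```python
import torch.nn as nn

def encoder(d_model, nhead, num_layers, dim_ff):
    # A plain Transformer encoder stack standing in for a BERT-style body.
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=nhead, dim_feedforward=dim_ff, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

def count_params(model):
    return sum(p.numel() for p in model.parameters())

# BERT-base-like teacher: 12 layers, hidden size 768.
teacher = encoder(d_model=768, nhead=12, num_layers=12, dim_ff=3072)
# Smaller student of the same family: 4 layers, hidden size 384.
student = encoder(d_model=384, nhead=6, num_layers=4, dim_ff=1536)

print(count_params(teacher), count_params(student))
```

Halving the width and cutting the depth to a third shrinks the encoder's parameter count by roughly an order of magnitude while keeping the same layer structure the teacher's knowledge maps onto.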
Which hyperparameter is critical to set correctly when applying post-training quantization to an NLP model?
Quantization precision depends on this setting.
The bit-width (e.g., 8-bit vs. 4-bit) sets how many discrete levels represent weights and activations; lower bit-widths shrink the model further, but quantization error grows, so it directly governs the size-accuracy trade-off.
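A hand-rolled sketch of per-tensor affine quantization (a simplified scheme, not PyTorch's internal implementation) makes the bit-width trade-off concrete: halving the bit-width multiplies the quantization step, and with it the reconstruction error:

```python
import torch

def quantize_dequantize(x, bits):
    # Per-tensor affine quantization to unsigned integers with 2**bits levels.
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = x.min()
    q = torch.round((x - zero_point) / scale).clamp(0, qmax)
    return q * scale + zero_point  # dequantize back to float

x = torch.randn(1000)
err8 = (x - quantize_dequantize(x, 8)).abs().mean()
err4 = (x - quantize_dequantize(x, 4)).abs().mean()
# The 4-bit error is typically much larger: 16x fewer levels
# over the same value range.
print(err8.item(), err4.item())
```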
After distilling a large NLP model, which metric best shows if the smaller model retained the teacher's knowledge effectively?
Think about measuring how well the student predicts compared to the teacher.
Accuracy on held-out validation data, compared against the teacher's accuracy on the same data, shows whether the student retained the teacher's knowledge; teacher-student prediction agreement is a useful complementary check.
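As a sketch with hypothetical logits (hand-picked here purely for illustration), both checks reduce to comparing argmax predictions: plain validation accuracy against the true labels, and teacher-student agreement, i.e. how often the student predicts the same class as the teacher:

```python
import torch

# Hypothetical logits on a validation batch of 6 examples, 3 classes.
teacher_logits = torch.tensor([[2.0, 0.1, 0.3],
                               [0.2, 1.5, 0.1],
                               [0.1, 0.2, 2.2],
                               [1.8, 0.3, 0.2],
                               [0.1, 2.0, 0.4],
                               [0.3, 0.1, 1.7]])
student_logits = torch.tensor([[1.5, 0.2, 0.4],
                               [0.3, 1.2, 0.2],
                               [0.2, 0.1, 1.9],
                               [0.4, 1.6, 0.3],  # student disagrees here
                               [0.2, 1.8, 0.3],
                               [0.4, 0.2, 1.4]])
labels = torch.tensor([0, 1, 2, 0, 1, 2])

student_pred = student_logits.argmax(dim=-1)
teacher_pred = teacher_logits.argmax(dim=-1)

accuracy = (student_pred == labels).float().mean()      # 5/6
agreement = (student_pred == teacher_pred).float().mean()  # 5/6
print(accuracy.item(), agreement.item())
```

High agreement with a strong teacher and validation accuracy close to the teacher's together indicate the distillation succeeded.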