RoBERTa and DistilBERT in NLP - Model Metrics & Evaluation
RoBERTa and DistilBERT are Transformer-based models for language understanding. Accuracy measures the share of predictions they get right, but on its own it can hide problems, so precision and recall are used to check whether the model finds the right answers without too many false alarms or misses. For example, in sentiment analysis, precision is the fraction of predicted-positive labels that were truly positive, and recall is the fraction of all truly positive cases that the model found.
| | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP) = 85 | False Negative (FN) = 15 |
| Actual Negative | False Positive (FP) = 10 | True Negative (TN) = 90 |
Total samples = 85 + 15 + 10 + 90 = 200
Precision = TP / (TP + FP) = 85 / (85 + 10) = 0.8947
Recall = TP / (TP + FN) = 85 / (85 + 15) = 0.85
Accuracy = (TP + TN) / Total = (85 + 90) / 200 = 0.875
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.8947 * 0.85) / (0.8947 + 0.85) ≈ 0.8718
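The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch that hard-codes the counts from the confusion matrix; the variable names are illustrative only.

```python
# Counts taken from the confusion matrix above.
tp, fn = 85, 15   # actual positives: found vs. missed
fp, tn = 10, 90   # actual negatives: false alarms vs. correct rejections

total = tp + fn + fp + tn
precision = tp / (tp + fp)            # of predicted positives, how many were right
recall = tp / (tp + fn)               # of actual positives, how many were found
accuracy = (tp + tn) / total          # share of all predictions that were right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Total     = {total}")
print(f"Precision = {precision:.4f}")
print(f"Recall    = {recall:.4f}")
print(f"Accuracy  = {accuracy:.4f}")
print(f"F1 Score  = {f1:.4f}")
```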
Imagine RoBERTa is used to detect spam emails. If it marks too many good emails as spam (low precision), users get annoyed. So, high precision is important here.
Now, if DistilBERT is used to find all harmful content in social media posts, missing any harmful post is bad (low recall). So, high recall is important.
Choosing between precision and recall depends on what is worse: false alarms or missed cases.
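One common way to shift this balance is to move the decision threshold applied to the model's positive-class score. The sketch below uses made-up scores and labels (they are not outputs of RoBERTa or DistilBERT), and the helper name precision_recall is hypothetical; it only illustrates that a stricter threshold tends to raise precision while a looser one tends to raise recall.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall) when scores >= threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 1 = positive class, scores are positive-class probabilities.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.55, 0.45, 0.35, 0.30, 0.10]

for threshold in (0.75, 0.50, 0.25):
    p, r = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

A spam filter would lean toward the stricter threshold, while a harmful-content detector would lean toward the looser one.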
Good: As a rule of thumb, precision and recall above 85% mean the model finds most correct answers and makes few mistakes. Accuracy above 85% suggests strong overall performance, provided the classes are reasonably balanced.
Bad: Precision or recall below 50% means the model misses many correct answers or makes many wrong predictions. On a balanced two-class task, accuracy near 50% means the model is doing no better than random guessing.
- Accuracy paradox: High accuracy can be misleading if classes are unbalanced. For example, if 90% of data is negative, a model always predicting negative gets 90% accuracy but is useless.
- Data leakage: If test data leaks into training, metrics look better but model fails in real use.
- Overfitting: Model performs very well on training data but poorly on new data. Watch for big gaps between training and validation metrics.
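The accuracy paradox from the first pitfall can be shown with a tiny synthetic example. This sketch assumes a made-up dataset of 100 samples with 90% negatives and a "model" that always predicts negative.

```python
# Synthetic, imbalanced labels: 90 negatives (0) and 10 positives (1).
labels = [0] * 90 + [1] * 10

# A trivial "model" that always predicts the majority class.
preds = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, preds) if y == p)
accuracy = correct / len(labels)

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
recall = tp / (tp + fn)

print(f"Accuracy = {accuracy:.2f}")  # looks strong
print(f"Recall   = {recall:.2f}")    # but no positive case is ever found
```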
Your RoBERTa model has 98% accuracy but only 12% recall on detecting fraud cases. Is it good for production? Why or why not?
Answer: No, it is not good. The model misses 88% of fraud cases (low recall), which is dangerous. High accuracy is misleading because fraud cases are rare. You need to improve recall to catch more fraud.
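To see how 98% accuracy and 12% recall can coexist, here is a sketch with hypothetical counts chosen to match those two figures: 10,000 transactions of which 200 are fraud. The counts are an assumption for illustration, not data from a real model.

```python
# Hypothetical counts consistent with 98% accuracy and 12% recall.
tp = 24      # fraud cases caught
fn = 176     # fraud cases missed
fp = 24      # legitimate transactions flagged as fraud
tn = 9776    # legitimate transactions correctly passed

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)
missed_share = fn / (tp + fn)

print(f"Accuracy  = {accuracy:.2%}")
print(f"Recall    = {recall:.2%}")
print(f"Precision = {precision:.2%}")
print(f"Fraud cases missed = {missed_share:.2%}")
```

Because only 2% of these transactions are fraud, the large number of true negatives dominates accuracy while most fraud goes undetected.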
Practice
Solution
Step 1: Understand model size and purpose
RoBERTa is a larger model designed for high accuracy in text understanding. DistilBERT is a smaller, distilled version of BERT focused on speed and efficiency.
Step 2: Compare their main characteristics
RoBERTa offers better accuracy due to its size and training, while DistilBERT sacrifices some accuracy for faster inference and a smaller footprint.
Final Answer:
RoBERTa is larger and more accurate, while DistilBERT is smaller and faster.
Quick Check:
Model size and speed difference [OK]
- Confusing which model is larger
- Thinking both models have the same speed
- Assuming DistilBERT is more accurate
Solution
Step 1: Identify correct import and method
The Hugging Face library uses from_pretrained() to load models. DistilBertModel is the correct class for the DistilBERT model.
Step 2: Check each option's correctness
The option that imports DistilBertModel and calls DistilBertModel.from_pretrained('distilbert-base-uncased') uses the right class, method, and model name. Options A and C use wrong classes or methods. The option that calls DistilBertTokenizer.from_pretrained('distilbert-base-uncased') loads a tokenizer, not a model.
Final Answer:
from transformers import DistilBertModel
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
-> Option B
Quick Check:
Correct import and method = B [OK]
- Confusing tokenizer with model loading
- Using load() instead of from_pretrained()
- Importing wrong model class
What is the shape of outputs.last_hidden_state?
from transformers import RobertaModel, RobertaTokenizer
import torch
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
inputs = tokenizer('Hello', return_tensors='pt')
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
Solution
Step 1: Understand tokenizer output shape
The tokenizer returns a batch containing 1 sentence. It adds special tokens, so 'Hello' becomes 3 tokens (<s>, Hello, </s>).
Step 2: Understand model output shape
RobertaModel outputs last_hidden_state with shape (batch_size, sequence_length, hidden_size). Batch size is 1, sequence length is 3 tokens, and hidden size is 768 for roberta-base.
Final Answer:
torch.Size([1, 3, 768])
Quick Check:
Output shape = (batch, tokens, features) [OK]
- Ignoring batch dimension
- Confusing sequence length with hidden size
- Assuming tokenizer returns 1 token
from transformers import DistilBertModel
model = DistilBertModel.from_pretrained('roberta-base')
What is the main issue causing the error?
Solution
Step 1: Check model class and model name compatibility
DistilBertModel expects a DistilBERT checkpoint. 'roberta-base' is a RoBERTa checkpoint intended for RobertaModel, so the class and the model name do not match and the pretrained weights cannot be loaded correctly.
Step 2: Confirm correct usage
To load 'roberta-base', use the RobertaModel class. For DistilBERT, use 'distilbert-base-uncased' with DistilBertModel.
Final Answer:
The model name 'roberta-base' is incompatible with the DistilBertModel class.
Quick Check:
Model class and name must match [OK]
- Using wrong model name for the class
- Assuming from_pretrained method is missing
- Confusing tokenizer import with model loading
Solution
Step 1: Consider device constraints and model size
Mobile devices have limited memory and compute power, so smaller models are preferred for speed and size.
Step 2: Evaluate model trade-offs
DistilBERT is designed to be smaller and faster than RoBERTa or full BERT, with only a small drop in accuracy, making it suitable for mobile.
Step 3: Assess other options
RoBERTa is larger and slower; compressing it can help but adds complexity. Full BERT is too large. RoBERTa without compression is slow.
Final Answer:
Use DistilBERT for faster inference and smaller size, accepting slight accuracy loss. -> Option A
Quick Check:
Mobile deployment favors small, fast models = A [OK]
- Choosing large models ignoring device limits
- Assuming compression is always best without trade-offs
- Confusing accuracy priority over speed on mobile
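A rough back-of-the-envelope estimate makes the size argument concrete. The sketch below assumes approximate parameter counts of about 66 million for distilbert-base-uncased and about 125 million for roberta-base, stored as 32-bit floats; it counts weights only and ignores activations, tokenizer files, and runtime overhead. The helper name weight_size_mb is hypothetical.

```python
def weight_size_mb(num_params, bytes_per_param=4):
    """Approximate size of the model weights in megabytes (weights only)."""
    return num_params * bytes_per_param / 1_000_000


# Approximate parameter counts (assumptions for illustration).
approx_params = {
    "distilbert-base-uncased": 66_000_000,
    "roberta-base": 125_000_000,
}

for name, params in approx_params.items():
    fp32 = weight_size_mb(params)                      # 32-bit floats
    fp16 = weight_size_mb(params, bytes_per_param=2)   # 16-bit floats
    print(f"{name}: ~{fp32:.0f} MB in fp32, ~{fp16:.0f} MB in fp16")
```

Under these assumptions DistilBERT needs roughly half the weight storage of RoBERTa, which is one reason it is the more natural fit for a memory-constrained device.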
