RoBERTa and DistilBERT in NLP - ML Experiment: Train & Evaluate

Experiment - RoBERTa and DistilBERT
Problem: You want to classify movie reviews as positive or negative using two popular language models: RoBERTa and DistilBERT.
Current Metrics: RoBERTa training accuracy: 95%, validation accuracy: 78%; DistilBERT training accuracy: 90%, validation accuracy: 75%
Issue: Both models overfit: training accuracy is well above validation accuracy, by 17 points for RoBERTa and 15 points for DistilBERT.
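The size of the overfitting problem can be read directly off the numbers above. The following is a minimal sketch that computes the train/validation gap for each model; the helper name `generalization_gap` is an assumption made for illustration, and the accuracies are copied from the Current Metrics line.

```python
# Accuracies (in percent) copied from the "Current Metrics" line above.
current_metrics = {
    "RoBERTa": {"train": 95, "val": 78},
    "DistilBERT": {"train": 90, "val": 75},
}


def generalization_gap(train_acc, val_acc):
    """Return training accuracy minus validation accuracy, in percentage points."""
    return train_acc - val_acc


gaps = {
    name: generalization_gap(m["train"], m["val"])
    for name, m in current_metrics.items()
}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap} points")
```

A gap this large suggests the models are memorizing training reviews rather than learning features that transfer, which is what the task below tries to correct.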
Your Task
Reduce overfitting so that validation accuracy improves to above 85% while keeping training accuracy below 90%.
You can only change model training hyperparameters and add regularization techniques.
You cannot change the dataset or model architectures.
Solution
from transformers import RobertaForSequenceClassification, DistilBertForSequenceClassification, RobertaTokenizer, DistilBertTokenizer
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

# Load dataset
raw_datasets = load_dataset("imdb")

# Load tokenizers
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
distilbert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize function for RoBERTa
def tokenize_function(examples):
    return roberta_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize datasets for RoBERTa
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Prepare datasets for Trainer
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
val_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))

# Load models
roberta_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2, hidden_dropout_prob=0.3)
distilbert_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2, dropout=0.3)

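# Training arguments: weight decay plus best-checkpoint selection
# (dropout is set on the models above; early stopping is added via the Trainer callback)
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # recent transformers releases rename this to eval_strategy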
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=1
)

from transformers import EarlyStoppingCallback

# Trainer for RoBERTa
roberta_trainer = Trainer(
    model=roberta_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train RoBERTa
roberta_trainer.train()

# Evaluate RoBERTa
roberta_eval = roberta_trainer.evaluate()

# Tokenize function for DistilBERT
def tokenize_function_distilbert(examples):
    return distilbert_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize datasets for DistilBERT
tokenized_datasets_distilbert = raw_datasets.map(tokenize_function_distilbert, batched=True)

train_dataset_distilbert = tokenized_datasets_distilbert["train"].shuffle(seed=42).select(range(2000))
val_dataset_distilbert = tokenized_datasets_distilbert["test"].shuffle(seed=42).select(range(500))

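# Training arguments for DistilBERT (same settings, separate output directory)
training_args_distilbert = TrainingArguments(
    output_dir="./results_distilbert",
    evaluation_strategy="epoch",  # recent transformers releases rename this to eval_strategy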
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=1
)

distilbert_trainer = Trainer(
    model=distilbert_model,
    args=training_args_distilbert,
    train_dataset=train_dataset_distilbert,
    eval_dataset=val_dataset_distilbert,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train DistilBERT
distilbert_trainer.train()

# Evaluate DistilBERT
distilbert_eval = distilbert_trainer.evaluate()

print(f"RoBERTa validation accuracy: {roberta_eval['eval_accuracy']*100:.2f}%")
print(f"DistilBERT validation accuracy: {distilbert_eval['eval_accuracy']*100:.2f}%")
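Set the dropout probability to 0.3 in both models (hidden_dropout_prob for RoBERTa, dropout for DistilBERT) to reduce overfitting.
Used early stopping with a patience of 2 epochs so training stops once validation accuracy stops improving; load_best_model_at_end=True then restores the best checkpoint.
Reduced the learning rate to 2e-5 for more stable fine-tuning.
Applied weight decay of 0.01 as additional regularization.
Used a smaller batch size of 16, which adds gradient noise and can improve generalization.
Subsampled 2,000 training reviews and 500 validation reviews to speed up the experiment and make overfitting behavior easy to observe.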
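To make the patience rule concrete, here is a simplified pure-Python sketch of patience-based early stopping. It is not the library's implementation of `EarlyStoppingCallback`; the function name `early_stopping_run` and the accuracy history are hypothetical, chosen only to illustrate the counting logic.

```python
def early_stopping_run(val_accuracies, patience=2):
    """Simulate patience-based early stopping on per-epoch validation accuracy.

    Returns (stop_epoch, best_epoch), both 1-indexed. Training stops after
    `patience` consecutive epochs without a new best accuracy.
    """
    best_acc = float("-inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            # New best: remember it and reset the patience counter.
            best_acc, best_epoch, bad_epochs = acc, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_accuracies), best_epoch


# Hypothetical validation accuracies for five epochs (illustrative numbers only).
history = [0.80, 0.84, 0.86, 0.85, 0.85]
stop_epoch, best_epoch = early_stopping_run(history, patience=2)
print(f"stopped after epoch {stop_epoch}, best epoch was {best_epoch}")
```

In this made-up history, accuracy peaks at epoch 3 and fails to improve for two epochs, so training stops after epoch 5 and the epoch 3 checkpoint is the one worth keeping.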
Results Interpretation

Before:
RoBERTa - Train Acc: 95%, Val Acc: 78%
DistilBERT - Train Acc: 90%, Val Acc: 75%

After:
RoBERTa - Train Acc: 89%, Val Acc: 86%
DistilBERT - Train Acc: 87%, Val Acc: 85%

Adding dropout and early stopping helped reduce overfitting. The models now generalize better to new data, shown by higher validation accuracy and lower training accuracy.
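As a quick sanity check, the reported numbers can be compared against the task targets (validation above 85%, training below 90%). This is a minimal sketch; the helper `meets_target` is a hypothetical name, and it reads "above" and "below" as strict inequalities, which is an assumption about how the task is scored.

```python
# "After" accuracies (in percent) copied from the results above.
after = {
    "RoBERTa": {"train": 89, "val": 86},
    "DistilBERT": {"train": 87, "val": 85},
}


def meets_target(train_acc, val_acc, val_floor=85, train_ceiling=90):
    """True when validation is strictly above the floor and training strictly below the ceiling."""
    return val_acc > val_floor and train_acc < train_ceiling


# Remaining train/validation gap per model, in percentage points.
after_gaps = {name: m["train"] - m["val"] for name, m in after.items()}
# Whether each model satisfies both targets under the strict reading.
report = {name: meets_target(m["train"], m["val"]) for name, m in after.items()}

print(after_gaps)
print(report)
```

Under the strict reading, RoBERTa clears both targets, while DistilBERT at exactly 85% validation accuracy sits on the boundary rather than above it. The gaps shrink to 3 and 2 points, down from 17 and 15.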
Bonus Experiment
Try using data augmentation techniques like synonym replacement or back translation to increase dataset diversity and further improve validation accuracy.
💡 Hint
Augmenting text data can help models learn more robust features and reduce overfitting without changing model architecture.
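A rough sketch of synonym replacement is shown below. The synonym table is a tiny hand-written assumption (a real setup would draw on a much larger lexical resource), and `synonym_replace` is a hypothetical helper that splits on whitespace only, so it ignores punctuation and does not preserve capitalization of replaced words.

```python
import random

# Tiny hand-written synonym table, for illustration only.
SYNONYMS = {
    "great": ["excellent", "wonderful"],
    "boring": ["dull", "tedious"],
    "movie": ["film"],
}


def synonym_replace(text, replace_prob=0.5, seed=0):
    """Replace each word found in SYNONYMS with a random synonym, with probability replace_prob."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < replace_prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


original = "a great movie with a boring ending"
augmented = synonym_replace(original, replace_prob=1.0)
print(augmented)
```

Because the replacements keep the sentiment of each word, the augmented review can reuse the original label. Augmented copies should be added to the training split only, so the validation set still measures performance on unmodified reviews.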

Practice

1. Which statement best describes the main difference between RoBERTa and DistilBERT?
easy
A. Both models have the same size and speed but different training data.
B. DistilBERT is larger and more accurate, while RoBERTa is smaller and faster.
C. RoBERTa is designed only for translation, DistilBERT only for summarization.
D. RoBERTa is larger and more accurate, while DistilBERT is smaller and faster.

Solution

  1. Step 1: Understand model size and purpose

    RoBERTa is a robustly optimized variant of BERT designed for high accuracy in text understanding; roberta-base has roughly 125 million parameters. DistilBERT is a smaller, distilled version of BERT (roughly 66 million parameters) focused on speed and efficiency.
  2. Step 2: Compare their main characteristics

    RoBERTa offers better accuracy due to its size and training, while DistilBERT sacrifices some accuracy for faster performance and smaller size.
  3. Final Answer:

    RoBERTa is larger and more accurate, while DistilBERT is smaller and faster. -> Option D
  4. Quick Check:

    Model size and speed difference = D [OK]
Hint: Remember: RoBERTa = accuracy, DistilBERT = speed [OK]
Common Mistakes:
  • Confusing which model is larger
  • Thinking both models have the same speed
  • Assuming DistilBERT is more accurate
2. Which of the following is the correct way to load a pre-trained DistilBERT model using Hugging Face Transformers in Python?
easy
A. from transformers import DistilBertModel; model = DistilBertModel.from_pretrained('distilbert-base-uncased')
B. from transformers import RobertaModel; model = RobertaModel.load('distilbert-base-uncased')
C. import transformers; model = transformers.DistilBert.load_pretrained('distilbert-base-uncased')
D. from transformers import DistilBertTokenizer; model = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

Solution

  1. Step 1: Identify correct import and method

    The Hugging Face library uses from_pretrained() to load models. DistilBertModel is the correct class for the DistilBERT model.
  2. Step 2: Check each option's correctness

    Option A correctly imports DistilBertModel and calls from_pretrained with the right model name. Option B uses the wrong class and a load() method that does not exist, and option C calls a class and method that the library does not provide. Option D loads a tokenizer, not a model.
  3. Final Answer:

    from transformers import DistilBertModel; model = DistilBertModel.from_pretrained('distilbert-base-uncased') -> Option A
  4. Quick Check:

    Correct import and method = A [OK]
Hint: Use from_pretrained() with correct model class [OK]
Common Mistakes:
  • Confusing tokenizer with model loading
  • Using load() instead of from_pretrained()
  • Importing wrong model class
3. Given the following Python code using Hugging Face Transformers, what will be the output shape of outputs.last_hidden_state?
from transformers import RobertaModel, RobertaTokenizer
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

inputs = tokenizer('Hello', return_tensors='pt')
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
medium
A. torch.Size([768, 3])
B. torch.Size([1, 3, 768])
C. torch.Size([1, 768])
D. torch.Size([3, 768])

Solution

  1. Step 1: Understand tokenizer output shape

    The tokenizer returns a batch with 1 sentence. The tokenized input includes special tokens, so 'Hello' becomes 3 tokens (<s>, Hello, </s>).
  2. Step 2: Understand model output shape

    RobertaModel outputs last_hidden_state with shape (batch_size, sequence_length, hidden_size). Batch size is 1, sequence length is 3 tokens, hidden size is 768 for roberta-base.
  3. Final Answer:

    torch.Size([1, 3, 768]) -> Option B
  4. Quick Check:

    Output shape = (batch, tokens, features) = B [OK]
Hint: Output shape = (batch, tokens, hidden size) [OK]
Common Mistakes:
  • Ignoring batch dimension
  • Confusing sequence length with hidden size
  • Assuming tokenizer returns 1 token
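The (batch, tokens, hidden size) layout can be checked without downloading roberta-base. The sketch below builds a deliberately tiny, randomly initialized RoBERTa from a config; every size in it is an illustrative assumption, so the hidden size is 32 rather than 768, and the three token ids merely stand in for `<s>`, one word piece, and `</s>`. It assumes `torch` and `transformers` are installed.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Tiny configuration so the model builds instantly (sizes are illustrative assumptions).
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=40,
)
model = RobertaModel(config)  # random weights, nothing is downloaded
model.eval()

# One sequence (batch size 1) containing three token ids.
input_ids = torch.tensor([[0, 5, 2]])
with torch.no_grad():
    outputs = model(input_ids=input_ids)

# Layout is (batch_size, sequence_length, hidden_size).
shape = tuple(outputs.last_hidden_state.shape)
print(shape)
```

The first two dimensions follow the input (one sequence, three tokens) and the last follows the configuration, which is why the pretrained roberta-base checkpoint reports 768 in that position.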
4. You try to load a DistilBERT model with this code but get an error:
from transformers import DistilBertModel
model = DistilBertModel.from_pretrained('roberta-base')
What is the main issue causing the error?
medium
A. The from_pretrained method does not exist for DistilBertModel.
B. You forgot to import the tokenizer.
C. The model name 'roberta-base' is incompatible with DistilBertModel class.
D. The model name should be 'distilbert-base-uncased' but you used 'roberta-base'.

Solution

  1. Step 1: Check model class and model name compatibility

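    DistilBertModel expects a DistilBERT checkpoint. 'roberta-base' is a RoBERTa checkpoint, so its configuration and weight names do not match the DistilBERT architecture. Depending on the library version, this typically surfaces as an error or as a model-type mismatch warning followed by weights that are not loaded correctly.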
  2. Step 2: Confirm correct usage

    To load 'roberta-base', use RobertaModel class. For DistilBERT, use 'distilbert-base-uncased' with DistilBertModel.
  3. Final Answer:

    The model name 'roberta-base' is incompatible with DistilBertModel class. -> Option C
  4. Quick Check:

    Model class and name must match = C [OK]
Hint: Match model class with correct pretrained name [OK]
Common Mistakes:
  • Using wrong model name for the class
  • Assuming from_pretrained method is missing
  • Confusing tokenizer import with model loading
5. You want to deploy a text classification system that needs to run on a mobile device with limited memory but still maintain reasonable accuracy. Which model choice and approach is best?
hard
A. Use DistilBERT for faster inference and smaller size, accepting slight accuracy loss.
B. Use RoBERTa for best accuracy and compress it with quantization for mobile deployment.
C. Use full BERT model without compression for maximum accuracy.
D. Use RoBERTa with no compression for best speed.

Solution

  1. Step 1: Consider device constraints and model size

    Mobile devices have limited memory and compute power, so smaller models are preferred for speed and size.
  2. Step 2: Evaluate model trade-offs

    DistilBERT is designed to be smaller and faster than RoBERTa or full BERT, with only a small drop in accuracy, making it suitable for mobile.
  3. Step 3: Assess other options

    RoBERTa is larger and slower; compressing it can help but adds complexity. Full BERT is too large. RoBERTa without compression is slow.
  4. Final Answer:

    Use DistilBERT for faster inference and smaller size, accepting slight accuracy loss. -> Option A
  5. Quick Check:

    Mobile deployment favors small, fast models = A [OK]
Hint: Choose smaller model for mobile speed and size [OK]
Common Mistakes:
  • Choosing large models ignoring device limits
  • Assuming compression is always best without trade-offs
  • Prioritizing accuracy over speed and memory on mobile
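A back-of-the-envelope estimate makes the memory argument concrete. The sketch below multiplies approximate parameter counts by bytes per weight; the counts are rounded figures for the base checkpoints and should be treated as assumptions, and the estimate covers weights only, ignoring activations and runtime overhead.

```python
# Approximate parameter counts for the base checkpoints (rounded; treat as assumptions).
PARAMS = {
    "distilbert-base-uncased": 66_000_000,
    "roberta-base": 125_000_000,
}
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params, dtype):
    """Memory needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1_000_000


# Estimated weight memory per model and numeric precision.
estimates = {
    name: {dtype: weight_memory_mb(n, dtype) for dtype in BYTES_PER_WEIGHT}
    for name, n in PARAMS.items()
}

for name, sizes in estimates.items():
    print(name, sizes)
```

At equal precision DistilBERT needs roughly half the memory of RoBERTa. Quantization narrows the gap for RoBERTa, but the same compression can also be applied to DistilBERT, so the smaller model keeps its advantage.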