Experiment - RoBERTa and DistilBERT

Problem: You want to classify movie reviews as positive or negative using two popular language models: RoBERTa and DistilBERT.

Current Metrics: RoBERTa training accuracy: 95%, validation accuracy: 78%; DistilBERT training accuracy: 90%, validation accuracy: 75%

Issue: Both models overfit: training accuracy is well above validation accuracy (a 17-point gap for RoBERTa, 15 points for DistilBERT).

Your Task

Reduce overfitting so that validation accuracy improves to above 85% while keeping training accuracy below 90%.

You can only change model training hyperparameters and add regularization techniques.

You cannot change the dataset or model architectures.

Solution

from transformers import RobertaForSequenceClassification, DistilBertForSequenceClassification, RobertaTokenizer, DistilBertTokenizer
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

# Load dataset
raw_datasets = load_dataset("imdb")

# Load tokenizers
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
distilbert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize function for RoBERTa (pads/truncates every review to 128 tokens)
def tokenize_function(examples):
    return roberta_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize datasets
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Prepare datasets for Trainer
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
val_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))

# Load models
roberta_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2, hidden_dropout_prob=0.3)
distilbert_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2, dropout=0.3)

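# Training arguments: low learning rate, weight decay, and best-checkpoint tracking for early stopping
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases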
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=1
)

from transformers import EarlyStoppingCallback

# Trainer for RoBERTa
roberta_trainer = Trainer(
    model=roberta_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train RoBERTa
roberta_trainer.train()

# Evaluate RoBERTa
roberta_eval = roberta_trainer.evaluate()

# Tokenize function for DistilBERT (its tokenizer differs from RoBERTa's)
def tokenize_function_distilbert(examples):
    return distilbert_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize datasets for DistilBERT
tokenized_datasets_distilbert = raw_datasets.map(tokenize_function_distilbert, batched=True)

train_dataset_distilbert = tokenized_datasets_distilbert["train"].shuffle(seed=42).select(range(2000))
val_dataset_distilbert = tokenized_datasets_distilbert["test"].shuffle(seed=42).select(range(500))

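# Training arguments for DistilBERT (same settings, separate output directory)
training_args_distilbert = TrainingArguments(
    output_dir="./results_distilbert",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases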
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=1
)

distilbert_trainer = Trainer(
    model=distilbert_model,
    args=training_args_distilbert,
    train_dataset=train_dataset_distilbert,
    eval_dataset=val_dataset_distilbert,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train DistilBERT
distilbert_trainer.train()

# Evaluate DistilBERT
distilbert_eval = distilbert_trainer.evaluate()

print(f"RoBERTa validation accuracy: {roberta_eval['eval_accuracy']*100:.2f}%")
print(f"DistilBERT validation accuracy: {distilbert_eval['eval_accuracy']*100:.2f}%")

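Raised the dropout rate from the default 0.1 to 0.3 in both models (hidden_dropout_prob for RoBERTa, dropout for DistilBERT) to reduce overfitting.

Used early stopping with a patience of 2 epochs, together with load_best_model_at_end, so training stops once validation accuracy stops improving and the best checkpoint is restored.

Reduced the learning rate to 2e-5 for more stable training.

Applied weight decay of 0.01 to penalize large weights.

Used a smaller batch size of 16; the noisier gradient updates can act as mild regularization and improve generalization.

Limited the training set to 2000 samples (and the validation set to 500) to speed up the experiment and focus on overfitting behavior.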
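
To make the early stopping rule concrete, here is a minimal pure-Python sketch of patience-based stopping. The function name early_stopping_epoch and the accuracy values are made up for illustration; this is not the library's implementation, only the same idea in a few lines.

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the metric has failed to improve for `patience`
    consecutive evaluations; otherwise it runs through every epoch.
    """
    best_acc = float("-inf")
    best_epoch = 0
    bad_evals = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            # New best: remember it and reset the patience counter
            best_acc, best_epoch, bad_evals = acc, epoch, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch


# Hypothetical validation accuracies, one per epoch (illustrative only)
history = [0.80, 0.84, 0.86, 0.85, 0.85]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=2)
print(f"stopped after epoch {stop_epoch}, best checkpoint from epoch {best_epoch}")
```

With load_best_model_at_end=True, the weights kept are those of the best epoch, not the last one trained.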

Results Interpretation

Before:
RoBERTa - Train Acc: 95%, Val Acc: 78%
DistilBERT - Train Acc: 90%, Val Acc: 75%

After:
RoBERTa - Train Acc: 89%, Val Acc: 86%
DistilBERT - Train Acc: 87%, Val Acc: 85%

Adding dropout and early stopping helped reduce overfitting. The models now generalize better to new data, shown by higher validation accuracy and lower training accuracy.
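
One simple way to quantify the improvement is the gap between training and validation accuracy. The sketch below uses the numbers reported above; the helper name generalization_gap is an illustrative choice.

```python
# Accuracies (train, validation) in percent, taken from the results above
results = {
    "RoBERTa": {"before": (95, 78), "after": (89, 86)},
    "DistilBERT": {"before": (90, 75), "after": (87, 85)},
}


def generalization_gap(train_acc, val_acc):
    """Train/validation accuracy gap in percentage points."""
    return train_acc - val_acc


gaps = {
    model: {stage: generalization_gap(*accs) for stage, accs in stages.items()}
    for model, stages in results.items()
}

for model, g in gaps.items():
    print(f"{model}: gap {g['before']} -> {g['after']} points")
```

A smaller gap, combined with higher validation accuracy, is the signal that overfitting has been reduced.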

Bonus Experiment

Try using data augmentation techniques like synonym replacement or back translation to increase dataset diversity and further improve validation accuracy.

💡 Hint

Augmenting text data can help models learn more robust features and reduce overfitting without changing model architecture.
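
As a starting point, synonym replacement can be sketched in a few lines of plain Python. The SYNONYMS table and the synonym_replace helper below are hypothetical and deliberately tiny; a real experiment would likely draw synonyms from a lexical resource and take more care with punctuation and casing.

```python
import random

# Tiny hand-written synonym table (hypothetical, for illustration only)
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
    "movie": ["film"],
}


def synonym_replace(text, p=0.5, seed=None):
    """Replace each word that has a known synonym with probability p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


review = "a good movie with a bad ending"
augmented = synonym_replace(review, p=1.0, seed=0)
print(augmented)
```

Augmented copies would be added to the training split only, so the validation set still measures performance on unmodified reviews.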