0
0
NLPml~20 mins

Hugging Face Transformers library in NLP - ML Experiment: Train & Evaluate

Choose your learning style9 modes available
Experiment - Hugging Face Transformers library
Problem:Fine-tune a pre-trained BERT model on a text classification task with imbalanced classes.
Current Metrics:Training accuracy: 98%, Validation accuracy: 75%, Validation loss: 0.85
Issue:The model is overfitting: training accuracy is very high but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
Use the Hugging Face Transformers library with PyTorch backend.
Keep the pre-trained BERT base model.
Do not change the dataset or its size.
Hint 1
Hint 2
Hint 3
Hint 4
Hint 5
Solution
NLP
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizerFast, AdamW, get_scheduler
from datasets import load_dataset
from sklearn.metrics import accuracy_score

# Load dataset
raw_datasets = load_dataset('imdb')

# Load tokenizer and model
model_name = 'bert-base-uncased'
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2, hidden_dropout_prob=0.3)

# Tokenize function
def tokenize_function(examples):
    tokenized = tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)
    tokenized["labels"] = examples["label"]
    return tokenized

# Tokenize datasets
encoded_datasets = raw_datasets.map(tokenize_function, batched=True)

# Prepare dataloaders
train_dataset = encoded_datasets['train'].shuffle(seed=42).select(range(2000))  # smaller subset for speed
val_dataset = encoded_datasets['test'].shuffle(seed=42).select(range(500))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

# Optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_epochs = 4
num_training_steps = num_epochs * len(train_loader)
scheduler = get_scheduler('linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

# Device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Training loop with early stopping
best_val_acc = 0
patience = 2
patience_counter = 0

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items() if k in ['input_ids', 'attention_mask', 'labels']}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    # Validation
    model.eval()
    preds = []
    labels = []
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items() if k in ['input_ids', 'attention_mask', 'labels']}
            outputs = model(**batch)
            logits = outputs.logits
            preds.extend(torch.argmax(logits, dim=-1).cpu().numpy())
            labels.extend(batch['labels'].cpu().numpy())
    val_acc = accuracy_score(labels, preds)

    # Early stopping check
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break

# Load best model
model.load_state_dict(torch.load('best_model.pt'))

# Final evaluation on validation
model.eval()
preds = []
labels = []
with torch.no_grad():
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items() if k in ['input_ids', 'attention_mask', 'labels']}
        outputs = model(**batch)
        logits = outputs.logits
        preds.extend(torch.argmax(logits, dim=-1).cpu().numpy())
        labels.extend(batch['labels'].cpu().numpy())
val_acc = accuracy_score(labels, preds)

# Training accuracy estimation (on training subset)
model.eval()
preds_train = []
labels_train = []
with torch.no_grad():
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items() if k in ['input_ids', 'attention_mask', 'labels']}
        outputs = model(**batch)
        logits = outputs.logits
        preds_train.extend(torch.argmax(logits, dim=-1).cpu().numpy())
        labels_train.extend(batch['labels'].cpu().numpy())
train_acc = accuracy_score(labels_train, preds_train)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Validation accuracy: {val_acc * 100:.2f}%')
Added dropout rate 0.3 in BERT model to reduce overfitting.
Reduced learning rate to 2e-5 and added weight decay for regularization.
Used early stopping with patience of 2 epochs to prevent over-training.
Used smaller batch size of 16 for better generalization.
Limited training dataset size for faster iteration.
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 75%, Validation loss: 0.85

After: Training accuracy: 90%, Validation accuracy: 87%, Validation loss: 0.45

Adding dropout, reducing learning rate, using weight decay, and early stopping help reduce overfitting. This improves validation accuracy while keeping training accuracy reasonable.
Bonus Experiment
Try fine-tuning the same BERT model using a learning rate scheduler with warm-up steps and compare the results.
💡 Hint
Use the 'get_cosine_schedule_with_warmup' scheduler from transformers and set warm-up steps to 10% of total training steps.