
BERT fine-tuning for classification in NLP - ML Experiment: Train & Evaluate

Experiment - BERT fine-tuning for classification
Problem: We want to classify movie reviews as positive or negative using BERT. The current model is fine-tuned on a small dataset but shows signs of overfitting.
Current Metrics: Training accuracy: 95%, Validation accuracy: 75%, Training loss: 0.15, Validation loss: 0.60
Issue: The model overfits: training accuracy is very high but validation accuracy is 20 points lower, indicating poor generalization.
Your Task
Reduce overfitting and improve validation accuracy to at least 85% while keeping training accuracy below 92%.
You can only change hyperparameters like learning rate, batch size, number of epochs, and add regularization techniques.
Do not change the BERT base model architecture.
Use the same dataset and preprocessing.
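The targets above can be written as a quick check to run after each experiment. This is a minimal sketch: the function names accuracy_gap and meets_targets are illustrative and not part of the exercise, and accuracies are assumed to be given in percent.

```python
def accuracy_gap(train_acc, val_acc):
    """Gap between training and validation accuracy, in percentage points."""
    return train_acc - val_acc


def meets_targets(train_acc, val_acc, min_val=85.0, max_train=92.0):
    """True when validation accuracy is at least min_val and training accuracy is below max_train."""
    return val_acc >= min_val and train_acc < max_train


# Starting point from the problem statement: 95% train, 75% validation.
print(accuracy_gap(95.0, 75.0))
print(meets_targets(95.0, 75.0))
```

A large gap together with a failed target check is the signal to add regularization before changing anything else.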
Solution
import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, get_scheduler
from datasets import load_dataset
from sklearn.metrics import accuracy_score

# Load dataset
raw_datasets = load_dataset('imdb')

# Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

# Tokenize
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

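# Prepare PyTorch datasets
tokenized_datasets = tokenized_datasets.remove_columns(['text'])
# The model's forward() expects the target column to be named 'labels'; IMDB calls it 'label'
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets.set_format('torch')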

train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(2000))  # smaller subset
val_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(500))

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# Load model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2, hidden_dropout_prob=0.3)

# Optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=2e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
scheduler = get_scheduler('linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

# Device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

def evaluate(model, loader, device):
    """Return the accuracy of the model over every batch in the loader."""
    model.eval()  # disables dropout
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits
            preds = torch.argmax(logits, dim=-1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch['labels'].cpu().numpy())
    return accuracy_score(all_labels, all_preds)

# Training loop with early stopping
best_val_acc = 0
patience = 2
patience_counter = 0

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    # Validation
    val_acc = evaluate(model, val_loader, device)

    # Early stopping check: save the best checkpoint, stop after `patience` epochs without improvement
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break

# Load best model
model.load_state_dict(torch.load('best_model.pt'))

# Final evaluation on validation
val_acc = evaluate(model, val_loader, device)

# Training accuracy calculation
train_loader_eval = DataLoader(train_dataset, batch_size=32)
train_acc = evaluate(model, train_loader_eval, device)

print(f'Training accuracy: {train_acc*100:.2f}%')
print(f'Validation accuracy: {val_acc*100:.2f}%')
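Set hidden_dropout_prob=0.3 (the BERT default is 0.1), which raises dropout throughout BERT's hidden layers and, by default, in the classification head, to reduce overfitting.
Lowered the learning rate to 2e-5 so the pretrained weights are updated in smaller, steadier steps.
Reduced the batch size to 16; smaller batches give noisier gradient estimates, which can act as a mild regularizer.
Implemented early stopping with a patience of 2 epochs and saved the checkpoint with the best validation accuracy. With only 3 epochs, a patience of 2 cannot trigger before the final epoch, so restoring the best checkpoint does most of the work here.
Limited training to 3 epochs to avoid over-training.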
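The early-stopping rule in the list above can be isolated from the training loop. The following is a minimal sketch in plain Python: the EarlyStopping class is a hypothetical helper, not part of PyTorch or transformers, and the validation accuracies are made up for illustration.

```python
class EarlyStopping:
    """Track the best validation score and signal when training should stop."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_score = float('-inf')
        self.bad_epochs = 0  # consecutive epochs without improvement

    def step(self, score):
        """Record one epoch's score. Returns (improved, should_stop)."""
        if score > self.best_score:
            self.best_score = score
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Made-up validation accuracies, one per epoch, for illustration only.
stopper = EarlyStopping(patience=2)
for epoch, val_acc in enumerate([0.80, 0.86, 0.85, 0.84, 0.88]):
    improved, should_stop = stopper.step(val_acc)
    if improved:
        pass  # this is where the training script would save a checkpoint
    if should_stop:
        print(f'stopping after epoch {epoch}, best={stopper.best_score}')
        break
```

In this run the loop stops before the last score is seen, which shows the trade-off: a short patience saves compute but can miss a later improvement.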
Results Interpretation

Before: Training accuracy 95%, Validation accuracy 75%, Training loss 0.15, Validation loss 0.60

After: Training accuracy 90%, Validation accuracy 87%, Training loss 0.30, Validation loss 0.40

Adding dropout, lowering the learning rate, and using early stopping helped reduce overfitting. The gap between training and validation accuracy shrank from 20 points to 3, and validation accuracy rose from 75% to 87%, so the model now generalizes better.
Bonus Experiment
Try using a learning rate scheduler with warm-up steps and experiment with weight decay to further improve validation accuracy.
💡 Hint
Use transformers' get_cosine_schedule_with_warmup and add weight_decay parameter to AdamW optimizer.
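To see what a cosine schedule with warm-up does to the learning rate, the multiplier can be computed by hand. This is a sketch of the schedule's shape in plain Python, not the library code itself; the function name cosine_warmup_multiplier and the 37-step warm-up (about 10% of training) are assumptions for illustration.

```python
import math


def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warm-up from 0 to 1, then half-cosine decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 2e-5
total_steps = 375  # 3 epochs * 125 batches (2000 examples / batch size 16)
warmup_steps = 37  # assumed value, roughly 10% of training

for step in (0, 37, 206, 375):
    lr = base_lr * cosine_warmup_multiplier(step, warmup_steps, total_steps)
    print(step, f'{lr:.2e}')
```

In the training script, the equivalent change would be to replace the get_scheduler('linear', ...) call with get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=..., num_training_steps=...) and to pass a weight_decay value to AdamW; 0.01 is a common starting point, though the best value depends on the dataset.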

Practice

1. What is the main purpose of fine-tuning BERT for a classification task?
easy
A. To adapt BERT's knowledge to classify specific categories in your data
B. To train BERT from scratch on a large dataset
C. To reduce the size of the BERT model for faster inference
D. To convert text into images for classification

Solution

  1. Step 1: Understand BERT's pretraining

    BERT is pretrained on general language tasks and needs adjustment for specific tasks like classification.
  2. Step 2: Purpose of fine-tuning

    Fine-tuning adapts BERT's learned language understanding to classify categories in your dataset.
  3. Final Answer:

    To adapt BERT's knowledge to classify specific categories in your data -> Option A
  4. Quick Check:

    Fine-tuning = adapt BERT for classification [OK]
Hint: Fine-tuning means adjusting BERT for your task, not training from scratch [OK]
Common Mistakes:
  • Thinking fine-tuning trains BERT from zero
  • Confusing fine-tuning with model compression
  • Assuming BERT outputs images
2. Which of the following is the correct way to tokenize text before feeding it to BERT in Python?
easy
A. tokens = text.split(' ')
B. tokens = tokenizer.encode_plus(text, return_tensors='pt')
C. tokens = tokenizer.tokenize(text)
D. tokens = text.lower()

Solution

  1. Step 1: Identify proper BERT tokenization method

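    BERT's tokenizer must turn text into token IDs and an attention mask. encode_plus does this in one call; recent versions of transformers favor calling the tokenizer directly, as in tokenizer(text, return_tensors='pt'), which returns the same fields.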
  2. Step 2: Compare options

    tokens = tokenizer.encode_plus(text, return_tensors='pt') uses encode_plus with return_tensors='pt' for PyTorch tensors, which is correct for BERT input.
  3. Final Answer:

    tokens = tokenizer.encode_plus(text, return_tensors='pt') -> Option B
  4. Quick Check:

    Use encode_plus for BERT tokenization [OK]
Hint: Use tokenizer.encode_plus or tokenizer() for BERT input [OK]
Common Mistakes:
  • Using simple split instead of tokenizer
  • Only tokenizing without encoding IDs
  • Not returning tensors for model input
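To show why a plain split is not enough, here is a simplified sketch of what a BERT-style tokenizer produces: subword pieces, special tokens, token IDs, and an attention mask. The vocabulary below is a hypothetical toy (the real bert-base-uncased vocabulary has roughly 30,000 entries), and the functions wordpiece and encode are illustrative stand-ins, not the transformers implementation.

```python
# Hypothetical toy vocabulary for illustration only.
VOCAB = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3,
         'the': 4, 'movie': 5, 'was': 6, 'un': 7, '##forget': 8, '##table': 9}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ['[UNK]']  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length=10):
    """Lowercase, split, add [CLS]/[SEP], map to IDs, pad, and build the attention mask."""
    tokens = ['[CLS]']
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_length - 1] + ['[SEP]']
    input_ids = [vocab[t] for t in tokens]
    padding = max_length - len(input_ids)
    return {'input_ids': input_ids + [vocab['[PAD]']] * padding,
            'attention_mask': [1] * len(input_ids) + [0] * padding}


encoded = encode('The movie was unforgettable', VOCAB)
print(encoded['input_ids'])
print(encoded['attention_mask'])
```

The word "unforgettable" is not in the toy vocabulary, yet it still maps to known IDs through its pieces, which text.split(' ') alone cannot do.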
3. Given this code snippet for fine-tuning BERT, what will be the output of print(predictions.argmax(dim=1)) if the model predicts logits [[2.0, 1.0], [0.5, 1.5]] for two samples?
logits = torch.tensor([[2.0, 1.0], [0.5, 1.5]])
predictions = logits
print(predictions.argmax(dim=1))
medium
A. tensor([2, 1])
B. tensor([1, 0])
C. tensor([1, 1])
D. tensor([0, 1])

Solution

  1. Step 1: Understand argmax(dim=1)

    Argmax along dim=1 finds the index of max value in each row (sample).
  2. Step 2: Calculate argmax for each sample

    First row: max is 2.0 at index 0; second row: max is 1.5 at index 1.
  3. Final Answer:

    tensor([0, 1]) -> Option D
  4. Quick Check:

    Argmax per row = [0, 1] [OK]
Hint: Argmax dim=1 picks max index per sample row [OK]
Common Mistakes:
  • Confusing dim=0 with dim=1
  • Mixing up indices and values
  • Expecting values instead of indices
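The row-wise argmax can be reproduced in plain Python to check the reasoning without torch. This is a small sketch; argmax_rows and argmax_cols are illustrative names, and the second matrix is an extra example chosen so that the two axes give different answers.

```python
def argmax_rows(matrix):
    """Index of the largest value in each row, matching argmax(dim=1) on a 2-D tensor."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


def argmax_cols(matrix):
    """Index of the largest value in each column, matching argmax(dim=0) on a 2-D tensor."""
    return argmax_rows([list(col) for col in zip(*matrix)])


logits = [[2.0, 1.0], [0.5, 1.5]]
print(argmax_rows(logits))  # [0, 1]

# Extra example where mixing up the axes changes the answer.
other = [[2.0, 3.0], [0.5, 1.5]]
print(argmax_rows(other))  # [1, 1]
print(argmax_cols(other))  # [0, 0]
```

Each entry of the row-wise result is a predicted class index for one sample, which is why dim=1 (or dim=-1) is the axis to use on a batch of logits.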
4. You run this training loop snippet but get a runtime error: TypeError: forward() missing 1 required positional argument: 'labels'. What is the likely fix?
outputs = model(input_ids, attention_mask)
loss = outputs.loss
loss.backward()
medium
A. Pass labels to the model call: model(input_ids, attention_mask, labels=labels)
B. Remove loss.backward() call
C. Change input_ids to input_id
D. Call model with only input_ids

Solution

  1. Step 1: Understand error cause

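    The model computes a loss only when labels are passed in the call. (In Hugging Face's BertForSequenceClassification, labels is an optional argument, so omitting it usually leaves outputs.loss as None and the failure appears at loss.backward() instead of as the TypeError quoted in the question; the fix is the same.)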
  2. Step 2: Fix by passing labels

    Include labels argument in model call to get loss: model(input_ids, attention_mask, labels=labels).
  3. Final Answer:

    Pass labels to the model call: model(input_ids, attention_mask, labels=labels) -> Option A
  4. Quick Check:

    Missing labels argument causes loss error [OK]
Hint: Always pass labels to get loss during training [OK]
Common Mistakes:
  • Ignoring the missing labels argument
  • Removing backward call instead of fixing input
  • Changing variable names incorrectly
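The dependence of the loss on labels can be illustrated with a stand-in forward function. This is a sketch in plain Python: toy_forward is a hypothetical substitute for a classification model's forward pass, and cross_entropy is a hand-written version of the usual softmax cross-entropy loss.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under a softmax over each row."""
    total = 0.0
    for row, label in zip(logits, labels):
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        total += log_sum_exp - row[label]
    return total / len(labels)


def toy_forward(logits, labels=None):
    """Hypothetical stand-in for a model call: the loss exists only when labels are given."""
    loss = cross_entropy(logits, labels) if labels is not None else None
    return {'logits': logits, 'loss': loss}


logits = [[2.0, 1.0], [0.5, 1.5]]
print(toy_forward(logits)['loss'])          # None: nothing to call backward() on
print(toy_forward(logits, [0, 1])['loss'])  # a float once labels are supplied
```

Without labels there is no loss value to backpropagate, which is why the training call must include them.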
5. You want to fine-tune BERT on a small dataset for sentiment classification. Which strategy helps avoid overfitting during training?
hard
A. Train BERT without tokenization to save time
B. Increase batch size to maximum and train longer
C. Use a small learning rate and add dropout layers
D. Remove the classification head and train only embeddings

Solution

  1. Step 1: Identify overfitting risks

    Small datasets can cause the model to memorize instead of generalize.
  2. Step 2: Apply regularization techniques

    Using a small learning rate and dropout helps the model learn smoothly and avoid overfitting.
  3. Final Answer:

    Use a small learning rate and add dropout layers -> Option C
  4. Quick Check:

    Small LR + dropout reduces overfitting [OK]
Hint: Small learning rate + dropout helps generalize on small data [OK]
Common Mistakes:
  • Training longer without regularization
  • Skipping tokenization
  • Removing classification head incorrectly
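Dropout itself is simple to sketch. The version below is a minimal plain-Python illustration of inverted dropout, the variant used during training by common deep learning frameworks; the function name dropout and the fixed random seed are choices made here for illustration.

```python
import random


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, scale survivors by 1 / (1 - p)."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, p=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f'zeroed: {zero_fraction:.2f}, mean after dropout: {mean_after:.2f}')
```

About 30% of the activations are zeroed while the average stays close to the original, so the network cannot rely on any single unit. At evaluation time dropout is switched off, which is what model.eval() does in the solution code.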