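BERT fine-tuning adapts a pre-trained language model so that it classifies text into categories. Because training starts from pre-trained weights rather than from scratch, it takes less time and often works well even with a small labeled dataset.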
BERT fine-tuning for classification in NLP
Introduction
Syntax
from transformers import BertForSequenceClassification, BertTokenizer
from torch.utils.data import DataLoader
import torch

# Load pre-trained BERT model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Prepare data: tokenize texts
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor(labels)
dataset = torch.utils.data.TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels)
dataloader = DataLoader(dataset, batch_size=8)

# Training loop example
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for batch in dataloader:
        input_ids, attention_mask, labels = batch
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
Use BertForSequenceClassification for classification tasks.
Tokenize text with padding and truncation so every sequence in a batch has the same length and none exceeds BERT's input limit (512 tokens for bert-base-uncased).
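To make the effect of padding and truncation concrete, here is a minimal pure-Python sketch. The helper `pad_and_truncate` and the token IDs are hypothetical and only for illustration. The real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    # Padding positions get pad_id in input_ids and 0 in the attention mask.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Made-up token IDs for three texts of different lengths.
batch = [
    [101, 7592, 102],
    [101, 7592, 2088, 999, 102],
    [101, 1, 2, 3, 4, 5, 6, 102],
]
input_ids, attention_mask = pad_and_truncate(batch, max_length=6)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The attention mask tells the model which positions hold real tokens (1) and which are padding (0).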
Examples
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
inputs = tokenizer(['Hello world!'], padding=True, truncation=True, return_tensors='pt')
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
logits = outputs.logits
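For single-label classification, the loss the model returns is, as far as we know, the cross-entropy between the logits and the labels. The sketch below computes that quantity by hand in plain Python so the relation between `logits` and `loss` is visible. The function name `cross_entropy` and the sample numbers are assumptions for illustration, not part of the library.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-probability of the correct class over the batch."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp with the row maximum subtracted for numerical stability
        row_max = max(row)
        log_norm = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_norm - row[label]
    return total / len(labels)


# Two samples, two classes; each sample's correct class has the larger logit.
logits = [[2.0, 1.0], [0.5, 1.5]]
labels = [0, 1]
loss = cross_entropy(logits, labels)
print(f"loss = {loss:.4f}")
```

The loss shrinks as the logit of the correct class grows relative to the others, which is what each training step pushes toward.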
Sample Model
This code fine-tunes BERT on two example sentences for sentiment classification. It prints the loss and predicted classes after one training pass.
from transformers import BertForSequenceClassification, BertTokenizer
from torch.utils.data import DataLoader, TensorDataset
import torch

# Sample data
texts = ['I love this movie', 'This movie is bad']
labels = [1, 0]  # 1=positive, 0=negative

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
labels_tensor = torch.tensor(labels)
dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels_tensor)
dataloader = DataLoader(dataset, batch_size=2)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# Training loop (1 epoch for demo)
model.train()
for batch in dataloader:
    input_ids, attention_mask, labels_batch = batch
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels_batch)
    loss = outputs.loss
    logits = outputs.logits
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Evaluation
model.eval()
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    predictions = torch.argmax(outputs.logits, dim=1)

print(f'Loss after training: {loss.item():.4f}')
print(f'Predictions: {predictions.tolist()}')
Important Notes
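Fine-tuning runs on a CPU, but a GPU makes training much faster.
Use a small learning rate such as 5e-5 so that updates do not overwrite what the model learned during pre-training.
More data usually improves accuracy. More epochs can help too, but on a small dataset extra epochs may lead to overfitting. Both make training take longer.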
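A rough way to see why more data and more epochs take longer is to count optimizer steps. The sketch below assumes the last, partial batch is kept, which is the `DataLoader` default. The helper name and the dataset size of 1000 are hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Number of optimizer.step() calls when the final partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Example: 1000 training texts with batch_size=8 and 3 epochs.
steps = total_optimizer_steps(num_examples=1000, batch_size=8, epochs=3)
print(steps)
```

Doubling either the dataset size or the number of epochs roughly doubles the step count and therefore the training time.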
Summary
BERT fine-tuning adapts a powerful language model to your classification task.
Tokenize text properly before feeding it to BERT.
Train with a small learning rate and check loss and predictions to see progress.
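One simple way to check predictions is accuracy, the fraction of predicted classes that match the true labels. This is a minimal sketch with made-up predictions and labels; the function name `accuracy` is our own.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical predicted classes and true labels for four texts.
predictions = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
acc = accuracy(predictions, labels)
print(f"accuracy = {acc:.2f}")
```

Accuracy is most informative on texts the model did not see during training.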
Practice
1. What is the main purpose of fine-tuning BERT for a classification task?
easy
Solution
Step 1: Understand BERT's pretraining
BERT is pretrained on general language tasks and needs adjustment for specific tasks like classification.
Step 2: Purpose of fine-tuning
Fine-tuning adapts BERT's learned language understanding to classify categories in your dataset.
Final Answer:
To adapt BERT's knowledge to classify specific categories in your data -> Option A
Quick Check:
Fine-tuning = adapt BERT for classification [OK]
Hint: Fine-tuning means adjusting BERT for your task, not training from scratch [OK]
Common Mistakes:
- Thinking fine-tuning trains BERT from zero
- Confusing fine-tuning with model compression
- Assuming BERT outputs images
2. Which of the following is the correct way to tokenize text before feeding it to BERT in Python?
easy
Solution
Step 1: Identify proper BERT tokenization method
BERT needs token IDs and an attention mask, not raw strings. tokenizer.encode_plus produces both; calling the tokenizer directly, as in tokenizer(text, return_tensors='pt'), does the same and is the form recent versions of transformers recommend.
Step 2: Compare options
tokens = tokenizer.encode_plus(text, return_tensors='pt') uses encode_plus with return_tensors='pt' to return PyTorch tensors, which is valid BERT input.
Final Answer:
tokens = tokenizer.encode_plus(text, return_tensors='pt') -> Option B
Quick Check:
Use encode_plus for BERT tokenization [OK]
Hint: Use tokenizer.encode_plus or tokenizer() for BERT input [OK]
Common Mistakes:
- Using simple split instead of tokenizer
- Only tokenizing without encoding IDs
- Not returning tensors for model input
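To see why a plain split is not enough: BERT's tokenizer breaks words into WordPiece subwords from a fixed vocabulary, marking continuation pieces with `##`. The toy sketch below shows the greedy longest-match idea on a single word. The vocabulary is made up, and the real tokenizer also lowercases text, splits on punctuation, and maps pieces to IDs.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Tiny hypothetical vocabulary, for illustration only.
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
tokens = wordpiece("playing", vocab)
print(tokens)
print(wordpiece("unplayable", vocab))
```

A plain `text.split()` would keep "playing" as one unit, which cannot be looked up if only its pieces are in the vocabulary.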
3. Given this code snippet for fine-tuning BERT, what will be the output of print(predictions.argmax(dim=1)) if the model predicts logits [[2.0, 1.0], [0.5, 1.5]] for two samples?
import torch
logits = torch.tensor([[2.0, 1.0], [0.5, 1.5]])
predictions = logits
print(predictions.argmax(dim=1))
medium
Solution
Step 1: Understand argmax(dim=1)
Argmax along dim=1 finds the index of the max value in each row (sample).
Step 2: Calculate argmax for each sample
First row: max is 2.0 at index 0; second row: max is 1.5 at index 1.
Final Answer:
tensor([0, 1]) -> Option D
Quick Check:
Argmax per row = [0, 1] [OK]
Hint: Argmax dim=1 picks max index per sample row [OK]
Common Mistakes:
- Confusing dim=0 with dim=1
- Mixing up indices and values
- Expecting values instead of indices
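The difference between dim=1 and dim=0 can be sketched with plain Python lists. The logits in the question happen to give the same answer along both dimensions, so this sketch uses a different, made-up matrix. The helper `argmax` is our own.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


# Rows are samples, columns are classes (hypothetical logits).
logits = [[2.0, 1.0], [3.0, 0.5]]

# Like argmax(dim=1): the best class for each sample (one index per row).
per_sample = [argmax(row) for row in logits]

# Like argmax(dim=0): the sample with the highest score for each class (one index per column).
per_class = [argmax(col) for col in zip(*logits)]

print(per_sample, per_class)
```

For classification you want one predicted class per sample, which is the per-row result.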
4. You run this training loop snippet but get a runtime error:
AttributeError: 'NoneType' object has no attribute 'backward'. What is the likely fix?
outputs = model(input_ids, attention_mask)
loss = outputs.loss
loss.backward()
medium
Solution
Step 1: Understand error cause
The labels argument is optional in BertForSequenceClassification. When it is omitted, the model returns only logits and outputs.loss is None, so loss.backward() fails.
Step 2: Fix by passing labels
Include the labels argument in the model call to get a loss: model(input_ids, attention_mask, labels=labels).
Final Answer:
Pass labels to the model call: model(input_ids, attention_mask, labels=labels) -> Option A
Quick Check:
Missing labels argument causes loss error [OK]
Hint: Always pass labels to get loss during training [OK]
Common Mistakes:
- Ignoring the missing labels argument
- Removing backward call instead of fixing input
- Changing variable names incorrectly
5. You want to fine-tune BERT on a small dataset for sentiment classification. Which strategy helps avoid overfitting during training?
hard
Solution
Step 1: Identify overfitting risks
Small datasets can cause the model to memorize instead of generalize.
Step 2: Apply regularization techniques
Using a small learning rate and dropout helps the model learn smoothly and avoid overfitting.
Final Answer:
Use a small learning rate and add dropout layers -> Option C
Quick Check:
Small LR + dropout reduces overfitting [OK]
Hint: Small learning rate + dropout helps generalize on small data [OK]
Common Mistakes:
- Training longer without regularization
- Skipping tokenization
- Removing classification head incorrectly
