Custom QA model fine-tuning in NLP - ML Experiment: Train & Evaluate

Experiment - Custom QA model fine-tuning
Problem: You want to fine-tune a question-answering (QA) model on your own small dataset to improve its accuracy on domain-specific questions.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Validation loss: 1.2
Issue: The model is overfitting: training accuracy is very high while validation accuracy is much lower, which indicates poor generalization.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
You can only change hyperparameters and add regularization techniques.
You cannot change the model architecture or dataset size.
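The success criteria above can be written down as a small check. This is a minimal sketch; the function names `generalization_gap` and `meets_task_targets` are hypothetical helpers, not part of any library, and accuracies are assumed to be fractions between 0 and 1.

```python
def generalization_gap(train_acc: float, val_acc: float) -> float:
    """Training accuracy minus validation accuracy; a large gap suggests overfitting."""
    return train_acc - val_acc


def meets_task_targets(train_acc: float, val_acc: float,
                       max_train: float = 0.92, min_val: float = 0.85) -> bool:
    """True when validation accuracy reaches the target and training accuracy stays below the cap."""
    return val_acc >= min_val and train_acc < max_train


# Baseline metrics from the problem statement: 95% train, 70% validation
print(f"{generalization_gap(0.95, 0.70):.2f}")  # 0.25
print(meets_task_targets(0.95, 0.70))           # False
```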
Solution
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments, EarlyStoppingCallback
import numpy as np
from datasets import load_dataset

# Load dataset (example: SQuAD format or custom)
dataset = load_dataset('squad')

# Load tokenizer and model
model_name = 'distilbert-base-uncased-distilled-squad'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Increase dropout for regularization (hyperparameter change).
# DistilBERT's config names these `dropout` and `attention_dropout`. They are passed
# at load time because changing model.config after the model is built does not
# update the dropout layers that were already created.
model = AutoModelForQuestionAnswering.from_pretrained(
    model_name,
    dropout=0.3,
    attention_dropout=0.3,
)

# Preprocess function for QA: map character-level answer spans to token positions
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    answers = examples["answers"]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        padding="max_length",
        return_offsets_mapping=True,
    )
    offset_mapping = inputs.pop("offset_mapping")
    start_positions = []
    end_positions = []
    for i, answer in enumerate(answers):
        if len(answer["text"]) == 0 or len(answer["answer_start"]) == 0:
            # No answer: point both labels at the first token ([CLS])
            start_positions.append(0)
            end_positions.append(0)
            continue
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        offsets = offset_mapping[i]
        # sequence_ids is 0 for question tokens, 1 for context tokens, None for special tokens.
        # Offsets are relative to each text, so only context tokens may be matched.
        sequence_ids = inputs.sequence_ids(i)
        token_start = 0
        token_end = 0
        for idx, (tok_start, tok_end) in enumerate(offsets):
            if sequence_ids[idx] != 1:
                continue
            if tok_start <= start_char < tok_end:
                token_start = idx
            if tok_start < end_char <= tok_end:
                token_end = idx
                break
        # If either end of the answer was truncated away, label the example (0, 0)
        if token_start == 0 or token_end == 0:
            token_start = token_end = 0
        start_positions.append(token_start)
        end_positions.append(token_end)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# Tokenize dataset
encoded_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # renamed to eval_strategy in recent transformers releases
    learning_rate=3e-5,  # Lower learning rate
    per_device_train_batch_size=8,  # Smaller batch size
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    save_strategy='epoch',
    logging_dir='./logs',
    logging_steps=10
)

# Compute metrics (token position accuracy)
def compute_metrics(eval_preds):
    predictions, label_ids = eval_preds
    start_logits, end_logits = predictions
    start_labels, end_labels = label_ids
    start_preds = np.argmax(start_logits, axis=-1)
    end_preds = np.argmax(end_logits, axis=-1)
    accuracy_start = np.mean(start_preds == start_labels)
    accuracy_end = np.mean(end_preds == end_labels)
    return {
        "accuracy_start": accuracy_start,
        "accuracy_end": accuracy_end,
    }

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
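    # Note: with 3 epochs and per-epoch evaluation, a patience of 3 can never trigger.
    # Lower it (or train for more epochs) if early stopping should actually halt training.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]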
)

# Train
trainer.train()
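Increased dropout to 0.3 for both hidden layers and attention for stronger regularization. DistilBERT's config calls these dropout and attention_dropout, and they must be set when the model is loaded to take effect.
Lowered learning rate from 5e-5 to 3e-5 for smoother training.
Reduced batch size from 16 to 8; the noisier gradient updates can act as mild regularization.
Added weight decay (0.01) to regularize model weights.
Added EarlyStoppingCallback with patience=3 and load_best_model_at_end=True to prevent over-training. With only 3 epochs and per-epoch evaluation, a patience of 3 never triggers, so the protection comes from load_best_model_at_end restoring the best checkpoint; lower the patience or train longer for early stopping itself to take effect.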
Implemented correct preprocessing to compute start/end positions using offset mappings.
Added compute_metrics for token-level accuracy to monitor training/validation metrics.
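The offset-mapping step in the list above can be illustrated without a tokenizer. This is a minimal sketch with hand-written offsets and sequence ids that imitate what a fast tokenizer returns for a question/context pair; the helper name `char_span_to_token_span` and the toy data are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]],
    sequence_ids: List[Optional[int]],
    start_char: int,
    end_char: int,
) -> Tuple[int, int]:
    """Map a character span in the context to (start_token, end_token).

    Only tokens with sequence id 1 (the context) are considered, because
    question offsets are relative to the question string and could match
    the same character positions by accident. Returns (0, 0) if the span
    is not fully covered, e.g. because the context was truncated.
    """
    token_start = token_end = None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        if sequence_ids[idx] != 1:
            continue
        if tok_start <= start_char < tok_end:
            token_start = idx
        if tok_start < end_char <= tok_end:
            token_end = idx
            break
    if token_start is None or token_end is None:
        return (0, 0)
    return (token_start, token_end)


# Toy encoding of: [CLS] who wrote it ? [SEP] ada wrote it . [SEP]
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (12, 13), (0, 0),
           (0, 3), (4, 9), (10, 12), (12, 13), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, None, 1, 1, 1, 1, None]

# The answer "Ada" covers characters 0..3 of the context
print(char_span_to_token_span(offsets, sequence_ids, 0, 3))    # (6, 6)
# A span beyond the encoded context falls back to (0, 0)
print(char_span_to_token_span(offsets, sequence_ids, 50, 55))  # (0, 0)
```

Without the sequence id filter, the question token "who" (also at characters 0 to 3) would be matched first and the label would point into the question.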
Results Interpretation

Before: Training accuracy 95%, Validation accuracy 70%, Validation loss 1.2

After: Training accuracy 90%, Validation accuracy 87%, Validation loss 0.8

Adding regularization and tuning hyperparameters reduces overfitting, improving validation accuracy and model generalization.
Bonus Experiment
Try fine-tuning the QA model using a learning rate scheduler and data augmentation to further improve validation accuracy.
💡 Hint
Use a scheduler like cosine decay and augment your training data by paraphrasing questions or adding noise to contexts.
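To make the cosine decay idea concrete, here is a minimal sketch of the schedule as a plain function. The name `cosine_lr`, the optional linear warmup, and the step counts are assumptions for illustration; with the Hugging Face Trainer the same shape is normally requested through `lr_scheduler_type="cosine"` in `TrainingArguments` rather than computed by hand.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 3e-5,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay from base_lr to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at the start, middle, and end of a 100-step run
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100))
```

The rate starts at `base_lr`, passes through roughly half of it at the midpoint, and reaches zero at the final step.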

Practice

1. What is the main purpose of fine-tuning a custom QA model?
easy
A. To reduce the training time of the model
B. To make the model answer questions better on your specific data
C. To increase the model's size and complexity
D. To change the model's language to another one

Solution

  1. Step 1: Understand fine-tuning goal

    Fine-tuning adjusts a model to perform better on a specific task or dataset.
  2. Step 2: Relate to QA models

    For QA, fine-tuning helps the model answer questions accurately on your own data.
  3. Final Answer:

    To make the model answer questions better on your specific data -> Option B
  4. Quick Check:

    Fine-tuning = better task-specific answers [OK]
Hint: Fine-tuning adapts model to your data for better answers [OK]
Common Mistakes:
  • Thinking fine-tuning changes model size
  • Confusing fine-tuning with faster training
  • Assuming it changes the model's language
2. Which of the following is the correct way to prepare data for fine-tuning a QA model?
easy
A. A dataset with questions, contexts, and answers
B. A dataset with only questions and answers
C. A dataset with only contexts and answers
D. A dataset with random text and no labels

Solution

  1. Step 1: Identify required data components

    QA models need questions, contexts (where answers are found), and answers to learn properly.
  2. Step 2: Check options

    Only the dataset with questions, contexts, and answers includes all three necessary parts for training.
  3. Final Answer:

    A dataset with questions, contexts, and answers -> Option A
  4. Quick Check:

    QA data = questions + contexts + answers [OK]
Hint: QA fine-tuning needs question, context, and answer triplets [OK]
Common Mistakes:
  • Omitting context in the dataset
  • Using unlabeled or random text
  • Ignoring the answer field
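As an illustration of the question, context, and answer triplet, the sketch below shows one record in the SQuAD-style layout used by the solution code (an `answers` dict with `text` and `answer_start` lists) together with a sanity check. The record itself is made up, and `validate_qa_record` is a hypothetical helper.

```python
def validate_qa_record(record: dict) -> list:
    """Return a list of problems found in a SQuAD-style record (empty if it looks valid)."""
    problems = [f"missing field: {key}"
                for key in ("question", "context", "answers")
                if key not in record]
    if problems:
        return problems
    answers = record["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        # The answer text must appear in the context exactly at answer_start
        if record["context"][start:start + len(text)] != text:
            problems.append(f"answer {text!r} not found at offset {start}")
    return problems


example = {
    "question": "Who wrote the report?",
    "context": "The report was written by Ada Lovelace in 1843.",
    "answers": {"text": ["Ada Lovelace"], "answer_start": [26]},
}

print(validate_qa_record(example))  # []
print(validate_qa_record({"question": "Who?", "answers": {"text": [], "answer_start": []}}))
# ['missing field: context']
```

Running a check like this over a custom dataset before tokenization catches misaligned `answer_start` offsets, which otherwise silently produce wrong training labels.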
3. Given the following code snippet for fine-tuning a QA model using Hugging Face Trainer, what happens when the final print statement runs?
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(output_dir='./results', num_train_epochs=1)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
metrics = trainer.train()
print(metrics.metrics['eval_accuracy'])
medium
A. An integer count of training steps
B. A syntax error due to missing eval_accuracy metric
C. A float value representing evaluation accuracy
D. A KeyError because eval_accuracy is not computed by default

Solution

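  1. Step 1: Understand default metrics in Trainer

    trainer.train() returns a TrainOutput whose metrics dictionary holds training statistics such as train_loss and train_runtime. Evaluation metrics come from trainer.evaluate(), and 'eval_accuracy' appears there only if a compute_metrics function returning an 'accuracy' key is provided.
  2. Step 2: Analyze printed output

    Since no compute_metrics is defined and the dictionary comes from train(), the 'eval_accuracy' key does not exist, so accessing it raises a KeyError.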
  3. Final Answer:

    A KeyError because eval_accuracy is not computed by default -> Option D
  4. Quick Check:

    Default Trainer lacks eval_accuracy metric [OK]
Hint: Without compute_metrics, eval_accuracy is not available [OK]
Common Mistakes:
  • Assuming eval_accuracy is always computed
  • Expecting a syntax error instead of missing metric
  • Confusing training steps count with accuracy
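For comparison, a compute_metrics-style function that would make an accuracy value available could look like the sketch below. It is written for plain Python lists so it runs without dependencies; inside a real Trainer the logits and labels arrive as NumPy arrays, and the returned key `accuracy` is reported by `trainer.evaluate()` under the prefixed name `eval_accuracy`. The toy logits are made up.

```python
def argmax(row):
    """Index of the largest value in a sequence."""
    return max(range(len(row)), key=lambda i: row[i])


def compute_metrics(eval_preds):
    """Exact-span accuracy: a prediction counts only if both start and end match."""
    (start_logits, end_logits), (start_labels, end_labels) = eval_preds
    correct = 0
    for s_row, e_row, s_gold, e_gold in zip(start_logits, end_logits,
                                            start_labels, end_labels):
        if argmax(s_row) == s_gold and argmax(e_row) == e_gold:
            correct += 1
    return {"accuracy": correct / len(start_labels)}


# Two toy examples with three token positions each
start_logits = [[0.1, 2.0, 0.3], [1.0, 0.2, 0.1]]
end_logits = [[0.0, 0.2, 1.5], [0.3, 0.9, 0.1]]
labels = ([1, 0], [2, 2])  # (start_labels, end_labels)

print(compute_metrics(((start_logits, end_logits), labels)))  # {'accuracy': 0.5}
```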
4. You tried fine-tuning a QA model but got this error: ValueError: Expected input batch to have 3 elements (input_ids, attention_mask, token_type_ids). What is the most likely cause?
medium
A. You forgot to set num_train_epochs in TrainingArguments
B. The model architecture is incompatible with QA tasks
C. Your dataset does not return token_type_ids in __getitem__
D. You used the wrong optimizer in Trainer

Solution

  1. Step 1: Understand the error message

    The error says the input batch is missing token_type_ids, which some models need (BERT-style models use them to tell question tokens from context tokens; DistilBERT does not take them).
  2. Step 2: Check dataset output

    If the dataset's __getitem__ method does not return token_type_ids, the model input is incomplete causing this error.
  3. Final Answer:

    Your dataset does not return token_type_ids in __getitem__ -> Option C
  4. Quick Check:

    Missing token_type_ids in data causes input error [OK]
Hint: Check dataset returns all required inputs including token_type_ids [OK]
Common Mistakes:
  • Blaming TrainingArguments settings
  • Assuming model architecture is wrong
  • Thinking optimizer causes input shape errors
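One way to catch this kind of problem before training is to inspect a single item from the dataset. The sketch below uses a toy dataset class and a hypothetical `missing_keys` helper; the required key list is an assumption that depends on the model being fine-tuned.

```python
REQUIRED_KEYS = ("input_ids", "attention_mask", "token_type_ids")


class ToyQADataset:
    """Minimal map-style dataset that returns one dict of features per index."""

    def __init__(self, encodings: dict):
        self.encodings = encodings

    def __len__(self) -> int:
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx: int) -> dict:
        return {key: values[idx] for key, values in self.encodings.items()}


def missing_keys(item: dict, required=REQUIRED_KEYS) -> list:
    """Names of required model inputs that the dataset item does not provide."""
    return [key for key in required if key not in item]


# This toy encoding lacks token_type_ids, reproducing the situation in the question
dataset = ToyQADataset({
    "input_ids": [[101, 2040, 102, 9617, 102]],
    "attention_mask": [[1, 1, 1, 1, 1]],
})
print(missing_keys(dataset[0]))  # ['token_type_ids']
```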
5. You want to fine-tune a QA model on a small dataset but avoid overfitting. Which strategy is best to apply during fine-tuning?
hard
A. Use early stopping and lower learning rate
B. Increase number of epochs to 100
C. Remove the context from training data
D. Use a larger batch size without changing learning rate

Solution

  1. Step 1: Identify overfitting risk factors

    Small datasets can cause models to memorize instead of generalize, leading to overfitting.
  2. Step 2: Choose strategies to reduce overfitting

    Early stopping stops training when performance stops improving; lower learning rate helps gradual learning.
  3. Final Answer:

    Use early stopping and lower learning rate -> Option A
  4. Quick Check:

    Early stopping + low LR reduces overfitting [OK]
Hint: Stop early and slow learning to prevent overfitting [OK]
Common Mistakes:
  • Training too many epochs on small data
  • Removing context which is essential
  • Increasing batch size without adjusting learning rate
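The early stopping rule can be sketched as a small simulation over per-epoch validation losses. This mirrors the general patience idea, not the exact implementation of any library callback; the function name and the loss values are made up for illustration.

```python
from typing import List, Optional


def early_stopping_epoch(eval_losses: List[float], patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive evaluations.
    """
    best = float("inf")
    bad_evals = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch
    return None


losses = [1.2, 0.9, 0.8, 0.85, 0.9, 0.95]  # validation loss starts rising after epoch 3
print(early_stopping_epoch(losses, patience=1))      # 4
print(early_stopping_epoch(losses, patience=3))      # 6
print(early_stopping_epoch(losses[:3], patience=3))  # None
```

The last call shows why patience has to be smaller than the number of evaluations: with only three evaluations, a patience of 3 can never be reached.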