When fine-tuning models with Hugging Face, the key metrics depend on the task. For text classification, accuracy shows how many texts are correctly labeled. For tasks like question answering or summarization, metrics like F1 score or ROUGE measure quality better. These metrics help us know if the model learned well from new data.
Hugging Face fine-tuning in Prompt Engineering / GenAI - Model Metrics & Evaluation
Actual \ Predicted | Positive | Negative
-------------------|----------|---------
Positive | 80 | 20
Negative | 10 | 90
Reading each row as the actual class, the table gives 80 true positives (TP), 20 false negatives (FN), 10 false positives (FP), and 90 true negatives (TN). These four counts are all that is needed to calculate precision, recall, and accuracy.
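As a worked example, the metrics can be computed directly from the table. Reading rows as the actual class, the Positive row gives TP = 80 and FN = 20, and the Negative row gives FP = 10 and TN = 90. The variable names below are illustrative only.

```python
# Counts read from the confusion matrix above (rows = actual, columns = predicted).
tp, fn = 80, 20   # actual positives: predicted positive / predicted negative
fp, tn = 10, 90   # actual negatives: predicted positive / predicted negative

precision = tp / (tp + fp)                    # of predicted positives, how many were right
recall = tp / (tp + fn)                       # of actual positives, how many were found
accuracy = (tp + tn) / (tp + fn + fp + tn)    # overall fraction correct
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f} f1={f1:.3f}")
# prints: precision=0.889 recall=0.800 accuracy=0.850 f1=0.842
```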
Fine-tuning a spam detector: high precision means fewer good emails marked as spam (less annoyance). High recall means catching most spam emails (better filtering). Depending on what matters more, you adjust the model or threshold.
For medical text classification, high recall is critical to catch all disease mentions, even if some false alarms happen.
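The trade-off between precision and recall can be sketched by sweeping the decision threshold. The scores and labels below are made up for illustration (they are not model output), and `precision_recall_at` is a hypothetical helper, not a library function.

```python
def precision_recall_at(threshold, scores, labels):
    """Return (precision, recall) when scores >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up spam probabilities and true labels (1 = spam), for illustration only.
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.85):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
# prints:
# threshold=0.30 precision=0.57 recall=1.00
# threshold=0.50 precision=0.60 recall=0.75
# threshold=0.85 precision=1.00 recall=0.50
```

On this toy data, a low threshold suits the medical case (nothing is missed, at the cost of false alarms), while a high threshold suits a spam filter that must not flag good emails.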
Good: Accuracy above 85%, F1 score above 0.8, balanced precision and recall close to each other.
Bad: Accuracy near random chance (e.g., 50% for two classes), very low recall (missing many positives), or very low precision (too many false alarms).
- Accuracy paradox: High accuracy but poor recall if data is imbalanced.
- Data leakage: Training data accidentally includes test examples, inflating metrics.
- Overfitting: Training metrics look great but test metrics drop, showing poor generalization.
- Ignoring task-specific metrics: Using accuracy for generation tasks where BLEU or ROUGE is better.
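One way to guard against the accuracy paradox is to report precision, recall, and F1 next to accuracy during evaluation. The Hugging Face Trainer accepts a compute_metrics callable that receives predictions and labels. The sketch below is a minimal version for binary classification, written in plain Python so it runs without extra dependencies. It assumes label 1 is the positive class. In practice a metrics library such as evaluate or scikit-learn is often used instead.

```python
def compute_metrics(eval_pred):
    """Report precision, recall, and F1 next to accuracy for binary labels (1 = positive)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]

    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}

# Toy check with hand-written logits: 4 examples, 2 classes.
toy_logits = [[0.1, 2.0], [1.5, 0.2], [0.3, 0.9], [2.2, 0.1]]
toy_labels = [1, 0, 0, 1]
print(compute_metrics((toy_logits, toy_labels)))
# prints: {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

The function would be passed as `compute_metrics=compute_metrics` when constructing the Trainer.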
Your fine-tuned model shows 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most fraud cases, which is dangerous. High accuracy is misleading because fraud is rare, so the model mostly predicts non-fraud correctly but fails to catch fraud.
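The arithmetic behind this answer can be sketched with hypothetical counts. The numbers below (10,000 transactions, a 2% fraud rate) are assumptions chosen only because they reproduce 98% accuracy and 12% recall. They do not come from a real dataset.

```python
# Hypothetical counts chosen to reproduce 98% accuracy and 12% recall.
total = 10_000
fraud = 200                 # assumed 2% fraud rate
tp = 24                     # fraud cases caught
fn = fraud - tp             # 176 fraud cases missed
fp = 24                     # legitimate transactions wrongly flagged
tn = total - fraud - fp     # 9,776 legitimate transactions passed

accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# Baseline that never predicts fraud: every legitimate transaction counts as correct.
baseline_accuracy = (total - fraud) / total

print(f"model: accuracy={accuracy:.2%} recall={recall:.2%}")
print(f"always-non-fraud baseline: accuracy={baseline_accuracy:.2%} recall=0.00%")
# prints:
# model: accuracy=98.00% recall=12.00%
# always-non-fraud baseline: accuracy=98.00% recall=0.00%
```

Under these assumptions, a model that never flags anything reaches the same accuracy, which is why accuracy alone says little here.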
Practice
Solution
Step 1: Understand what fine-tuning means
Fine-tuning means taking a model already trained on a large dataset and adjusting it to work well on a new, specific task.
Step 2: Identify the purpose in Hugging Face context
Hugging Face fine-tuning adapts the pre-trained model's knowledge to your task, improving accuracy without training from scratch.
Final Answer:
To adapt the model to perform well on a specific new task -> Option A
Quick Check:
Fine-tuning = adapt model to new task [OK]
- Thinking fine-tuning trains a model from scratch
- Confusing fine-tuning with model compression
- Assuming fine-tuning changes the programming language
Solution
Step 1: Recall the correct class name and parameters
The Hugging Face library uses the class TrainingArguments with parameters like output_dir and num_train_epochs.
Step 2: Match the correct syntax
training_args = TrainingArguments(output_dir='output', num_train_epochs=3) uses the correct class name and parameter names exactly as in the Hugging Face API.
Final Answer:
training_args = TrainingArguments(output_dir='output', num_train_epochs=3) -> Option D
Quick Check:
TrainingArguments with output_dir and num_train_epochs [OK]
- Using wrong class names like TrainerArguments or TrainArgs
- Using incorrect parameter names like epochs instead of num_train_epochs
- Confusing Trainer and TrainingArguments classes
What does the following code print?
from datasets import load_dataset
from transformers import AutoTokenizer
# split='train[:1%]' returns a single Dataset (not a DatasetDict), so rows are indexed directly
dataset = load_dataset('imdb', split='train[:1%]')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Pad or truncate every review to exactly 128 tokens
tokenized_datasets = dataset.map(lambda x: tokenizer(x['text'], truncation=True, padding='max_length', max_length=128))
print(len(tokenized_datasets[0]['input_ids']))
Solution
Step 1: Understand tokenizer parameters
The tokenizer is called with padding='max_length' and max_length=128, so all sequences are padded or truncated to length 128.
Step 2: Check the length of input_ids
Since padding to max_length is applied, each tokenized input's input_ids list length is exactly 128.
Final Answer:
128 -> Option B
Quick Check:
Padding to max_length = fixed length 128 [OK]
- Assuming variable length without padding
- Confusing max_length with 512 default
- Expecting an error due to missing batched=True
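The reason the length is always 128 can be illustrated with a toy stand-in for what truncation=True combined with padding='max_length' does to a list of token ids. This is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, `pad_id=0` is an assumed padding id, and special tokens are ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for truncation=True with padding='max_length'."""
    clipped = token_ids[:max_length]                          # truncate long inputs
    return clipped + [pad_id] * (max_length - len(clipped))   # pad short inputs

short_review = list(range(1, 11))    # 10 hypothetical token ids
long_review = list(range(1, 301))    # 300 hypothetical token ids

print(len(pad_or_truncate(short_review, 128)))  # prints 128
print(len(pad_or_truncate(long_review, 128)))   # prints 128
```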
TypeError: Trainer() missing 1 required positional argument: 'model'. What is the likely fix?
Solution
Step 1: Understand the error message
The error says the Trainer constructor needs a 'model' argument but it was not provided.
Step 2: Fix by providing the model
When creating a Trainer, you must pass the pre-trained model as the 'model' parameter to avoid this error.
Final Answer:
Pass the pre-trained model as the 'model' argument when creating Trainer -> Option C
Quick Check:
Trainer requires model argument [OK]
- Forgetting to pass model to Trainer
- Confusing Trainer with TrainingArguments
- Calling train() before creating Trainer
Solution
Step 1: Identify overfitting prevention methods
Training for fewer epochs and evaluating regularly with early stopping halts training before the model starts to overfit.
Step 2: Evaluate options for best practice
The option "Set num_train_epochs=3 and use evaluation_strategy='steps' with early stopping" combines a moderate number of epochs with periodic evaluation and early stopping, which makes it the best choice for avoiding overfitting. (Newer transformers releases name this parameter eval_strategy.)
Final Answer:
Set num_train_epochs=3 and use evaluation_strategy='steps' with early stopping -> Option A
Quick Check:
Early stopping + moderate epochs prevent overfitting [OK]
- Using too many epochs causing overfitting
- Setting learning rate too high or too low
- Ignoring evaluation and early stopping
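The idea behind early stopping can be sketched in a few lines of plain Python. This illustrates the patience rule only and is not the library's implementation. In transformers, early stopping is typically enabled by passing EarlyStoppingCallback to the Trainer together with load_best_model_at_end=True. The function name and the loss values below are made up for illustration.

```python
def stopping_epoch(eval_losses, patience=2):
    """Return the 1-based epoch at which training stops, given per-epoch eval losses."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss                        # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1    # no improvement this epoch
            if epochs_without_improvement >= patience:
                return epoch                   # patience exhausted: stop here
    return len(eval_losses)                    # never triggered: ran all epochs

# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.62, 0.48, 0.41, 0.43, 0.45, 0.50]
print(stopping_epoch(losses, patience=2))  # prints 5
```

With these values the best checkpoint is from epoch 3, and training stops at epoch 5 after two evaluations without improvement.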
