What Metrics to Monitor for ML Model Performance and Health
To monitor an ML model, track loss and accuracy during training for an overall view of performance. For classification, also monitor precision, recall, and F1-score to understand the types of errors the model makes. For regression, use mean squared error (MSE) or mean absolute error (MAE) to measure prediction quality.
Syntax
Common metrics for ML models are calculated with simple function calls or methods from libraries such as scikit-learn or TensorFlow.
- loss: Measures how far predictions are from true values during training.
- accuracy: Percentage of correct predictions.
- precision: How many predicted positives are actually positive.
- recall: How many actual positives are correctly predicted.
- F1-score: Harmonic mean of precision and recall.
- mean squared error (MSE): Average squared difference between predicted and actual values.
- mean absolute error (MAE): Average absolute difference between predicted and actual values.
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    mean_squared_error, mean_absolute_error,
)

# Example predictions and true labels
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1]

# Classification metrics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Regression example
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.1, 7.8]
mse = mean_squared_error(y_true_reg, y_pred_reg)
mae = mean_absolute_error(y_true_reg, y_pred_reg)
```
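Under the hood, these classification metrics reduce to counts of true/false positives and negatives. As a plain-Python sketch of the formulas, using the same illustrative labels as above:

```python
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1]

# Count the outcome types for the positive class (label 1)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)  # predicted positives that were correct
recall = tp / (tp + fn)     # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

These hand-computed values match what the scikit-learn calls above return for the same inputs.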
Example
This example shows how to calculate key classification and regression metrics using scikit-learn. It demonstrates how to interpret model predictions with these metrics.
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    mean_squared_error, mean_absolute_error,
)

# Classification example
true_labels = [1, 0, 1, 1, 0, 0, 1]
pred_labels = [1, 0, 0, 1, 0, 1, 1]

accuracy = accuracy_score(true_labels, pred_labels)
precision = precision_score(true_labels, pred_labels)
recall = recall_score(true_labels, pred_labels)
f1 = f1_score(true_labels, pred_labels)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

# Regression example
true_values = [2.5, 0.0, 2.1, 7.8]
pred_values = [3.0, -0.5, 2.0, 7.0]

mse = mean_squared_error(true_values, pred_values)
mae = mean_absolute_error(true_values, pred_values)

print(f"Mean Squared Error: {mse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
```
Output
Accuracy: 0.71
Precision: 0.75
Recall: 0.75
F1-score: 0.75
Mean Squared Error: 0.29
Mean Absolute Error: 0.47
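The loss tracked during training is not shown in the example above; for classifiers that output probabilities, it is typically log loss (cross-entropy). As a minimal sketch of what that number means, it can be computed by hand from predicted probabilities (scikit-learn's log_loss function gives the same result); the labels and probabilities here are illustrative:

```python
import math

# Illustrative true labels and the model's predicted probability of class 1
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.8, 0.6]

# Log loss (cross-entropy): average negative log-probability of the true class
loss = -sum(
    math.log(p) if t == 1 else math.log(1 - p)
    for t, p in zip(y_true, y_prob)
) / len(y_true)
print(f"Log loss: {loss:.4f}")  # lower is better; near 0 means confident, correct predictions
```

Unlike accuracy, loss drops as the model becomes more confident in correct predictions, which makes it the natural quantity to watch epoch by epoch during training.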
Common Pitfalls
Many beginners make these mistakes when monitoring ML metrics:
- Relying only on accuracy for imbalanced data, which can be misleading.
- Ignoring precision and recall when false positives or false negatives matter.
- Not monitoring loss during training to detect overfitting or underfitting.
- Using regression metrics on classification tasks, or vice versa.
- Failing to track metrics on validation/test data, only on training data.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Wrong: relying on accuracy alone with imbalanced data
true = [0, 0, 0, 0, 1]
pred = [0, 0, 0, 0, 0]

acc = accuracy_score(true, pred)
prec = precision_score(true, pred, zero_division=0)
print(f"Accuracy: {acc:.2f}")    # High accuracy, but the model misses the positive
print(f"Precision: {prec:.2f}")  # Precision reveals the problem

# Right: check precision and recall together
rec = recall_score(true, pred, zero_division=0)
print(f"Recall: {rec:.2f}")
```
Output
Accuracy: 0.80
Precision: 0.00
Recall: 0.00
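The last pitfall, tracking metrics only on training data, deserves its own check. A minimal sketch, assuming scikit-learn is available; the decision tree and synthetic dataset here are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset for illustration; real monitoring uses your own data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained tree can memorize the training data perfectly
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
# A large gap between the two numbers is a sign of overfitting
```

Computing every metric on a held-out split, not just the training set, is what turns these numbers into an early warning for poor generalization.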
Quick Reference
Here is a quick summary of key ML metrics to monitor:
| Metric | Type | What it Measures | When to Use |
|---|---|---|---|
| Loss | Training | How well model fits training data | Always during training |
| Accuracy | Classification | Percent correct predictions | Balanced classes |
| Precision | Classification | Correct positive predictions | When false positives are costly |
| Recall | Classification | Detected actual positives | When missing positives is costly |
| F1-score | Classification | Balance of precision and recall | Imbalanced classes |
| Mean Squared Error (MSE) | Regression | Average squared error | Regression tasks |
| Mean Absolute Error (MAE) | Regression | Average absolute error | Regression tasks |
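The two regression metrics in the table reduce to one-line formulas; a plain-Python sketch using the same example values as earlier:

```python
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.1, 7.8]
n = len(y_true)

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
print(f"MSE: {mse:.4f}")  # squares the errors, so large misses dominate
print(f"MAE: {mae:.4f}")  # stays in the same units as the target
```

The squaring in MSE makes it more sensitive to outliers than MAE, which is often the deciding factor when choosing between them.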
Key Takeaways
- Always monitor loss and accuracy during training to track model learning.
- Use precision, recall, and F1-score for classification, especially with imbalanced data.
- For regression, track mean squared error and mean absolute error to measure prediction quality.
- Avoid relying on a single metric; combine multiple metrics for a full picture.
- Monitor metrics on validation/test data to detect overfitting or poor generalization.