What Metrics to Monitor for ML Model Performance and Health
To monitor an ML model, track loss and accuracy during training for an overall view of performance. For classification, also monitor precision, recall, and F1-score to understand the types of errors the model makes. For regression, use mean squared error (MSE) or mean absolute error (MAE) to measure prediction quality.
Syntax
Common metrics for ML models are calculated with simple function calls or methods from libraries such as scikit-learn or TensorFlow.
- loss: Measures how far predictions are from true values during training.
- accuracy: Percentage of correct predictions.
- precision: How many predicted positives are actually positive.
- recall: How many actual positives are correctly predicted.
- F1-score: Harmonic mean of precision and recall.
- mean squared error (MSE): Average squared difference between predicted and actual values.
- mean absolute error (MAE): Average absolute difference between predicted and actual values.
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    mean_squared_error, mean_absolute_error,
)

# Example predictions and true labels
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1]

# Classification metrics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Regression example
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.1, 7.8]
mse = mean_squared_error(y_true_reg, y_pred_reg)
mae = mean_absolute_error(y_true_reg, y_pred_reg)
```
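Under the hood, these classification metrics reduce to counts of true/false positives and negatives. As a plain-Python sketch of the formulas, using the same illustrative labels as above:

```python
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1]

# Count the outcome types for the positive class (label 1)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)  # predicted positives that were correct
recall = tp / (tp + fn)     # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

These hand-computed values match what the scikit-learn calls above return for the same inputs.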
Example
This example shows how to calculate key classification and regression metrics using scikit-learn. It demonstrates how to interpret model predictions with these metrics.
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    mean_squared_error, mean_absolute_error,
)

# Classification example
true_labels = [1, 0, 1, 1, 0, 0, 1]
pred_labels = [1, 0, 0, 1, 0, 1, 1]

accuracy = accuracy_score(true_labels, pred_labels)
precision = precision_score(true_labels, pred_labels)
recall = recall_score(true_labels, pred_labels)
f1 = f1_score(true_labels, pred_labels)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

# Regression example
true_values = [2.5, 0.0, 2.1, 7.8]
pred_values = [3.0, -0.5, 2.0, 7.0]

mse = mean_squared_error(true_values, pred_values)
mae = mean_absolute_error(true_values, pred_values)

print(f"Mean Squared Error: {mse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
```
Output
Accuracy: 0.71
Precision: 0.75
Recall: 0.75
F1-score: 0.75
Mean Squared Error: 0.29
Mean Absolute Error: 0.47
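The loss tracked during training is not shown in the example above; for classifiers that output probabilities, it is typically log loss (cross-entropy). As a minimal sketch of what that number means, it can be computed by hand from predicted probabilities (scikit-learn's log_loss function gives the same result); the labels and probabilities here are illustrative:

```python
import math

# Illustrative true labels and the model's predicted probability of class 1
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.8, 0.6]

# Log loss (cross-entropy): average negative log-probability of the true class
loss = -sum(
    math.log(p) if t == 1 else math.log(1 - p)
    for t, p in zip(y_true, y_prob)
) / len(y_true)
print(f"Log loss: {loss:.4f}")  # lower is better; near 0 means confident, correct predictions
```

Unlike accuracy, loss drops as the model becomes more confident in correct predictions, which makes it the natural quantity to watch epoch by epoch during training.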
Common Pitfalls
Many beginners make these mistakes when monitoring ML metrics:
- Relying only on accuracy for imbalanced data, which can be misleading.
- Ignoring precision and recall when false positives or false negatives matter.
- Not monitoring loss during training to detect overfitting or underfitting.
- Using regression metrics on classification tasks, or vice versa.
- Failing to track metrics on validation/test data, only on training data.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Wrong: relying on accuracy alone with imbalanced data
true = [0, 0, 0, 0, 1]
pred = [0, 0, 0, 0, 0]

acc = accuracy_score(true, pred)
prec = precision_score(true, pred, zero_division=0)
print(f"Accuracy: {acc:.2f}")    # High accuracy, but the model misses the positive
print(f"Precision: {prec:.2f}")  # Precision reveals the problem

# Right: check precision and recall together
rec = recall_score(true, pred, zero_division=0)
print(f"Recall: {rec:.2f}")
```
Output
Accuracy: 0.80
Precision: 0.00
Recall: 0.00
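The last pitfall, tracking metrics only on training data, deserves its own check. A minimal sketch, assuming scikit-learn is available; the decision tree and synthetic dataset here are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset for illustration; real monitoring uses your own data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained tree can memorize the training data perfectly
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
# A large gap between the two numbers is a sign of overfitting
```

Computing every metric on a held-out split, not just the training set, is what turns these numbers into an early warning for poor generalization.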
Quick Reference
Here is a quick summary of key ML metrics to monitor:
| Metric | Type | What it Measures | When to Use |
|---|---|---|---|
| Loss | Training | How well model fits training data | Always during training |
| Accuracy | Classification | Percent correct predictions | Balanced classes |
| Precision | Classification | Correct positive predictions | When false positives are costly |
| Recall | Classification | Detected actual positives | When missing positives is costly |
| F1-score | Classification | Balance of precision and recall | Imbalanced classes |
| Mean Squared Error (MSE) | Regression | Average squared error | Regression tasks |
| Mean Absolute Error (MAE) | Regression | Average absolute error | Regression tasks |
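The two regression metrics in the table reduce to one-line formulas; a plain-Python sketch using the same example values as earlier:

```python
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.1, 7.8]
n = len(y_true)

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
print(f"MSE: {mse:.4f}")  # squares the errors, so large misses dominate
print(f"MAE: {mae:.4f}")  # stays in the same units as the target
```

The squaring in MSE makes it more sensitive to outliers than MAE, which is often the deciding factor when choosing between them.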
Key Takeaways
- Always monitor loss and accuracy during training to track model learning.
- Use precision, recall, and F1-score for classification, especially with imbalanced data.
- For regression, track mean squared error and mean absolute error to measure prediction quality.
- Avoid relying on a single metric; combine multiple metrics for a full picture.
- Monitor metrics on validation/test data to detect overfitting or poor generalization.