How to Use Z-Score for Outlier Detection in Python
To detect outliers with the z-score in Python, compute each data point's z-score from the mean and standard deviation, then flag points whose absolute z-score exceeds a chosen threshold (commonly 3). You can use scipy.stats.zscore or compute it manually with numpy.
Syntax
The z-score for a value is calculated as:
z = (x - mean) / std_dev
where:
- x is the data point
- mean is the average of all data points
- std_dev is the standard deviation of the data
In Python, scipy.stats.zscore(data) returns the z-score of every point. By default it uses the population standard deviation (ddof=0).
```python
from scipy.stats import zscore
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 100])

# zscore uses the population standard deviation (ddof=0) by default
z_scores = zscore(data)

# Round to two decimals for readable output
print(np.round(z_scores, 2))
```
Output
[-0.45 -0.38 -0.38 -0.34 -0.38 -0.41 -0.31  2.64]
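The manual route mentioned above can be sketched with numpy alone. This is a minimal illustration, assuming the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore; the name z_manual is only illustrative.

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 100])

# Population mean and standard deviation (ddof=0)
mean = data.mean()        # 23.0 for this sample
std = data.std(ddof=0)

# Apply z = (x - mean) / std_dev to every element at once
z_manual = (data - mean) / std

# The value 100 lands at roughly 2.64 standard deviations above the mean
print(np.round(z_manual, 2))
```

Passing ddof=1 instead would use the sample standard deviation, which gives slightly smaller absolute z-scores on the same data.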
Example
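This example shows how to detect outliers in an array of numbers using the z-score. Points with an absolute z-score greater than 3 are considered outliers. The sample holds 15 values because, with n points, no absolute z-score can exceed sqrt(n - 1), so a sample of 10 or fewer points can never cross a threshold of 3.
```python
from scipy.stats import zscore
import numpy as np

# 14 ordinary values plus one extreme value (100)
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 12, 11, 12, 13, 10, 14, 100])

z_scores = zscore(data)

# Keep only the points more than 3 standard deviations from the mean
outliers = data[np.abs(z_scores) > 3]
print("Outliers detected:", outliers)
```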
Output
Outliers detected: [100]
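If the same check is needed in several places, it can be wrapped in a small function. The sketch below is one possible shape, not a library API: zscore_outliers is a hypothetical helper name, the 15-value sample is made up for illustration, and returning an all-False mask when the standard deviation is zero is an assumed design choice.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking points whose |z| exceeds `threshold`."""
    # Converting to float raises ValueError early for non-numeric input
    arr = np.asarray(values, dtype=float)
    std = arr.std(ddof=0)
    if std == 0:
        # All values identical: z-scores are undefined, so flag nothing
        return np.zeros(arr.shape, dtype=bool)
    z = (arr - arr.mean()) / std
    return np.abs(z) > threshold

# Illustrative sample: 14 ordinary values plus one extreme value
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 12, 11, 12, 13, 10, 14, 100])
mask = zscore_outliers(data)

print("Outliers:", data[mask])
print("Cleaned:", data[~mask])
```

Returning a mask rather than the outlier values keeps both views available: data[mask] gives the flagged points and data[~mask] gives the remaining data.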
Common Pitfalls
- Using a threshold too low (like 1 or 2) flags too many points as outliers.
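- Not handling small datasets: the mean and standard deviation are unstable, and with 10 or fewer points no absolute z-score can exceed 3, so a threshold of 3 never flags anything.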
- Applying z-score on non-numeric or categorical data causes errors.
- For skewed data, z-score may not detect outliers well; consider other methods.
```python
import numpy as np

# 14 ordinary values plus one mild outlier (20)
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 12, 11, 12, 13, 10, 14, 20])
mean = np.mean(data)
std = np.std(data, ddof=0)
z_scores = (data - mean) / std

# Wrong: a threshold of 1 also flags the ordinary values at 10
outliers_wrong = data[np.abs(z_scores) > 1]

# Right: a threshold of 3 flags only the extreme value
outliers_right = data[np.abs(z_scores) > 3]

print("Wrong outliers:", outliers_wrong)
print("Right outliers:", outliers_right)
```
Output
Wrong outliers: [10 10 20]
Right outliers: [20]
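The small-dataset pitfall can be checked numerically. With the population standard deviation, the largest absolute z-score a sample of n points can produce is sqrt(n - 1). The sketch below builds a worst-case sample (n - 1 identical values plus one extreme value) to show that limit; max_abs_z is a hypothetical helper, and the value 1e6 is an arbitrary choice.

```python
import numpy as np

def max_abs_z(n):
    """Largest |z| in a worst-case sample of size n."""
    sample = np.zeros(n)
    sample[-1] = 1e6  # one extreme point; its exact magnitude does not matter
    z = (sample - sample.mean()) / sample.std(ddof=0)
    return np.abs(z).max()

# Compare the observed maximum with the theoretical limit sqrt(n - 1)
for n in (5, 8, 10, 11, 15):
    print(n, round(max_abs_z(n), 3), round(np.sqrt(n - 1), 3))
```

For n = 8 the limit is about 2.65, so an 8-value sample can never exceed a threshold of 3, however extreme one value is. The limit first passes 3 at n = 11.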
Quick Reference
Tips for using z-score to detect outliers:
- Calculate z-score as (x - mean) / std_dev.
- Common threshold is an absolute z-score of 3 (above 3 or below -3).
- Use scipy.stats.zscore for easy calculation.
- Check data type and size before applying.
- Consider data distribution; z-score works best for normal-like data.
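When the data is skewed or very small, one commonly cited alternative is the modified z-score, which replaces the mean with the median and the standard deviation with the median absolute deviation (MAD). The sketch below is a minimal illustration and is not part of the method described above: modified_zscores is a hypothetical helper, and the constant 0.6745 and the cutoff 3.5 are conventional choices, not fixed rules.

```python
import numpy as np

def modified_zscores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    arr = np.asarray(values, dtype=float)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; modified z-score is undefined for this data")
    return 0.6745 * (arr - median) / mad

data = np.array([10, 12, 12, 13, 12, 11, 14, 100])
m = modified_zscores(data)

# 3.5 is a commonly quoted cutoff for the modified z-score
outliers = data[np.abs(m) > 3.5]
print("Outliers detected:", outliers)
```

Because the median and MAD are barely affected by a single extreme value, this version flags 100 even in an 8-value sample, where the ordinary z-score stays below 3.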
Key Takeaways
Calculate z-score to measure how far a point is from the mean in standard deviations.
Flag points with absolute z-score above 3 as outliers in most cases.
Use scipy.stats.zscore for simple and reliable z-score calculation.
Avoid too low thresholds to prevent false outlier detection.
Z-score works best on numeric data with roughly normal distribution.