How to Use Z-Score for Outlier Detection in Python

To detect outliers with the z-score in Python, compute each data point's z-score from the mean and standard deviation, then flag points whose absolute z-score exceeds a chosen threshold (commonly 3). You can use scipy.stats.zscore or compute it manually with numpy.

Syntax

The z-score for a value is calculated as:

z = (x - mean) / std_dev

where:

  • x is the data point
  • mean is the average of all data points
  • std_dev is the standard deviation of the data

In Python, scipy.stats.zscore(data) returns the z-scores of all points in one call. It uses the population standard deviation (ddof=0) unless you pass a different ddof.

python
from scipy.stats import zscore
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 100])
z_scores = zscore(data)
print(np.round(z_scores, 2))
Output
[-0.45 -0.38 -0.38 -0.34 -0.38 -0.41 -0.31  2.64]
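
The same formula can also be computed by hand with numpy. The sketch below is one minimal way to do it: the helper name manual_zscore is made up for illustration, and the ddof parameter is exposed so you can choose between the population standard deviation (ddof=0, the scipy.stats.zscore default) and the sample standard deviation (ddof=1).

```python
import numpy as np

def manual_zscore(values, ddof=0):
    """Hypothetical helper: z = (x - mean) / std_dev for every point."""
    arr = np.asarray(values, dtype=float)
    std = arr.std(ddof=ddof)
    if np.isclose(std, 0.0):
        # All values identical: dividing by zero would give NaN or inf
        raise ValueError("standard deviation is zero; z-score is undefined")
    return (arr - arr.mean()) / std

data = [10, 12, 12, 13, 12, 11, 14, 100]
z_pop = manual_zscore(data)             # population std (ddof=0)
z_sample = manual_zscore(data, ddof=1)  # sample std (ddof=1)

print(np.round(z_pop, 2))
print(np.round(z_sample, 2))
```

The sample standard deviation is slightly larger, so the ddof=1 z-scores are slightly smaller in magnitude. The gap matters mostly for small datasets.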

Example

This example shows how to detect outliers in a list of numbers using z-score. Points with absolute z-score greater than 3 are considered outliers.

python
from scipy.stats import zscore
import numpy as np

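# With the population standard deviation (zscore's default, ddof=0), |z| can
# never exceed sqrt(n - 1), so a threshold of 3 needs at least 11 data points.
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 11, 12, 10, 14, 12, 13, 11, 100])
z_scores = zscore(data)
outliers = data[np.abs(z_scores) > 3]  # boolean mask keeps only |z| > 3
print("Outliers detected:", outliers)
Output
Outliers detected: [100]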
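
If you also need to know where the outliers sit, the masking step can be wrapped in a small function. This is only a sketch: zscore_outliers is a hypothetical name, the sample values are invented, and the z-scores are computed with numpy using the population standard deviation so that they match the scipy.stats.zscore default.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Hypothetical helper: indices, values and z-scores of points with |z| > threshold."""
    arr = np.asarray(values, dtype=float)
    z = (arr - arr.mean()) / arr.std()  # population std (ddof=0)
    idx = np.flatnonzero(np.abs(z) > threshold)
    return idx, arr[idx], z[idx]

# Invented sample: 15 ordinary values and one extreme value at the end
readings = [10, 12, 12, 13, 12, 11, 14, 13, 11, 12, 10, 14, 12, 13, 11, 100]
idx, values, scores = zscore_outliers(readings)
for i, v, z in zip(idx, values, scores):
    print(f"index {i}: value {v:g}, z-score {z:.2f}")
# prints: index 15: value 100, z-score 3.87
```

Returning the indices makes it easy to drop or inspect the flagged rows in the original data.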

Common Pitfalls

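  • A threshold that is too low (such as 1 or 2) flags ordinary points as outliers.
  • Small datasets give unstable means and standard deviations, and the outlier itself inflates both. With the population standard deviation, no point among n values can have an absolute z-score above sqrt(n - 1), so a threshold of 3 can never be exceeded with 10 or fewer points.
  • Z-scores are defined only for numeric data; applying them to non-numeric or categorical data causes errors.
  • For skewed data, z-score may not detect outliers well; consider other methods.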
python
import numpy as np

# 19 ordinary readings plus one extreme value (90)
data = np.array([35, 40, 42, 45, 46, 48, 48, 50, 50, 50,
                 50, 50, 52, 52, 54, 55, 58, 60, 65, 90])
mean = np.mean(data)
std = np.std(data, ddof=0)  # population standard deviation
z_scores = (data - mean) / std

# Wrong: a threshold of 1 also flags ordinary points
outliers_wrong = data[np.abs(z_scores) > 1]

# Right: a threshold of 3 flags only the extreme value
outliers_right = data[np.abs(z_scores) > 3]

print("Wrong outliers:", outliers_wrong)
print("Right outliers:", outliers_right)
Output
Wrong outliers: [35 40 65 90]
Right outliers: [90]
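
For the skewed-data and small-sample pitfalls, one commonly used alternative is the modified z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below is a different technique from the plain z-score described in this article. The function name modified_zscore is hypothetical, and the 0.6745 scaling constant and 3.5 cutoff are the conventional choices for this variant, assumed here rather than taken from the article.

```python
import numpy as np

def modified_zscore(values):
    """Hypothetical helper: 0.6745 * (x - median) / MAD for every point."""
    arr = np.asarray(values, dtype=float)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))  # median absolute deviation
    if mad == 0:
        # More than half of the values equal the median; the score is undefined
        raise ValueError("MAD is zero; modified z-score is undefined")
    return 0.6745 * (arr - median) / mad

data = np.array([10, 12, 12, 13, 12, 11, 14, 100])
scores = modified_zscore(data)
flagged = data[np.abs(scores) > 3.5]  # assumed cutoff for this variant
print("Modified z-score outliers:", flagged)
# prints: Modified z-score outliers: [100]
```

Because the median and MAD are barely affected by the extreme value, this variant flags 100 even in an 8-point sample, where the plain z-score cannot exceed 3.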

Quick Reference

Tips for using z-score to detect outliers:

  • Calculate z-score as (x - mean) / std_dev.
  • A common cutoff is an absolute z-score above 3 (z > 3 or z < -3).
  • Use scipy.stats.zscore for easy calculation.
  • Check data type and size before applying.
  • Consider data distribution; z-score works best for normal-like data.
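
The "check data type and size" tip can be turned into a small guard that runs before any z-score is computed. This is a sketch under stated assumptions: check_zscore_input is a hypothetical helper, and the size rule relies on the fact that, with the population standard deviation, no point among n values can have an absolute z-score above sqrt(n - 1).

```python
import numpy as np

def check_zscore_input(values, threshold=3.0):
    """Hypothetical guard: return a float array, or raise ValueError."""
    arr = np.asarray(values)
    if not np.issubdtype(arr.dtype, np.number):
        raise ValueError(f"expected numeric data, got dtype {arr.dtype}")
    arr = arr.astype(float)
    if np.isnan(arr).any():
        raise ValueError("data contains NaN; handle missing values first")
    n = arr.size
    # With ddof=0, |z| <= sqrt(n - 1), so tiny samples can never pass the threshold
    if n < 2 or np.sqrt(n - 1) <= threshold:
        raise ValueError(f"only {n} points: |z| cannot exceed {threshold}")
    return arr

try:
    check_zscore_input([10, 12, 12, 13, 12, 11, 14, 100])
except ValueError as err:
    print("Rejected:", err)
```

For a threshold of 3, the guard accepts samples of 11 points or more and rejects strings, missing values, and smaller samples.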

Key Takeaways

Calculate z-score to measure how far a point is from the mean in standard deviations.
Flag points with absolute z-score above 3 as outliers in most cases.
Use scipy.stats.zscore for simple and reliable z-score calculation.
Avoid thresholds that are too low, since they flag ordinary points as outliers.
Z-score works best on numeric data with roughly normal distribution.