How to Remove Outliers in Python: Simple Methods Explained

To remove outliers in Python, you can use the Interquartile Range (IQR) method or the Z-score method. These methods identify data points that are unusually high or low and filter them out from your dataset.

Syntax

Here are two common ways to remove outliers in Python:

  • IQR method: Calculate the first quartile (Q1) and third quartile (Q3), then find the interquartile range (IQR = Q3 - Q1). Outliers are points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
  • Z-score method: Calculate the Z-score for each data point, which measures how many standard deviations it is from the mean. Points whose absolute Z-score exceeds a threshold (commonly 3) are outliers.
python
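import numpy as np

def remove_outliers_iqr(data):
    """Keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    data: a NumPy array (convert plain lists with np.array first).
    """
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data >= lower_bound) & (data <= upper_bound)]

def remove_outliers_zscore(data, threshold=3):
    """Keep values whose absolute Z-score is below the threshold."""
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)
    if std == 0:
        return data  # all values are identical, so nothing to remove
    z_scores = (data - mean) / std
    return data[np.abs(z_scores) < threshold]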

Example

This example shows how to remove outliers from a NumPy array of numbers using both the IQR and Z-score methods.

python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 100, 12, 13, 11, 12])

# IQR method
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
filtered_iqr = data[(data >= lower_bound) & (data <= upper_bound)]

# Z-score method
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std
filtered_zscore = data[np.abs(z_scores) < 3]

print("Original data:", data)
print("After IQR filtering:", filtered_iqr)
print("After Z-score filtering:", filtered_zscore)
Output
Original data: [ 10  12  12  13  12  11  14 100  12  13  11  12]
After IQR filtering: [10 12 12 13 12 11 14 12 13 11 12]
After Z-score filtering: [10 12 12 13 12 11 14 12 13 11 12]
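
To see why both methods drop 100, it can help to print the intermediate values. The sketch below reuses the same `data` array; the lowercase variable names and the rounding are illustrative choices, not part of the example above.

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 100, 12, 13, 11, 12])

# IQR fences (np.percentile interpolates linearly by default)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print("Q1, Q3, IQR:", q1, q3, iqr)
print("Bounds:", lower_bound, upper_bound)

# Z-scores using the population standard deviation (np.std default, ddof=0)
z_scores = (data - data.mean()) / data.std()
max_abs_z = float(np.abs(z_scores).max())
print("Largest |z|:", round(max_abs_z, 2))
```

For this array the fences come out to 9.875 and 14.875, and 100 has a Z-score of about 3.31, so it falls outside both rules, though only narrowly for the Z-score.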

Common Pitfalls

  • Using a fixed threshold without understanding your data can remove valid points.
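  • Applying the Z-score method without checking the distribution can misidentify outliers: it assumes roughly normal data, and extreme values inflate the mean and standard deviation it relies on.
  • For small datasets, neither method is reliable: quartiles and standard deviations estimated from a handful of points are unstable, and a Z-score threshold of 3 may be impossible to reach.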
  • For multi-dimensional data, apply outlier removal on each feature separately or use advanced methods.
python
import numpy as np

data = np.array([1, 2, 2, 3, 2, 1000])

# Threshold of 1: too strict in general. On normally distributed data it
# would discard about 32% of the points. Here it happens to drop only 1000.
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std
filtered_strict = data[np.abs(z_scores) < 1]

# Threshold of 3: the usual default, but too lenient for this sample.
# With only 6 points no |z| reaches 3 (the largest is about 2.24), so 1000 stays.
filtered_default = data[np.abs(z_scores) < 3]

print("Threshold 1:", filtered_strict)
print("Threshold 3:", filtered_default)
Output
Threshold 1: [1 2 2 3 2]
Threshold 3: [   1    2    2    3    2 1000]
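
The small-sample pitfall can be checked directly. With the population standard deviation that `np.std` uses by default, no absolute Z-score can exceed sqrt(n - 1), so a threshold of 3 cannot flag anything in a sample of fewer than 10 points. The sketch below prints that ceiling for the six-point array above and, for comparison, applies the IQR rule from the Syntax section to the same data; the variable names are illustrative.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 2, 1000])
n = data.size

# Ceiling on |z| when the standard deviation uses ddof=0 (np.std default)
z_ceiling = float(np.sqrt(n - 1))
z_scores = (data - data.mean()) / data.std()
max_abs_z = float(np.abs(z_scores).max())
print("Largest |z| in data:", round(max_abs_z, 3))
print("Ceiling sqrt(n - 1):", round(z_ceiling, 3))

# IQR rule on the same data
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
filtered_iqr = data[(data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)]
print("After IQR filtering:", filtered_iqr)
```

On this array the IQR rule removes 1000 where the Z-score rule at threshold 3 cannot, which is one reason to prefer quartile-based fences when the sample is small.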

Quick Reference

Tips for removing outliers in Python:

  • Use IQR for skewed data or when you want a simple rule.
  • Use Z-score for normally distributed data.
  • Always visualize data (boxplots, histograms) before and after filtering.
  • Adjust thresholds based on your data context.
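
Adjusting the threshold is easiest when it is a parameter. A minimal sketch, assuming a hypothetical helper `iqr_filter` with a multiplier argument `k` (neither name appears in the functions above) and made-up sample values: 1.5 is the conventional multiplier, and a larger value such as 3 removes only the more extreme points.

```python
import numpy as np

def iqr_filter(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper."""
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    return values[mask]

sample = [10, 11, 12, 12, 13, 13, 14, 18, 40]  # illustrative data

strict = iqr_filter(sample, k=1.5)  # narrower fences
loose = iqr_filter(sample, k=3.0)   # wider fences
print("k=1.5:", strict)
print("k=3.0:", loose)
```

With these values, `k=1.5` drops both 18 and 40, while `k=3.0` drops only 40. Which of the two is appropriate depends on whether a value like 18 is plausible in your data.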

Key Takeaways

Use IQR or Z-score methods to identify and remove outliers in Python datasets.
Choose the method based on your data distribution and size.
Visualize data before and after removing outliers to ensure correctness.
Avoid overly strict thresholds that remove valid data points.
For complex data, consider feature-wise or advanced outlier detection methods.
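
For the feature-wise approach, one option is to compute the IQR fences per column and keep only the rows where every feature lies inside its own fences. The sketch below assumes a 2-D NumPy array with one row per sample and one column per feature; the function name `remove_outlier_rows`, the `k` argument, and the sample values are hypothetical.

```python
import numpy as np

def remove_outlier_rows(X, k=1.5):
    """Drop rows where any column falls outside its own IQR fences."""
    X = np.asarray(X, dtype=float)
    q1, q3 = np.percentile(X, [25, 75], axis=0)  # one value per column
    iqr = q3 - q1
    inside = (X >= q1 - k * iqr) & (X <= q3 + k * iqr)
    return X[inside.all(axis=1)]

# Illustrative data: two made-up features per row
X = np.array([
    [10, 1.0],
    [12, 1.2],
    [11, 0.9],
    [13, 1.1],
    [12, 9.5],   # extreme in the second column
    [95, 1.0],   # extreme in the first column
    [11, 1.1],
    [12, 1.0],
])

cleaned = remove_outlier_rows(X)
print(cleaned)
```

This treats each feature independently, so it will not catch a row that is unusual only in the combination of its values. That case calls for a multivariate method.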