
Data distributions and outliers in ML Python

Introduction

Understanding a data distribution shows how the values are spread: where they cluster, how wide they range, and whether they lean to one side. Detecting outliers reveals unusual points that can distort a model during training. Typical uses:

Checking if your data is mostly centered around a value or spread out.
Finding strange values that don't fit the usual pattern in your data.
Deciding if you need to clean or transform data before training a model.
Understanding the shape of data to choose the right machine learning method.
Visualizing data to explain results to others clearly.
Syntax
ML Python
import numpy as np
import matplotlib.pyplot as plt

# Example data
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Calculate basic statistics
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)

# Detect outliers using IQR method
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]

# Plot histogram to see distribution
plt.hist(data, bins=20)
plt.show()

Use the mean and median to understand the center of the data.

The IQR (interquartile range) is Q3 - Q1, the spread of the middle 50% of the data. Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as outliers.
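The IQR rule can be wrapped in a small helper so the fences are computed in one place. This is a minimal sketch: the function name `iqr_bounds` and the parameter `k` are illustrative choices, not part of NumPy, and 1.5 is the conventional multiplier rather than a fixed requirement.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    iqr_bounds and k are hypothetical names chosen for this sketch.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = np.array([1, 2, 2, 3, 4, 100])
lower, upper = iqr_bounds(data)

# Keep only the points that fall outside the fences
outliers = data[(data < lower) | (data > upper)]
print(f"Fences: ({lower}, {upper})")
print(f"Outliers: {outliers}")
```

A larger `k` (for example 3.0) makes the rule stricter, so fewer points are flagged.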

Examples
This shows how the outlier 100 pulls the mean up to about 18.67, while the median stays at 2.5, close to the bulk of the data.
ML Python
import numpy as np

# Simple data
data = np.array([1, 2, 2, 3, 4, 100])

mean = np.mean(data)
median = np.median(data)
print(f"Mean: {mean}, Median: {median}")
This flags values that lie more than 1.5 * IQR below Q1 or above Q3, that is, far outside the middle 50% of the data.
ML Python
import numpy as np

# Same data as the previous example
data = np.array([1, 2, 2, 3, 4, 100])

# Detect outliers using IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]
print(f"Outliers: {outliers}")
A histogram helps visualize how the data points are spread and whether outliers exist.
ML Python
import numpy as np
import matplotlib.pyplot as plt

# Same data as the previous examples
data = np.array([1, 2, 2, 3, 4, 100])

plt.hist(data, bins=10)
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
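To inspect the binning without opening a plot window, `np.histogram` returns the counts and bin edges directly. The sketch below assumes the six-value example data used in this section. It shows how a single extreme value stretches the bin range so that the remaining points collapse into one bin.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 4, 100])

# Same binning that plt.hist(data, bins=10) would draw
counts, edges = np.histogram(data, bins=10)

# Print each bin's range and count (the last bin also includes its right edge)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```

Five of the six values land in the first bin and the outlier sits alone in the last one, which is the same pattern the plotted histogram shows.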
Sample Program

This program calculates basic statistics, finds outliers using the IQR method, and shows a histogram to visualize the data distribution including outliers.

ML Python
import numpy as np
import matplotlib.pyplot as plt

# Sample data with outliers
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 100, 105])

# Calculate statistics
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)

# Detect outliers using IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]

print(f"Mean: {mean:.2f}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Outliers detected: {outliers}")

# Plot histogram
plt.hist(data, bins=10, color='skyblue', edgecolor='black')
plt.title('Data Distribution with Outliers')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Important Notes

Outliers can skew the mean, but the median is usually much less affected.

Visualizing data helps spot patterns and unusual points easily.

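Check for outliers before training a model. They can distort the fit, but they are not always errors, so inspect them before deciding whether to remove, transform, or keep them.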
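One possible way to handle detected outliers is to cap them at the IQR fences instead of deleting them. The sketch below reuses the sample program's data with `np.clip`. Capping is shown only as an illustration; whether it is appropriate depends on whether the extreme values are errors or genuine observations.

```python
import numpy as np

# Sample program data, stored as floats so capped values keep their decimals
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 100, 105], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the fences at the nearest fence
capped = np.clip(data, lower, upper)

print(f"Mean before: {np.mean(data):.2f}, after: {np.mean(capped):.2f}")
print(f"Median before: {np.median(data)}, after: {np.median(capped)}")
```

On this data the mean drops from 30.2 to roughly 13, while the median is unchanged at 12.5.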

Summary

Data distribution shows how data points spread out.

Outliers are unusual points that differ markedly from most of the data.

Use statistics and plots to understand and detect outliers.