Understanding a data distribution shows how values are spread out and where they cluster. Outlier detection flags unusual points that may distort a model.
Data distributions and outliers in ML Python
Introduction
Checking if your data is mostly centered around a value or spread out.
Finding strange values that don't fit the usual pattern in your data.
Deciding if you need to clean or transform data before training a model.
Understanding the shape of data to choose the right machine learning method.
Visualizing data to explain results to others clearly.
Syntax
ML Python
import numpy as np
import matplotlib.pyplot as plt

# Example data
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Calculate basic statistics
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)

# Detect outliers using IQR method
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]

# Plot histogram to see distribution
plt.hist(data, bins=20)
plt.show()
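Use the mean and median to describe the center of the data, and the standard deviation to describe its spread. Note that np.std computes the population standard deviation by default (ddof=0).
IQR (interquartile range) is Q3 - Q1, the width of the middle 50% of the data. Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as outliers.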
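The IQR rule described above can be wrapped in a small helper so the fences are visible alongside the flagged values. This is a minimal sketch: the function name `iqr_outliers` and the multiplier parameter `k` are illustrative choices, not part of NumPy.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (outliers, lower_fence, upper_fence) using the IQR rule.

    `iqr_outliers` and `k` are hypothetical names chosen for this sketch.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep only the values that fall outside the two fences
    mask = (values < lower) | (values > upper)
    return values[mask], lower, upper


outliers, lower, upper = iqr_outliers([1, 2, 2, 3, 4, 100])
print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {outliers}")
```

A larger `k` makes the rule more tolerant, so fewer points are flagged.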
Examples
This shows how the outlier 100 pulls the mean upward, while the median stays close to the typical values.
ML Python
import numpy as np

# Simple data
data = np.array([1, 2, 2, 3, 4, 100])

mean = np.mean(data)
median = np.median(data)
print(f"Mean: {mean}, Median: {median}")
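To gauge how much a single extreme value moves each statistic, one option is to compare the results with and without it. The sketch below reuses the same six values; the names `trimmed`, `mean_shift`, and `median_shift` are illustrative.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 4, 100])
trimmed = data[data != 100]  # same data with the outlier 100 dropped

# How far each statistic moves when the outlier is included
mean_shift = np.mean(data) - np.mean(trimmed)
median_shift = np.median(data) - np.median(trimmed)

print(f"Mean shift: {mean_shift:.2f}")
print(f"Median shift: {median_shift:.2f}")
```

The mean moves by many units while the median moves by only half a unit, which is why the median is often preferred as a summary when outliers are suspected.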
This flags values that lie more than 1.5 * IQR below Q1 or above Q3, that is, far outside the middle 50% of the data.
ML Python
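import numpy as np

# Same data as the previous example, redefined so this block runs on its own
data = np.array([1, 2, 2, 3, 4, 100])

# Detect outliers using IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]
print(f"Outliers: {outliers}")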
A histogram helps visualize how data points are spread and whether outliers exist.
ML Python
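import numpy as np
import matplotlib.pyplot as plt

# Same data as the previous examples, redefined so this block runs on its own
data = np.array([1, 2, 2, 3, 4, 100])

plt.hist(data, bins=10)
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()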
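When a plot window is not available, the bin counts behind a histogram can be inspected as numbers with np.histogram. This is a small sketch on the same six values, using the same number of equal-width bins as the plot above.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 4, 100])

# np.histogram returns the per-bin counts and the bin edges
counts, edges = np.histogram(data, bins=10)

# Print each bin as a range followed by how many values fall in it
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:6.1f}, {right:6.1f}): {count}")
```

A long run of empty bins between the main cluster and an isolated bin at one end is a typical sign of an outlier.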
Sample Program
This program calculates basic statistics, finds outliers using the IQR method, and shows a histogram to visualize the data distribution including outliers.
ML Python
import numpy as np
import matplotlib.pyplot as plt

# Sample data with outliers
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 100, 105])

# Calculate statistics
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)

# Detect outliers using IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]

print(f"Mean: {mean:.2f}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Outliers detected: {outliers}")

# Plot histogram
plt.hist(data, bins=10, color='skyblue', edgecolor='black')
plt.title('Data Distribution with Outliers')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Important Notes
Outliers can skew the mean but usually have little effect on the median.
Visualizing data helps spot patterns and unusual points easily.
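Check for outliers before training a model. Extreme values can distort the fit, but some outliers are valid data, so decide case by case whether to remove, cap, or keep them.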
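Once outliers are found, two common ways to clean the data are dropping them or capping them at the IQR fences. The sketch below applies both to the sample program's data; it is one possible approach under the 1.5 * IQR rule, and the names `removed` and `capped` are illustrative.

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 100, 105])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop values outside the fences
removed = data[(data >= lower) & (data <= upper)]

# Option 2: cap values at the fences (keeps the number of rows unchanged)
capped = np.clip(data, lower, upper)

print(f"Original mean: {np.mean(data):.2f}")
print(f"Mean after removal: {np.mean(removed):.2f}")
print(f"Mean after capping: {np.mean(capped):.2f}")
```

Removal shrinks the dataset, while capping keeps every row but alters the extreme values. Which one is appropriate depends on whether the outliers are errors or genuine observations.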
Summary
A data distribution shows how data points are spread out.
Outliers are unusual points that differ from most data.
Use statistics and plots to understand and detect outliers.