
How to Do EDA in Python: Step-by-Step Guide

To do exploratory data analysis (EDA) in Python, use pandas for data handling and matplotlib and seaborn for visualization. Start by loading the data with pandas.read_csv(), then use head(), describe(), and info(), along with plots such as histograms or boxplots, to understand patterns in the data and detect issues.

📐

Syntax

Here are common steps and functions used in Python for EDA:

import pandas as pd: Load pandas library for data handling.
df = pd.read_csv('file.csv'): Load data from a CSV file into a DataFrame.
df.head(): View first 5 rows to get a quick look.
df.describe(): Get summary statistics like mean, min, max.
df.info(): Check column data types and non-null counts, which reveal missing values.
df.plot() or seaborn functions: Create visualizations to explore data.

python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('data.csv')

# View first rows
print(df.head())

# Summary statistics
print(df.describe())

# Data info (info() prints its report itself and returns None, so no print() is needed)
df.info()

# Simple plot
sns.histplot(df['column_name'])
plt.show()
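
A few other pandas methods are often useful in a first pass alongside the calls above. The following is a minimal sketch, assuming a small made-up DataFrame (the column names and values are hypothetical, chosen only so the code runs without a CSV file):

```python
import pandas as pd

# Hypothetical data for illustration only; in practice df comes from pd.read_csv()
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice", "Paris"],
    "price": [12.5, 8.0, 12.5, None, 30.0],
    "units": [3, 1, 3, 2, 5],
})

# Number of rows and columns
print(df.shape)

# Data type of each column
print(df.dtypes)

# Frequency of each value in a text column
city_counts = df["city"].value_counts()
print(city_counts)

# Number of rows that exactly repeat an earlier row
n_duplicates = df.duplicated().sum()
print(n_duplicates)

# Summary that also covers non-numeric columns
print(df.describe(include="all"))
```

Here `value_counts()` and `describe(include="all")` cover text columns, which the default `describe()` leaves out.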

💻

Example

This example shows how to load a dataset, check its structure, and plot a histogram to see the distribution of a numeric column.

python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load sample data from seaborn
df = sns.load_dataset('tips')

# Show first 5 rows
print(df.head())

# Summary statistics
print(df.describe())

# Check data info (info() prints its report itself and returns None)
df.info()

# Plot histogram of total bill
sns.histplot(df['total_bill'], bins=20)
plt.title('Distribution of Total Bill')
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.show()

Output

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object
 3   smoker      244 non-null    object
 4   day         244 non-null    object
 5   time        244 non-null    object
 6   size        244 non-null    int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
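
The introduction mentions boxplots as another way to look at a numeric column. The following is a rough sketch of one, using a small made-up DataFrame whose columns only loosely mirror `day` and `total_bill` (the values are hypothetical, not taken from the tips dataset). It uses pandas' own `DataFrame.boxplot` and saves the figure to a file (the file name is an arbitrary choice); seaborn's `sns.boxplot` is an alternative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical bills for illustration only
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Fri", "Sat", "Sat", "Sat", "Sat"],
    "total_bill": [12.0, 15.5, 18.2, 10.4, 22.9, 16.0, 20.1, 25.3, 48.0, 19.7],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One box per day: each box spans the middle 50% of values,
# and points beyond the whiskers are drawn individually
df.boxplot(column="total_bill", by="day", ax=ax)

ax.set_title("Total Bill by Day")
ax.set_xlabel("Day")
ax.set_ylabel("Total Bill")
fig.suptitle("")  # drop the automatic "Boxplot grouped by day" heading

fig.savefig("total_bill_by_day.png")  # or plt.show() with an interactive backend
```

Points drawn beyond the whiskers are candidates for a closer look, not automatically errors.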

⚠️

Common Pitfalls

Common mistakes when doing EDA in Python include:

Not checking for missing values before analysis, which can cause errors or misleading results.
Using plots without labels or titles, making interpretation hard.
Ignoring data types, which can lead to missing or wrong summary statistics (for example, a numeric column stored as text is skipped by describe()).
Overlooking outliers that can skew analysis.

Always clean and understand your data before deep analysis.

python

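import pandas as pd

# Load data ('data.csv' is a placeholder file name)
df = pd.read_csv('data.csv')

# Wrong: computing statistics or plotting straight away,
# without checking whether any values are missing

# Right: count missing values per column first
missing = df.isnull().sum()
print('Missing values per column:\n', missing)

# Then handle them, e.g. fill a numeric column with its mean
# ('column' is a placeholder name)
# df['column'] = df['column'].fillna(df['column'].mean())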
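
The data-type pitfall from the list above can be illustrated the same way. This is a minimal sketch, assuming a made-up DataFrame in which a numeric column was read as text because of a stray "n/a" entry (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: 'amount' holds numbers stored as text
df = pd.DataFrame({
    "amount": ["10.5", "20.0", "n/a", "7.25"],
    "qty": [1, 2, 3, 4],
})

# describe() summarizes only numeric columns by default, so 'amount' is skipped
numeric_summary = df.describe()
print(numeric_summary.columns.tolist())

# Convert to numbers; values that cannot be parsed become NaN
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# 'amount' now appears in the summary, with one missing value to handle
print(df.describe())
print(df["amount"].isnull().sum())
```

Using `errors="coerce"` turns unparseable entries into missing values instead of raising an error, so it is worth counting the missing values again after the conversion.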

📊

Quick Reference

Here is a quick cheat-sheet for EDA steps in Python:

Step	Function/Method	Purpose
Load data	`pd.read_csv()`	Read CSV file into DataFrame
View data	`df.head()`	See first rows
Summary stats	`df.describe()`	Get numeric summaries
Check data types	`df.info()`	See column types and missing data
Check missing	`df.isnull().sum()`	Count missing values
Visualize	`sns.histplot()`, `df.plot()`	Plot data distributions
Handle missing	`df.fillna()`	Replace missing values
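
Several rows of the table can be combined into one small helper. The sketch below is only one possible arrangement: `quick_eda` is a hypothetical function name, not part of pandas, and the sample data is made up.

```python
import pandas as pd


def quick_eda(df: pd.DataFrame) -> dict:
    """Collect a few first-pass facts about a DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,                             # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),     # column -> type name
        "missing": df.isnull().sum().to_dict(),        # column -> missing count
        "numeric_summary": df.describe(),              # stats for numeric columns
    }


# Made-up data for illustration only
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Oslo", "Rome", "Oslo", None],
})

report = quick_eda(df)
print(report["shape"])
print(report["missing"])
print(report["numeric_summary"])
```

Returning a dictionary rather than printing inside the function keeps the results available for later checks.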

✅

Key Takeaways

Use pandas to load and inspect your data quickly with head(), describe(), and info().

Visualize data distributions and relationships using matplotlib and seaborn plots.

Always check and handle missing values before analysis to avoid errors.

Label your plots clearly to make insights easy to understand.

Look for outliers and check data types to ensure accurate analysis.
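
One common convention for flagging outliers is the 1.5 × IQR rule. The article does not prescribe a method, so the following is only a sketch of that convention on made-up values:

```python
import pandas as pd

# Made-up values with one obviously large entry
s = pd.Series([12, 14, 15, 15, 16, 18, 19, 95], name="total_bill")

# Interquartile range: spread of the middle 50% of the data
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Flagged values deserve inspection before any decision to remove them, since they may be genuine observations.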