How to Do EDA in Python: Step-by-Step Guide
To do EDA (exploratory data analysis) in Python, use pandas for data handling and matplotlib and seaborn for visualization. Start by loading data with pandas.read_csv(), then use head(), describe(), and plots such as histograms or boxplots to understand patterns in the data and detect issues.
Syntax
Here are common steps and functions used in Python for EDA:
- import pandas as pd: Load the pandas library for data handling.
- df = pd.read_csv('file.csv'): Load data from a CSV file into a DataFrame.
- df.head(): View the first 5 rows to get a quick look.
- df.describe(): Get summary statistics like mean, min, max.
- df.info(): Check data types and missing values.
- df.plot() or seaborn functions: Create visualizations to explore data.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('data.csv')

# View first rows
print(df.head())

# Summary statistics
print(df.describe())

# Data info (info() prints its report itself and returns None)
df.info()

# Simple plot
sns.histplot(df['column_name'])
plt.show()
```
Example
This example shows how to load a dataset, check its structure, and plot a histogram to see the distribution of a numeric column.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load sample data from seaborn
df = sns.load_dataset('tips')

# Show first 5 rows
print(df.head())

# Summary statistics
print(df.describe())

# Check data info (info() prints its report itself and returns None)
df.info()

# Plot histogram of total bill
sns.histplot(df['total_bill'], bins=20)
plt.title('Distribution of Total Bill')
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.show()
```
Output
```
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object
 3   smoker      244 non-null    object
 4   day         244 non-null    object
 5   time        244 non-null    object
 6   size        244 non-null    int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
```
The dtypes and memory usage you see may differ: depending on the seaborn version, load_dataset('tips') can return sex, smoker, day, and time as category columns instead of object.
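A histogram covers one column at a time, but EDA usually also looks at how columns relate. The following is a minimal sketch of a group-wise summary and a correlation check. It assumes a small hand-made sample that borrows the tips column names; the values are illustrative and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical mini-sample using the tips column names; values are made up.
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 15.0, 25.0],
    "tip": [1.0, 2.0, 3.0, 4.0, 2.0, 3.0],
    "day": ["Sat", "Sat", "Sun", "Sun", "Sat", "Sun"],
})

# Group-wise summary: mean bill and mean tip for each day
by_day = sample.groupby("day")[["total_bill", "tip"]].mean()
print(by_day)

# Pearson correlation between the two numeric columns
corr = sample["total_bill"].corr(sample["tip"])
print(f"Correlation between total_bill and tip: {corr:.2f}")
```

The same two calls should work on the full DataFrame from the example above; a seaborn boxplot or scatterplot is a natural visual follow-up.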
Common Pitfalls
Common mistakes when doing EDA in Python include:
- Not checking for missing values before analysis, which can cause errors or misleading results.
- Using plots without labels or titles, making interpretation hard.
- Ignoring data types, which can lead to wrong summary statistics (for example, numbers stored as text are left out of the numeric summary from describe()).
- Overlooking outliers that can skew analysis.
Always clean and understand your data before deep analysis.
```python
import pandas as pd

# Load data (placeholder file name)
df = pd.read_csv('data.csv')

# Wrong: skipping the missing-value check
# This may cause errors or misleading results later

# Right: check and handle missing values
missing = df.isnull().sum()
print('Missing values per column:\n', missing)

# Example: fill missing values with the column mean
# df['column'] = df['column'].fillna(df['column'].mean())
```
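Two other pitfalls from the list, data types and outliers, can be sketched in a similar way. This is a minimal illustration under stated assumptions: the `price` column and its values are hypothetical, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only option.

```python
import pandas as pd

# Hypothetical data: 'price' was read as text, and one value is extreme.
df = pd.DataFrame({"price": ["10", "12", "11", "13", "12", "95", "n/a"]})

# Stored as text, the column gets count/unique/top/freq from describe()
# instead of mean/min/max.
print(df["price"].describe())

# Convert to numbers; entries that cannot be parsed become NaN instead of raising.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Flag outliers with the 1.5 * IQR rule.
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["price"] < lower) | (df["price"] > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment call, so inspect the rows before dropping or capping them.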
Quick Reference
Here is a quick cheat-sheet for EDA steps in Python:
| Step | Function/Method | Purpose |
|---|---|---|
| Load data | pd.read_csv() | Read CSV file into DataFrame |
| View data | df.head() | See first rows |
| Summary stats | df.describe() | Get numeric summaries |
| Check data types | df.info() | See column types and missing data |
| Check missing | df.isnull().sum() | Count missing values |
| Visualize | sns.histplot(), df.plot() | Plot data distributions |
| Handle missing | df.fillna() | Replace missing values |
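Several rows of the table can be combined into a single helper. The sketch below is one possible way to do it; `quick_overview` is a hypothetical name chosen for illustration, not a pandas function, and the sample data is made up.

```python
import pandas as pd

def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "unique": df.nunique(),  # NaN is not counted as a unique value
    })

# Hypothetical example data with one missing value in each column
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Paris", "Rome", "Rome", None],
})
overview = quick_overview(df)
print(overview)
```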
Key Takeaways
Use pandas to load and inspect your data quickly with head(), describe(), and info().
Visualize data distributions and relationships using matplotlib and seaborn plots.
Always check and handle missing values before analysis to avoid errors.
Label your plots clearly to make insights easy to understand.
Check data types and look for outliers to ensure accurate analysis.