How to Clean Data in Python: Simple Steps and Examples
To clean data in Python, use the pandas library, which provides functions like dropna() to remove missing values, fillna() to replace them, and drop_duplicates() to remove repeated rows. These tools help prepare your data for analysis by fixing common problems like missing or duplicate data.
Syntax
Here are common pandas methods to clean data:
- df.dropna(): removes rows with missing values.
- df.fillna(value): replaces missing values with value.
- df.drop_duplicates(): removes duplicate rows.
- df.astype(type): changes the data type of columns.
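```python
import pandas as pd

# Small sample frame so the calls below run as written
df = pd.DataFrame({'col': [1.0, 2.0, None, 2.0]})

# Remove rows with missing values
df_clean = df.dropna()

# Replace missing values with 0
df_filled = df.fillna(0)

# Remove duplicate rows
df_unique = df.drop_duplicates()

# Change column type to integer (astype(int) fails if missing values remain)
df_filled['col'] = df_filled['col'].astype(int)
```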
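By default dropna() removes a row if any column is missing, which can be stricter than you want. The sketch below shows three standard parameters that narrow this behavior: subset, how, and thresh. The sample data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article
df = pd.DataFrame({
    'name': ['Ann', 'Ben', None, None],
    'score': [90.0, None, 75.0, None],
})

# Drop a row only when 'name' is missing
by_subset = df.dropna(subset=['name'])

# Drop a row only when every column is missing
all_missing = df.dropna(how='all')

# Keep rows that have at least 2 non-missing values
by_thresh = df.dropna(thresh=2)

print(len(by_subset), len(all_missing), len(by_thresh))  # 2 3 1
```

Which option fits depends on which columns your analysis actually needs.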
Example
This example shows how to clean a small dataset by removing missing values, filling missing values, and dropping duplicates.
```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', None, 'David', 'Alice'],
    'Age': [25, None, 30, 22, 25],
    'City': ['New York', 'Los Angeles', 'New York', None, 'New York']
}

df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)

# Remove rows with any missing values
df_no_missing = df.dropna()
print('\nAfter dropna():')
print(df_no_missing)

# Fill missing values with a placeholder
df_filled = df.fillna({'Name': 'Unknown', 'Age': 0, 'City': 'Unknown'})
print('\nAfter fillna():')
print(df_filled)

# Remove duplicate rows
df_unique = df_filled.drop_duplicates()
print('\nAfter drop_duplicates():')
print(df_unique)
```
Output
```
Original DataFrame:
    Name   Age         City
0  Alice  25.0     New York
1    Bob   NaN  Los Angeles
2   None  30.0     New York
3  David  22.0         None
4  Alice  25.0     New York

After dropna():
    Name   Age      City
0  Alice  25.0  New York
4  Alice  25.0  New York

After fillna():
      Name   Age         City
0    Alice  25.0     New York
1      Bob   0.0  Los Angeles
2  Unknown  30.0     New York
3    David  22.0      Unknown
4    Alice  25.0     New York

After drop_duplicates():
      Name   Age         City
0    Alice  25.0     New York
1      Bob   0.0  Los Angeles
2  Unknown  30.0     New York
3    David  22.0      Unknown
```
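Before choosing between dropping and filling, it can help to count how much is actually missing or duplicated. A minimal sketch using the same sample data as the example above; the variable names missing_per_column and duplicate_rows are illustrative only.

```python
import pandas as pd

# Same sample data as the example above
data = {
    'Name': ['Alice', 'Bob', None, 'David', 'Alice'],
    'Age': [25, None, 30, 22, 25],
    'City': ['New York', 'Los Angeles', 'New York', None, 'New York']
}
df = pd.DataFrame(data)

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Count rows that exactly repeat an earlier row
duplicate_rows = int(df.duplicated().sum())
print('Duplicate rows:', duplicate_rows)
```

Here each column has one missing value and one row is a repeat, so dropna() would discard three of the five rows while fillna() keeps them all.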
Common Pitfalls
Common mistakes when cleaning data include:
- Removing all rows with missing data without checking if important data is lost.
- Filling missing values with inappropriate defaults that skew analysis.
- Not resetting the index after dropping rows, which leaves gaps in the row labels (such as 0 followed by 4) and can cause confusion.
- Forgetting to remove duplicates before analysis.
```python
import pandas as pd

data = {'A': [1, None, 3, 3], 'B': [4, 5, None, 4]}
df = pd.DataFrame(data)

# Wrong: dropping missing values without checking
wrong = df.dropna()
print('Wrong dropna():')
print(wrong)

# Right: fill missing values carefully
right = df.fillna({'A': df['A'].mean(), 'B': df['B'].median()})
print('\nRight fillna():')
print(right)
```
Output
```
Wrong dropna():
     A    B
0  1.0  4.0
3  3.0  4.0

Right fillna():
          A    B
0  1.000000  4.0
1  2.333333  5.0
2  3.000000  4.0
3  3.000000  4.0
```
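The index pitfall listed above is easy to see with the same data: after dropna() the surviving rows keep their old labels. A short sketch of renumbering them with reset_index(drop=True); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3, 3], 'B': [4, 5, None, 4]})

# Rows 1 and 2 are dropped, so the remaining labels are 0 and 3
cleaned = df.dropna()
print(list(cleaned.index))  # [0, 3]

# drop=True discards the old labels instead of keeping them as a column
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```

Skip the reset if the original labels are meaningful, for example when they are needed to trace rows back to the source data.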
Quick Reference
Summary tips for cleaning data in Python:
- Use dropna() to remove rows with missing data.
- Use fillna() to replace missing values thoughtfully.
- Use drop_duplicates() to remove repeated rows.
- Check data types with df.dtypes and convert with astype().
- Always inspect data before and after cleaning.
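The type-checking tip can be sketched as follows. The column names and values are hypothetical, and pd.to_numeric with errors='coerce' is shown as one possible way to handle text that astype() cannot convert; it is not a method the article itself covers.

```python
import pandas as pd

# Hypothetical data: numbers stored as text, with one unparseable entry
df = pd.DataFrame({
    'qty': ['1', '2', 'three'],
    'price': ['1.5', '2.0', '3.25'],
})

# Inspect the current types before converting
print(df.dtypes)

# Every value in 'price' parses cleanly, so astype(float) works
df['price'] = df['price'].astype(float)

# 'three' would make astype() raise; coerce turns it into NaN instead
df['qty'] = pd.to_numeric(df['qty'], errors='coerce')

print(df.dtypes)
print(df)
```

Coercion trades an error for a missing value, so it is worth counting the new NaN entries afterwards.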
Key Takeaways
Use pandas methods like dropna(), fillna(), and drop_duplicates() to clean data efficiently.
Always inspect your data before and after cleaning to avoid losing important information.
Fill missing values with meaningful defaults, such as a column's mean or median, instead of arbitrary ones to preserve data quality.
Remove duplicates to prevent skewed analysis results.
Check and convert data types to ensure correct processing.