
How to Clean Data in Python: Simple Steps and Examples

To clean data in Python, use the pandas library, which provides methods such as dropna() to remove rows with missing values, fillna() to replace missing values, and drop_duplicates() to remove repeated rows. These methods prepare your data for analysis by fixing common problems such as missing or duplicate entries.

Syntax

Here are common pandas methods to clean data:

  • df.dropna(): removes rows with missing values.
  • df.fillna(value): replaces missing values with value.
  • df.drop_duplicates(): removes duplicate rows.
  • df.astype(dtype): converts columns to the given data type.
python
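import pandas as pd

# Small sample frame so the snippets below run as written
df = pd.DataFrame({'col': [1.0, None, 3.0, 3.0]})

# Remove rows with missing values
df_clean = df.dropna()

# Replace missing values with 0
df_filled = df.fillna(0)

# Remove duplicate rows
df_unique = df.drop_duplicates()

# Change column type to integer
# (use the filled frame: a column that still contains NaN cannot be cast to int)
df_filled['col'] = df_filled['col'].astype(int)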
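
The methods above also accept options that narrow what gets removed. The following is a minimal sketch on hypothetical sample data (the column names and values are made up for illustration). It uses the subset and thresh parameters of dropna() and the subset and keep parameters of drop_duplicates().

```python
import pandas as pd

# Hypothetical sample data, not taken from the article
df = pd.DataFrame({
    'name': ['Ann', 'Ben', None, 'Ann'],
    'age': [31, None, None, 31],
    'city': ['Oslo', 'Rome', None, 'Paris'],
})

# Drop a row only when 'name' is missing
named = df.dropna(subset=['name'])

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)

# Treat rows as duplicates when 'name' and 'age' both match; keep the last one
latest = df.drop_duplicates(subset=['name', 'age'], keep='last')

print(named)
print(mostly_complete)
print(latest)
```

Restricting the check to specific columns is often gentler than a bare dropna() or drop_duplicates(), because a row is discarded only when the columns you actually rely on are affected.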

Example

This example shows how to clean a small dataset by removing missing values, filling missing values, and dropping duplicates.

python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', None, 'David', 'Alice'],
    'Age': [25, None, 30, 22, 25],
    'City': ['New York', 'Los Angeles', 'New York', None, 'New York']
}

df = pd.DataFrame(data)

print('Original DataFrame:')
print(df)

# Remove rows with any missing values
df_no_missing = df.dropna()
print('\nAfter dropna():')
print(df_no_missing)

# Fill missing values with a placeholder
df_filled = df.fillna({'Name': 'Unknown', 'Age': 0, 'City': 'Unknown'})
print('\nAfter fillna():')
print(df_filled)

# Remove duplicate rows
df_unique = df_filled.drop_duplicates()
print('\nAfter drop_duplicates():')
print(df_unique)
Output
Original DataFrame:
    Name   Age         City
0  Alice  25.0     New York
1    Bob   NaN  Los Angeles
2   None  30.0     New York
3  David  22.0         None
4  Alice  25.0     New York

After dropna():
    Name   Age      City
0  Alice  25.0  New York
4  Alice  25.0  New York

After fillna():
      Name   Age         City
0    Alice  25.0     New York
1      Bob   0.0  Los Angeles
2  Unknown  30.0     New York
3    David  22.0      Unknown
4    Alice  25.0     New York

After drop_duplicates():
      Name   Age         City
0    Alice  25.0     New York
1      Bob   0.0  Los Angeles
2  Unknown  30.0     New York
3    David  22.0      Unknown

Common Pitfalls

Common mistakes when cleaning data include:

  • Removing all rows with missing data without checking if important data is lost.
  • Filling missing values with inappropriate defaults that skew analysis.
  • Not resetting the index after dropping rows, which leaves gaps in the index labels (for example 0, 3) and can cause confusion in later steps.
  • Forgetting to remove duplicates before analysis.
python
import pandas as pd

data = {'A': [1, None, 3, 3], 'B': [4, 5, None, 4]}
df = pd.DataFrame(data)

# Wrong: dropping missing values without checking
wrong = df.dropna()
print('Wrong dropna():')
print(wrong)

# Right: fill missing values carefully
right = df.fillna({'A': df['A'].mean(), 'B': df['B'].median()})
print('\nRight fillna():')
print(right)
Output
Wrong dropna():
     A    B
0  1.0  4.0
3  3.0  4.0

Right fillna():
          A    B
0  1.000000  4.0
1  2.333333  5.0
2  3.000000  4.0
3  3.000000  4.0
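
Two of the pitfalls listed above, losing rows without noticing and leaving gaps in the index, can be checked directly. This is a minimal sketch that reuses the data from the pitfalls example. The variable names missing_per_column, rows_lost, and cleaned are illustrative only.

```python
import pandas as pd

# Same data as the pitfalls example
df = pd.DataFrame({'A': [1, None, 3, 3], 'B': [4, 5, None, 4]})

# Count missing values per column before deciding how to handle them
missing_per_column = df.isna().sum()
print(missing_per_column)

# Measure how many rows dropna() would discard
rows_lost = len(df) - len(df.dropna())
print(f'dropna() would remove {rows_lost} of {len(df)} rows')

# After dropping, renumber the index so it runs from 0 again
cleaned = df.dropna().reset_index(drop=True)
print(cleaned)
```

Here half of the rows would be lost, which is a strong hint that filling values may be a better choice than dropping them for this dataset.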

Quick Reference

Summary tips for cleaning data in Python:

  • Use dropna() to remove rows with missing data.
  • Use fillna() to replace missing values thoughtfully.
  • Use drop_duplicates() to remove repeated rows.
  • Check data types with df.dtypes and convert with astype().
  • Always inspect data before and after cleaning.
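
The type-checking tip can be sketched as follows, on hypothetical data where numbers arrived as text. Note that pd.to_numeric() is not mentioned in the article. It is brought in here as one possible way to handle text that astype() could not convert, so treat this as an illustration and not the only approach.

```python
import pandas as pd

# Hypothetical data: prices stored as text, one entry unparseable
df = pd.DataFrame({'price': ['10', '12', 'n/a'], 'qty': [1.0, 2.0, 3.0]})

# Inspect the types before converting
print(df.dtypes)

# errors='coerce' turns text that cannot be parsed into NaN instead of raising
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# 'qty' has no missing values, so a plain integer cast is safe
df['qty'] = df['qty'].astype(int)

# Inspect again after converting
print(df.dtypes)
print(df)
```

Comparing df.dtypes before and after makes it easy to confirm that each column ended up with the type you intended, and the new NaN in price shows which entry needs follow-up.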

Key Takeaways

Use pandas methods like dropna(), fillna(), and drop_duplicates() to clean data efficiently.
Always inspect your data before and after cleaning to avoid losing important information.
Fill missing values with meaningful defaults instead of arbitrary ones to preserve data quality.
Remove duplicates to prevent skewed analysis results.
Check and convert data types to ensure correct processing.