
Combining multiple cleaning steps in Pandas - Step-by-Step Execution

Concept Flow - Combining multiple cleaning steps
Load raw data
Step 1: Handle missing values
Step 2: Remove duplicates
Step 3: Fix data types
Step 4: Rename columns
Cleaned data ready for analysis
Data cleaning involves applying several steps one after another to prepare data for analysis.
Execution Sample
Pandas
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4, 4],
    'B': ['x', 'y', 'y', None, 'y']
})

# Clean data
cleaned = (df.dropna()
             .drop_duplicates()
             .astype({'A': 'int'})
             .rename(columns={'A': 'Alpha', 'B': 'Beta'}))
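This code builds a small DataFrame, drops rows with missing values, removes duplicate rows, converts column A to int, and renames the columns. Each method returns a new DataFrame, so df itself is left unchanged.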
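To see what each link in the chain contributes, the same four calls can be written one step at a time. This is only a sketch for inspection; the intermediate variable names (no_missing, no_dupes, typed) are illustrative and not part of the original sample.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4, 4],
    'B': ['x', 'y', 'y', None, 'y']
})

# Same four steps as the chain, one variable per stage (names are illustrative).
no_missing = df.dropna()                      # drop rows with any missing value
no_dupes = no_missing.drop_duplicates()       # drop fully identical rows
typed = no_dupes.astype({'A': 'int'})         # cast column 'A' from float to int
cleaned = typed.rename(columns={'A': 'Alpha', 'B': 'Beta'})

# Print the shape after every stage to follow how the data shrinks.
stages = [('original', df), ('dropna', no_missing),
          ('drop_duplicates', no_dupes), ('astype', typed),
          ('rename', cleaned)]
for label, frame in stages:
    print(f"{label:16} shape={frame.shape}")
```

Splitting the chain like this is handy while debugging; once the steps behave as expected, the chained form avoids leaving intermediate variables around.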
Execution Table
Step | Action | Data Snapshot | Resulting DataFrame Shape
1 | Original DataFrame (the None in 'A' is stored as NaN, so 'A' is float) | [{'A':1.0,'B':'x'},{'A':2.0,'B':'y'},{'A':NaN,'B':'y'},{'A':4.0,'B':None},{'A':4.0,'B':'y'}] | 5 rows, 2 cols
2 | Drop rows with missing values | [{'A':1.0,'B':'x'},{'A':2.0,'B':'y'},{'A':4.0,'B':'y'}] | 3 rows, 2 cols
3 | Drop duplicate rows (none found) | [{'A':1.0,'B':'x'},{'A':2.0,'B':'y'},{'A':4.0,'B':'y'}] | 3 rows, 2 cols
4 | Convert column 'A' to int | [{'A':1,'B':'x'},{'A':2,'B':'y'},{'A':4,'B':'y'}] | 3 rows, 2 cols
5 | Rename columns 'A'->'Alpha', 'B'->'Beta' | [{'Alpha':1,'Beta':'x'},{'Alpha':2,'Beta':'y'},{'Alpha':4,'Beta':'y'}] | 3 rows, 2 cols
6 | End of cleaning steps | Final cleaned DataFrame | 3 rows, 2 cols
💡 All cleaning steps applied; data ready for analysis.
Variable Tracker
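Variable | Start | After Step 2 | After Step 3 | After Step 4 | Final
df | [{'A':1.0,'B':'x'},{'A':2.0,'B':'y'},{'A':NaN,'B':'y'},{'A':4.0,'B':None},{'A':4.0,'B':'y'}] | unchanged | unchanged | unchanged | unchanged
(intermediate chain result) | N/A | [{'A':1.0,'B':'x'},{'A':2.0,'B':'y'},{'A':4.0,'B':'y'}] | [{'A':1.0,'B':'x'},{'A':2.0,'B':'y'},{'A':4.0,'B':'y'}] | [{'A':1,'B':'x'},{'A':2,'B':'y'},{'A':4,'B':'y'}] | [{'Alpha':1,'Beta':'x'},{'Alpha':2,'Beta':'y'},{'Alpha':4,'Beta':'y'}]
cleaned | N/A | N/A | N/A | N/A | [{'Alpha':1,'Beta':'x'},{'Alpha':2,'Beta':'y'},{'Alpha':4,'Beta':'y'}]
None of the chained methods modifies df in place; each returns a new DataFrame, and only the last one is assigned to cleaned.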
Key Moments - 3 Insights
Why does dropna() remove a row when any column is missing, rather than checking a single column?
By default dropna() uses how='any', so it removes every row in which at least one column is missing. Step 2 of the Execution Table shows this: the row with a missing 'A' and the row with a missing 'B' are both removed.
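When the default is too aggressive, dropna() accepts the how and subset parameters to narrow what counts as "missing enough". A minimal sketch on the same sample data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4, 4],
    'B': ['x', 'y', 'y', None, 'y']
})

any_missing = df.dropna()            # default how='any': drop a row if any column is missing
only_a = df.dropna(subset=['A'])     # only consider missing values in column 'A'
all_missing = df.dropna(how='all')   # drop a row only if every column is missing

print(len(any_missing), len(only_a), len(all_missing))  # 3 4 5
```

With subset=['A'] the row whose 'B' is missing survives, which matters if 'B' can be filled in or ignored later.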
Why does drop_duplicates() not remove any rows after dropna()?
After dropna(), the remaining rows are unique, so drop_duplicates() in step 3 finds no duplicates to remove.
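One way to confirm this before relying on drop_duplicates() is to count duplicates with duplicated(). The sketch below also shows the subset parameter, which is not used in the original sample, to illustrate how "duplicate" can be redefined per column.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4, 4],
    'B': ['x', 'y', 'y', None, 'y']
})

after_dropna = df.dropna()

# duplicated() flags rows that repeat an earlier row across ALL columns.
n_full_duplicates = int(after_dropna.duplicated().sum())
print(n_full_duplicates)  # 0

# Restricting the comparison to column 'B' treats the two 'y' rows as duplicates
# and keeps the first one (keep='first' is the default).
by_b = after_dropna.drop_duplicates(subset=['B'])
print(len(by_b))  # 2
```

The repeated 'y' values in column 'B' are not duplicates by the default definition, because the rows differ in column 'A'.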
Why do we convert column 'A' to int only after dropping missing values?
A column that still contains missing values cannot be cast to a plain integer type: pandas stores the None as NaN, which is a float. We therefore drop those rows first (step 2) and convert safely in step 4.
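The failure is easy to reproduce by calling astype() on the raw data. The sketch below catches the error and then shows two ways forward: the order used in this tutorial, and pandas' nullable 'Int64' dtype, which is an alternative not used in the original sample.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4, 4],
    'B': ['x', 'y', 'y', None, 'y']
})

# Casting while NaN is still present raises an error (a ValueError subclass).
try:
    df.astype({'A': 'int'})
    failed = False
except ValueError as exc:
    failed = True
    print(f"astype failed: {exc}")

# Option 1 (the tutorial's order): drop missing rows, then cast.
converted = df.dropna().astype({'A': 'int'})

# Option 2 (an alternative, not in the original): the nullable integer dtype
# keeps the rows and represents the gap as <NA>.
nullable = df.astype({'A': 'Int64'})
print(nullable['A'].dtype)
```

Which option fits depends on whether rows with a missing 'A' should be discarded or kept for later handling.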
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table at step 2. How many rows remain after dropna()?
A. 4
B. 3
C. 5
D. 2
💡 Hint
Check the 'Data Snapshot' and 'Resulting DataFrame Shape' columns at step 2.
At which step do the DataFrame columns get renamed?
A. Step 3
B. Step 5
C. Step 4
D. Step 6
💡 Hint
Look for the action mentioning renaming columns in the execution table.
If we skip dropna(), what problem will occur at step 4?
A. Duplicates will increase
B. No problem, conversion works fine
C. Conversion to int will fail due to missing values
D. Columns will not rename
💡 Hint
Refer to key moment about why dropna() is needed before converting types.
Concept Snapshot
Combining multiple cleaning steps in pandas:
- Chain methods like dropna(), drop_duplicates(), astype(), rename()
- Each step cleans data progressively
- Order matters: e.g., drop missing before type conversion
- Result is clean data ready for analysis
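The points above can be packaged as a small reusable function. This is a sketch under assumptions: the function name clean is hypothetical, and the column names are hard-coded to match the sample data.

```python
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps in order and return a new DataFrame."""
    return (raw.dropna()                      # 1. remove rows with missing values
               .drop_duplicates()             # 2. remove fully identical rows
               .astype({'A': 'int'})          # 3. safe now that NaN is gone
               .rename(columns={'A': 'Alpha', 'B': 'Beta'}))  # 4. clearer names


raw = pd.DataFrame({
    'A': [1, 2, None, 4, 4],
    'B': ['x', 'y', 'y', None, 'y']
})

cleaned = clean(raw)
print(cleaned)
print(list(raw.columns))  # the input is untouched: ['A', 'B']
```

The result keeps the original row labels (0, 1 and 4 here); chaining .reset_index(drop=True) at the end would renumber them if a contiguous index is preferred.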
Full Transcript
This example shows how to combine multiple data cleaning steps using pandas. We start with raw data containing missing values and repeated values. First, we remove rows with missing values using dropna(). Next, we call drop_duplicates() to remove duplicate rows; in this sample no two complete rows are identical after dropna(), so nothing is removed. Then, we convert the data type of column 'A' to integer, which requires that no missing values remain. Finally, we rename columns for clarity. Each step returns a new DataFrame, so the data changes progressively until we have a clean DataFrame ready for analysis.