
Building cleaning pipelines with pipe() in Pandas - Step-by-Step Execution

Concept Flow - Building cleaning pipelines with pipe()
Start with raw DataFrame
Define cleaning functions
Apply pipe() with first function
Apply pipe() with second function
...
Get cleaned DataFrame as output
Start with raw data, define cleaning steps as functions, then apply them step-by-step using pipe() to get a clean DataFrame.
Execution Sample
Pandas
import pandas as pd

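def drop_missing(df):
    # Drop every row that contains at least one missing value.
    return df.dropna()

def to_lowercase(df):
    # Return a new DataFrame with 'Name' lowercased instead of assigning into the input.
    return df.assign(Name=df['Name'].str.lower())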

raw = pd.DataFrame({'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]})
cleaned = raw.pipe(drop_missing).pipe(to_lowercase)
This code cleans a DataFrame by dropping rows with missing values, then converting the 'Name' column to lowercase using pipe().
Execution Table
Step | DataFrame State | Action | Resulting DataFrame
Start | {'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]} | Initial raw DataFrame | {'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]}
1 | {'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]} | Apply drop_missing (drop rows with any NaN) | {'Name': ['Alice'], 'Age': [25.0]}
2 | {'Name': ['Alice'], 'Age': [25.0]} | Apply to_lowercase (convert 'Name' to lowercase) | {'Name': ['alice'], 'Age': [25.0]}
End | {'Name': ['alice'], 'Age': [25.0]} | No more pipe steps | {'Name': ['alice'], 'Age': [25.0]}
Note: Age appears as 25.0 because the missing value in the raw data makes pandas store the column as float64.
💡 All pipe functions applied; final cleaned DataFrame obtained.
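To watch the intermediate states from the table while the chain runs, one option is to place a pass-through step between the cleaning functions. The sketch below assumes a hypothetical helper named `trace` and a hypothetical `shapes` list (neither is part of pandas); it relies on pipe() forwarding extra arguments to the function, which is how the label is passed.

```python
import pandas as pd

def drop_missing(df):
    # Drop every row that contains at least one missing value.
    return df.dropna()

def to_lowercase(df):
    # Return a new DataFrame with 'Name' lowercased.
    return df.assign(Name=df['Name'].str.lower())

shapes = []  # hypothetical log of (label, shape) pairs

def trace(df, label):
    # Hypothetical pass-through step: record the shape, return df unchanged.
    shapes.append((label, df.shape))
    return df

raw = pd.DataFrame({'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]})

cleaned = (
    raw.pipe(trace, 'start')
       .pipe(drop_missing)
       .pipe(trace, 'after drop_missing')
       .pipe(to_lowercase)
       .pipe(trace, 'after to_lowercase')
)

print(shapes)
print(cleaned)
```

Because `trace` returns its input untouched, it can be added or removed without affecting the cleaned result.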
Variable Tracker
Variable | Start | After 1 | After 2 | Final
raw | {'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]} | {'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]} | {'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]} | {'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]}
cleaned | N/A | {'Name': ['Alice'], 'Age': [25.0]} | {'Name': ['alice'], 'Age': [25.0]} | {'Name': ['alice'], 'Age': [25.0]}
Key Moments - 3 Insights
Why does the DataFrame lose rows after the first pipe step?
Because drop_missing uses dropna(), which removes any row with missing values, as shown in execution_table step 1.
Does pipe() change the original DataFrame?
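Not in this example. pipe() calls the function with the DataFrame and returns whatever the function returns. Here raw stays unchanged because drop_missing calls dropna(), which returns a new DataFrame, and every later step works on that new DataFrame, as seen in the raw variable values in variable_tracker. In pandas versions that do not use Copy-on-Write, a step that assigns into its input and runs first in the chain would modify the original, so it is safer to write steps that return a new DataFrame.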
Why do we return the DataFrame inside each cleaning function?
Returning the DataFrame allows pipe() to pass the updated DataFrame to the next function, enabling chaining, as shown in the execution_table steps.
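To illustrate why the return statement matters, here is a minimal sketch of a step that forgets it. The function name `lowercase_no_return` and the `chain_failed` flag are hypothetical and exist only for this demonstration.

```python
import pandas as pd

def lowercase_no_return(df):
    # Hypothetical buggy step: builds the result but forgets to return it.
    df.assign(Name=df['Name'].str.lower())

def drop_missing(df):
    # Drop every row that contains at least one missing value.
    return df.dropna()

raw = pd.DataFrame({'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]})

# pipe() hands back whatever the function returns, which here is None.
result = raw.pipe(lowercase_no_return)
print(result is None)

chain_failed = False
try:
    raw.pipe(lowercase_no_return).pipe(drop_missing)
except AttributeError:
    # None has no pipe() method, so the chain breaks at the next step.
    chain_failed = True
print(chain_failed)
```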
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table, what is the 'Name' column value after step 1?
A. [None]
B. ['BOB']
C. ['Alice']
D. ['alice']
💡 Hint
Check the 'Resulting DataFrame' column at step 1 in execution_table.
At which step does the 'Name' column become lowercase?
A. Step 1
B. Step 2
C. Start
D. End
💡 Hint
Look at the 'Action' and 'Resulting DataFrame' columns in execution_table for step 2.
If we skip the drop_missing function, what would happen to the DataFrame after pipe()?
A. Rows with missing values remain, 'Name' column lowercase applied
B. All rows removed
C. DataFrame becomes empty
D. No changes at all
💡 Hint
Consider what drop_missing does in step 1 and what to_lowercase does in step 2.
Concept Snapshot
Use pipe() to chain cleaning functions on DataFrames.
Each function takes a DataFrame and returns a DataFrame.
pipe() passes the DataFrame through each function in order.
This creates clear, readable cleaning pipelines.
Example: df.pipe(func1).pipe(func2)
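Cleaning functions often need parameters, and pipe() accepts extra positional and keyword arguments that it passes on to the function. The sketch below is one possible variation on the example; the `subset` parameter on `drop_missing` and the function `lowercase_column` are assumptions added for illustration.

```python
import pandas as pd

def drop_missing(df, subset=None):
    # Drop rows with missing values, optionally checking only the given columns.
    return df.dropna(subset=subset)

def lowercase_column(df, column):
    # Return a new DataFrame with the chosen text column lowercased.
    return df.assign(**{column: df[column].str.lower()})

raw = pd.DataFrame({'Name': ['Alice', None, 'BOB'], 'Age': [25, 30, None]})

cleaned = (
    raw.pipe(drop_missing, subset=['Name'])  # keyword argument forwarded by pipe()
       .pipe(lowercase_column, 'Name')       # positional argument forwarded by pipe()
)
print(cleaned)
```

With `subset=['Name']`, only the row with a missing name is dropped, so the 'BOB' row survives even though its age is missing.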
Full Transcript
We start with a raw DataFrame containing some missing values and mixed case names. We define two cleaning functions: one to drop rows with missing data, and another to convert the 'Name' column to lowercase. Using pipe(), we apply these functions one after another. After the first pipe step, rows with missing values are removed. After the second, the names are all lowercase. The original DataFrame remains unchanged, and the cleaned DataFrame is the final output. This method helps build clear, step-by-step cleaning pipelines.
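The transcript notes that the original DataFrame remains unchanged. One way to make that guarantee explicit, regardless of how later steps are written, is to begin the pipeline with a step that copies the data. This is a sketch under that assumption; `start_pipeline` and `lowercase_in_place` are hypothetical names, not pandas functions.

```python
import pandas as pd

def start_pipeline(df):
    # Hypothetical first step: work on an explicit copy of the caller's data.
    return df.copy()

def lowercase_in_place(df):
    # Assigns into its input; harmless here because it receives the copy.
    df['Name'] = df['Name'].str.lower()
    return df

raw = pd.DataFrame({'Name': ['Alice', 'BOB'], 'Age': [25, 30]})
snapshot = raw.copy()  # kept only to compare against raw afterwards

cleaned = raw.pipe(start_pipeline).pipe(lowercase_in_place)

print(raw.equals(snapshot))
print(cleaned['Name'].tolist())
```

The copy costs extra memory, so it is a trade-off: it is most useful when some steps in the chain assign into their input rather than returning a new DataFrame.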