
Building cleaning pipelines with pipe() in Pandas

Introduction

Using pipe() helps you clean data step-by-step in a clear and organized way. It makes your code easier to read and reuse.

Reach for pipe() in these situations:
When you want to apply multiple cleaning steps to a dataset in order.
When you want to keep your data cleaning code neat and easy to follow.
When you want to reuse cleaning functions on different datasets.
When you want to avoid writing long chains of commands that are hard to read.
When you want to share your cleaning steps with others clearly.
Syntax
Pandas
cleaned_data = (dataframe
    .pipe(function1, arg1, arg2)
    .pipe(function2)
    .pipe(function3, arg3=value3)
)

pipe() passes the DataFrame to the function as the first argument.

Any extra positional or keyword arguments you give to pipe() are forwarded to the function after the DataFrame.
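To make the argument forwarding concrete, here is a minimal sketch. The helper keep_at_least and the scores data are hypothetical names made up for illustration; the point is only that calling pipe() with extra arguments gives the same result as calling the function directly.

```python
import pandas as pd

# Hypothetical helper for illustration: keep rows where `column` >= `minimum`.
def keep_at_least(df, column, minimum=0):
    return df[df[column] >= minimum]

# Made-up sample data.
scores = pd.DataFrame({"score": [40, 75, 90]})

# pipe() passes `scores` as the first argument, then forwards
# "score" (positional) and minimum=70 (keyword) unchanged.
piped = scores.pipe(keep_at_least, "score", minimum=70)

# The equivalent direct call.
direct = keep_at_least(scores, "score", minimum=70)

print(piped.equals(direct))
```

Both calls run the same function with the same arguments; pipe() only changes how the call is written, which is what lets it sit inside a method chain.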

Examples
This example drops rows with missing values using a custom function with pipe().
Pandas
def drop_missing(df):
    return df.dropna()

cleaned = df.pipe(drop_missing)
This example selects specific columns by passing extra arguments through pipe().
Pandas
def select_columns(df, cols):
    return df[cols]

cleaned = df.pipe(select_columns, ['A', 'B'])
This example chains the two cleaning steps above in order using pipe().
Pandas
cleaned = (df
    .pipe(drop_missing)
    .pipe(select_columns, ['A', 'B'])
)
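For comparison, the same two steps can be written as nested function calls. The sketch below assumes a small made-up DataFrame with columns A, B, and C, and checks that the nested form and the pipe() chain give the same result. The difference is readability: the nested form reads inside-out, while the chain reads top to bottom in the order the steps run.

```python
import pandas as pd

def drop_missing(df):
    return df.dropna()

def select_columns(df, cols):
    return df[cols]

# Made-up sample data with one missing value in column A.
df = pd.DataFrame({
    "A": [1.0, None, 3.0],
    "B": ["x", "y", "z"],
    "C": [True, False, True],
})

# Nested calls: the first step to run is the innermost one.
nested = select_columns(drop_missing(df), ["A", "B"])

# pipe() chain: steps are listed in the order they run.
chained = (df
    .pipe(drop_missing)
    .pipe(select_columns, ["A", "B"])
)

print(chained.equals(nested))
```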
Sample Program

This program creates a small dataset with missing values and extra columns. It then cleans the data by dropping rows with missing values, selecting only the 'Name' and 'Age' columns, and renaming them. The pipe() method makes the steps easy to read and follow.

Pandas
import pandas as pd

def drop_missing(df):
    return df.dropna()

def select_columns(df, cols):
    return df[cols]

def rename_columns(df, new_names):
    return df.rename(columns=new_names)

# Sample data with missing values and extra columns
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 30, 22],
    'City': ['NY', 'LA', 'SF', 'LA'],
    'Score': [85, 90, 88, 92]
})

# Cleaning pipeline using pipe()
cleaned_data = (data
    .pipe(drop_missing)
    .pipe(select_columns, ['Name', 'Age'])
    .pipe(rename_columns, {'Name': 'Full Name', 'Age': 'Age Years'})
)

print(cleaned_data)
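Output

The printed DataFrame has two rows, Alice (25.0) and David (22.0), under the columns 'Full Name' and 'Age Years'. The rows keep their original index labels, 0 and 3. Age is shown as a float because the original column contained a missing value.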
Important Notes

Each function used with pipe() should take a DataFrame as the first argument and return a DataFrame.
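The sketch below illustrates why the return value matters. pipe() returns whatever the function returns, so a step that modifies the DataFrame in place and returns nothing hands None to the next step. Both functions and the people data are hypothetical examples, not part of the sample program.

```python
import pandas as pd

def strip_names(df):
    # Returns a new DataFrame; the input is left unchanged.
    return df.assign(Name=df["Name"].str.strip())

def strip_names_in_place(df):
    # Modifies the input and has no return statement, so it returns None.
    df["Name"] = df["Name"].str.strip()

# Made-up sample data with stray whitespace.
people = pd.DataFrame({"Name": [" Alice ", "Bob "]})

# Works in a chain: the result is a DataFrame.
cleaned = people.pipe(strip_names)

# Breaks a chain: the result is None, so a following .pipe() would fail.
# A copy is used here so `people` itself is not modified.
result = people.copy().pipe(strip_names_in_place)

print(type(cleaned).__name__)
print(result)
```

Returning a new DataFrame, as strip_names does, also keeps the original data available for comparison or for other pipelines.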

You can add as many cleaning steps as you want by chaining pipe() calls.

This method helps keep your data cleaning code modular and reusable.
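One possible way to reuse a pipeline is to wrap the chain in its own function and apply it to several datasets. In this sketch, clean is a hypothetical wrapper, and the customers and orders tables are made-up data.

```python
import pandas as pd

def drop_missing(df):
    return df.dropna()

def select_columns(df, cols):
    return df[cols]

# Hypothetical wrapper: bundles the steps so they can be reused as one unit.
def clean(df, cols):
    return (df
        .pipe(drop_missing)
        .pipe(select_columns, cols)
    )

# Two made-up datasets that share the Name and Age columns.
customers = pd.DataFrame({
    "Name": ["Alice", None],
    "Age": [25, 40],
    "City": ["NY", "LA"],
})
orders = pd.DataFrame({
    "Name": ["Bob", "Cara"],
    "Age": [31, None],
    "Total": [9.5, 12.0],
})

# The wrapper is itself a function that takes and returns a DataFrame,
# so it can be used with pipe() like any other step.
clean_customers = customers.pipe(clean, ["Name", "Age"])
clean_orders = orders.pipe(clean, ["Name", "Age"])

print(clean_customers)
print(clean_orders)
```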

Summary

pipe() helps you write clear, step-by-step data cleaning code.

You can chain multiple cleaning functions easily with pipe().

Functions used with pipe() should accept and return DataFrames.