Building cleaning pipelines with pipe() in Pandas - Time & Space Complexity
When we use pipe() in pandas, we chain data cleaning steps smoothly.
We want to know how the time to run these steps grows as the data gets bigger.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# df is assumed to be an existing DataFrame with an 'age' column

def clean_missing(df):
    # Drop rows with any missing values: scans all rows, O(n)
    return df.dropna()

def convert_types(df):
    # Cast the 'age' column to int: touches every row, O(n)
    return df.astype({'age': 'int'})

def filter_data(df):
    # Keep rows where age > 20: evaluates every row, O(n)
    return df[df['age'] > 20]

# Using pipe to chain cleaning steps
cleaned_df = (df.pipe(clean_missing)
                .pipe(convert_types)
                .pipe(filter_data))
```
This code chains three cleaning functions using pipe() on a DataFrame.
Identify the repeated work: any loops, recursion, or full-array traversals.
- Primary operation: Each cleaning function processes the entire DataFrame rows.
- How many times: Three times, once per function in the pipeline.
Each function looks at all rows, so work grows as the number of rows grows.
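To make the repeated traversal visible, here is a sketch where each function reports how many rows it receives. The print statements and the small sample DataFrame are illustrative additions, not part of the original pipeline:

```python
import pandas as pd

def clean_missing(df):
    print(f"clean_missing scans {len(df)} rows")
    return df.dropna()

def convert_types(df):
    print(f"convert_types scans {len(df)} rows")
    return df.astype({'age': 'int'})

def filter_data(df):
    print(f"filter_data scans {len(df)} rows")
    return df[df['age'] > 20]

df = pd.DataFrame({'age': [25.0, None, 30.0, 18.0]})
cleaned_df = (df.pipe(clean_missing)
                .pipe(convert_types)
                .pipe(filter_data))
```

Note that later steps may see fewer rows once `dropna()` has removed some, but in the worst case each of the three steps still scans all n rows.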
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 (3 functions x 10 rows) |
| 100 | About 300 (3 x 100) |
| 1000 | About 3000 (3 x 1000) |
Pattern observation: The total work grows roughly in direct proportion to the number of rows.
Time Complexity: O(n)
This means the time to clean grows linearly as the data size grows.
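You can check the linear growth empirically with a rough timing sketch (absolute times vary by machine; only the growth trend matters):

```python
import time
import pandas as pd

def clean_missing(df):
    return df.dropna()

def convert_types(df):
    return df.astype({'age': 'int'})

def filter_data(df):
    return df[df['age'] > 20]

for n in (10_000, 100_000, 1_000_000):
    df = pd.DataFrame({'age': [25.0] * n})
    start = time.perf_counter()
    df.pipe(clean_missing).pipe(convert_types).pipe(filter_data)
    elapsed = time.perf_counter() - start
    # Expect elapsed time to grow roughly 10x each time n grows 10x
    print(f"n={n:>9}: {elapsed:.4f}s")
```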
[X] Wrong: "Using pipe() makes the cleaning instant or faster regardless of data size."
[OK] Correct: pipe() just chains functions; each still processes all data, so time depends on data size.
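This is easy to verify: `df.pipe(func)` returns exactly what `func(df)` returns, so `pipe()` adds readability, not speed. A minimal check:

```python
import pandas as pd

def clean_missing(df):
    return df.dropna()

df = pd.DataFrame({'age': [25.0, None, 30.0]})

# pipe() simply calls the function with the DataFrame as its first
# argument; the two calls below do the same O(n) work.
via_pipe = df.pipe(clean_missing)
direct = clean_missing(df)
assert via_pipe.equals(direct)
```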
Understanding how chaining cleaning steps affects runtime helps you reason about, and clearly explain, your data preparation pipeline.
"What if one cleaning function inside pipe() only processes a fixed number of columns instead of all rows? How would the time complexity change?"
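One way to explore that question: a step that touches only the column labels, not the rows (here a hypothetical `rename_cols` helper), does work that is independent of the row count, so it is roughly O(1) in n. The pipeline as a whole stays O(n), though, because the slowest step dominates:

```python
import pandas as pd

def clean_missing(df):
    # Scans every row: O(n)
    return df.dropna()

def rename_cols(df):
    # Touches only the column labels, not the row data: roughly O(1)
    # in n (though pandas may still copy data depending on version)
    return df.rename(columns={'age': 'age_years'})

df = pd.DataFrame({'age': [25.0, None, 30.0]})
result = df.pipe(clean_missing).pipe(rename_cols)
# Overall complexity: O(n) + O(1) = O(n)
```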