Pipe and Method Chaining in Python Data Analysis - Time & Space Complexity
We want to understand how using a pipe for method chaining affects the time it takes to run data operations.
Specifically, how does the total work grow when we chain multiple methods together?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({'A': range(1000), 'B': range(1000, 2000)})

result = (df
          .assign(C=lambda x: x['A'] + x['B'])
          .query('C > 1500')
          .sort_values('C'))
```
This code creates a DataFrame, adds a new column by adding two columns, filters rows, and sorts the result using method chaining.
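The chain above uses plain method calls; the `.pipe()` from the title comes in when a step is a standalone function rather than a DataFrame method. A minimal sketch of the same pipeline written with `.pipe()` (the `add_total` helper is a name invented for illustration):

```python
import pandas as pd

def add_total(d):
    # standalone function slotted into the chain via .pipe()
    return d.assign(C=d['A'] + d['B'])

df = pd.DataFrame({'A': range(1000), 'B': range(1000, 2000)})

result = (df
          .pipe(add_total)       # same work as .assign above, O(n)
          .query('C > 1500')
          .sort_values('C'))
print(len(result))
```

`.pipe()` does not change the cost of a step; it only lets a plain function participate in the chain, so the complexity analysis below applies unchanged.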
Identify the repeated work: the loops, recursion, and array traversals.
- Primary operation: each chained method (`assign`, `query`, `sort_values`) makes a pass over the DataFrame's rows.
- How many times: each method runs once, but each touches all n rows; `sort_values` additionally performs about n log n comparisons.
Each method looks at every row, so the work grows with the number of rows, and sorting grows slightly faster than linearly.
| Input Size (n) | Approx. Operations (2 linear passes + sort) |
|---|---|
| 10 | about 2×10 + 10·log₂(10) ≈ 53 |
| 100 | about 2×100 + 100·log₂(100) ≈ 864 |
| 1000 | about 2×1000 + 1000·log₂(1000) ≈ 11,966 |
Pattern observation: the total work grows almost linearly with input size; the linear passes scale with the number of chained methods, and sorting contributes a slowly growing log factor.
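One way to see how the chain's cost scales is a small back-of-the-envelope model: two linear passes (`assign`, `query`) plus one n log n sort. This is an illustrative sketch of operation counts, not a measurement of pandas internals:

```python
import math

def chain_cost(n, k_linear=2):
    # k_linear full passes over n rows (assign, query),
    # plus roughly n * log2(n) comparisons for the sort
    linear = k_linear * n
    sort = n * math.log2(n) if n > 1 else 0
    return linear + sort

for n in (10, 100, 1000):
    print(n, round(chain_cost(n)))
```

Doubling n slightly more than doubles the total cost, because of the log factor from sorting.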
Time Complexity: O(k·n + n log n)
This means the time grows with both the number of rows n and the number of chained methods k: the linear passes contribute k·n work, and sorting adds an n log n term that dominates for large n.
[X] Wrong: "Chaining methods runs all operations instantly and costs the same as one method."
[OK] Correct: Each method makes its own pass over the data, so the costs add up; sorting is the most expensive step at O(n log n).
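The point is easiest to see by unrolling the chain into separate statements with intermediate DataFrames. Chaining changes readability, not the amount of work; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': range(1000), 'B': range(1000, 2000)})

# Unrolled: three separate passes with named temporaries
step1 = df.assign(C=df['A'] + df['B'])   # full pass: O(n)
step2 = step1.query('C > 1500')          # full pass: O(n)
step3 = step2.sort_values('C')           # sort: O(n log n)

# Chained: same operations, same cost, no temporaries in your namespace
chained = (df
           .assign(C=lambda x: x['A'] + x['B'])
           .query('C > 1500')
           .sort_values('C'))

print(step3.equals(chained))
```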
Understanding how chaining affects time helps you write efficient data code and explain your choices clearly in interviews.
"What if we replaced the sort_values method with a method that only selects a few rows? How would the time complexity change?"
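As a sketch of one possible answer, suppose the final step just takes the first few matching rows with `head` (a substitution chosen for illustration). Dropping the sort removes the n log n term, so every step is a linear pass and the chain becomes O(k·n):

```python
import pandas as pd

df = pd.DataFrame({'A': range(1000), 'B': range(1000, 2000)})

result = (df
          .assign(C=lambda x: x['A'] + x['B'])  # O(n)
          .query('C > 1500')                    # O(n)
          .head(5))                             # cheap slice, no sort
print(result['C'].tolist())
```

If you still need the m smallest or largest values, `nlargest(m, 'C')` avoids a full sort and is cheaper than `sort_values` when m is much smaller than n.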