Feature engineering basics in Pandas - Time & Space Complexity
When you create new features from data, it helps to know how the running time grows as the dataset gets larger.
We want to understand how the work needed changes when we add more rows or columns.
Analyze the time complexity of the following code snippet.
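import pandas as pd
n = 10  # Example value for n
# Build a DataFrame with n rows and two integer columns
df = pd.DataFrame({
    'A': range(1, n + 1),
    'B': range(n, 0, -1)
})
# Each vectorized expression below processes every row once
df['C'] = df['A'] + df['B']
df['D'] = df['A'] * 2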
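This code creates two new columns: C is the row-by-row sum of A and B, and D is each value of A multiplied by 2.
Identify the loops, recursion, or array traversals that repeat. There is no explicit loop here, but each vectorized expression still traverses its input columns once internally.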
- Primary operation: Adding and multiplying values for each row in the DataFrame.
- How many times: Once per row, so n times where n is the number of rows.
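To make the hidden traversal concrete, the sketch below computes the same two columns with an explicit Python loop and compares the result with the vectorized form. The helper name `add_features_loop` is hypothetical and exists only for illustration; it is not a pandas function, and in practice the vectorized form is the one to prefer.

```python
import pandas as pd


def add_features_loop(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: build C and D with one explicit pass over the rows."""
    out = df.copy()
    c_values, d_values = [], []
    # One iteration per row: n additions and n multiplications in total.
    for a, b in zip(out["A"], out["B"]):
        c_values.append(a + b)
        d_values.append(a * 2)
    out["C"] = c_values
    out["D"] = d_values
    return out


n = 10
df = pd.DataFrame({"A": range(1, n + 1), "B": range(n, 0, -1)})

looped = add_features_loop(df)

# Vectorized equivalent of the snippet above: same per-row work, done inside pandas.
vectorized = df.assign(C=df["A"] + df["B"], D=df["A"] * 2)

print(looped["C"].tolist() == vectorized["C"].tolist())
print(looped["D"].tolist() == vectorized["D"].tolist())
```

Both versions visit each row once, so they share the same growth rate; the vectorized form is usually much faster in absolute terms because the loop runs in compiled code rather than in the Python interpreter.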
As the number of rows grows, the number of operations grows in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 20 (2 operations per row) |
| 100 | About 200 |
| 1000 | About 2000 |
Pattern observation: The work grows directly with the number of rows.
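The counts in the table follow from a simple rule: one addition and one multiplication per row. A minimal sketch of that arithmetic is shown here; `approx_operations` is a hypothetical name used only for this illustration, and it counts abstract operations rather than measuring real time.

```python
def approx_operations(n_rows: int, ops_per_row: int = 2) -> int:
    """Hypothetical counter: one addition and one multiplication per row."""
    return n_rows * ops_per_row


for n in (10, 100, 1000):
    # Multiplying n by 10 multiplies the operation count by 10 as well.
    print(n, approx_operations(n))
```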
Time Complexity: O(n)
This means the time to create new features grows linearly with the data size: doubling the number of rows roughly doubles the work.
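The linear trend can also be checked empirically. The sketch below times the two column assignments at a few sizes using `time.perf_counter`; the helper `time_feature_creation` and the chosen sizes are assumptions made for this example. Actual numbers depend on the machine, and for small n fixed overhead tends to dominate, so the proportional growth only becomes visible at larger sizes.

```python
import time

import pandas as pd


def time_feature_creation(n: int, repeats: int = 5) -> float:
    """Hypothetical helper: best-of-`repeats` wall-clock time in seconds."""
    df = pd.DataFrame({"A": range(1, n + 1), "B": range(n, 0, -1)})
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        df["C"] = df["A"] + df["B"]
        df["D"] = df["A"] * 2
        best = min(best, time.perf_counter() - start)
    return best


# Each size is 10x the previous one, so O(n) predicts roughly 10x the time.
timings = {n: time_feature_creation(n) for n in (10_000, 100_000, 1_000_000)}
for n, seconds in timings.items():
    print(f"n={n:>9,}: {seconds:.6f} s")
```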
[X] Wrong: "Creating new features takes the same time no matter how big the data is."
[OK] Correct: Each row needs to be processed, so more rows mean more work and more time.
Understanding how feature creation scales helps you explain your data preparation steps clearly and shows you know how to handle bigger datasets.
"What if we created new features using pairs of rows instead of single rows? How would the time complexity change?"