Creating new columns in Pandas - Performance & Efficiency
Adding a new column to a pandas DataFrame takes time, and we want to understand how that time changes as the DataFrame gets bigger.
The question is: How does the time to create new columns grow when the number of rows increases?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

n = 10  # example value for n
df = pd.DataFrame({
    'A': range(n),
    'B': range(n, 2 * n)
})
df['C'] = df['A'] + df['B']
```
This code creates a DataFrame with two columns and then adds a new column 'C' by adding columns 'A' and 'B' element-wise.
Identify the repeated work: any loops, recursion, or array traversals in the operation.
- Primary operation: Adding values from columns 'A' and 'B' for each row.
- How many times: Once for each row in the DataFrame (n times).
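To make the per-row work visible, here is a minimal sketch that computes the same column with an explicit Python loop; the loop body runs once per row, which is exactly the n additions pandas performs internally (pandas just does them in vectorized form):

```python
import pandas as pd

n = 10
df = pd.DataFrame({'A': range(n), 'B': range(n, 2 * n)})

# Vectorized: one addition per row, done internally by pandas.
df['C'] = df['A'] + df['B']

# Conceptually equivalent explicit loop: n additions in total.
manual = [df.loc[i, 'A'] + df.loc[i, 'B'] for i in range(n)]

print(df['C'].tolist() == manual)  # → True: both compute the same n sums
```

Both versions do O(n) additions; the vectorized form is only faster by a constant factor, not in complexity class.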
As the number of rows grows, the time to compute the new column grows in direct proportion: each additional row adds one more element-wise addition.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 additions |
| 100 | 100 additions |
| 1000 | 1000 additions |
Pattern observation: Doubling the rows doubles the work needed to create the new column.
Time Complexity: O(n)
This means the time to create a new column grows linearly with the number of rows in the DataFrame.
[X] Wrong: "Creating a new column is instant and does not depend on DataFrame size."
[OK] Correct: Each row must be processed to compute the new column values, so time grows with the number of rows.
Understanding how data size affects operations like adding columns helps you write efficient code and explain your choices clearly in real projects.
"What if we create the new column using a constant value instead of adding two columns? How would the time complexity change?"
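As a starting point for that question, here is a hedged sketch: assigning a scalar broadcasts it across all rows, so pandas still has to write n cells and the operation remains O(n), although it skips the element-wise addition and is therefore cheaper by a constant factor:

```python
import pandas as pd

n = 10
df = pd.DataFrame({'A': range(n), 'B': range(n, 2 * n)})

# Broadcasting a constant: pandas still fills one cell per row,
# so this is O(n) in the number of rows, just with less work per row.
df['D'] = 0

print(len(df['D']))  # → 10: one value written per row
```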