Selecting columns by name in Pandas - Time & Space Complexity
When we select columns by their names in pandas, we want to know how the time it takes changes as the data grows.
We ask: How does the work increase when the number of rows or columns grows?
Analyze the time complexity of the following code snippet.
import pandas as pd
df = pd.DataFrame({
    'A': range(1000),
    'B': range(1000, 2000),
    'C': range(2000, 3000)
})
selected = df[['A', 'C']]
This code creates a DataFrame with three columns of 1,000 rows each and selects two of them, A and C, by name.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Accessing and copying the selected columns from the DataFrame.
- How many times: Once per row for each selected column.
As the number of rows grows, the work to copy selected columns grows proportionally.
| Input Size (n rows) | Approx. Operations |
|---|---|
| 10 | 20 (2 columns x 10 rows) |
| 100 | 200 (2 columns x 100 rows) |
| 1000 | 2000 (2 columns x 1000 rows) |
Pattern observation: The operations grow linearly with the number of rows.
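The counts in the table can be reproduced with a short sketch. The helper name `cells_copied` is hypothetical and not part of pandas. It uses the number of cells in the result (`DataFrame.size`, rows times columns) as a stand-in for the "operations" counted above.

```python
import pandas as pd

def cells_copied(n_rows, columns=("A", "C")):
    """Hypothetical helper: count the cells in the selected result."""
    df = pd.DataFrame({
        "A": range(n_rows),
        "B": range(n_rows, 2 * n_rows),
        "C": range(2 * n_rows, 3 * n_rows),
    })
    selected = df[list(columns)]
    # size = number of rows x number of selected columns
    return selected.size

for n in (10, 100, 1000):
    print(n, cells_copied(n))
```

Multiplying the row count by ten multiplies the cell count by ten, which is the linear pattern described here.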
Time Complexity: O(n)
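This means that, for a fixed number of selected columns, the time to select them grows in direct proportion to the number of rows in the DataFrame. More generally, selecting k columns from n rows costs about O(n × k), and because the result typically holds its own copy of the data, it needs about the same amount of extra memory.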
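One way to check this empirically is to time the selection at several row counts. This is a minimal sketch, not a rigorous benchmark: the helper `time_selection` and the chosen sizes are assumptions for illustration, and at small sizes fixed overhead (name lookup, object construction) tends to dominate, so the linear trend usually shows up only for larger DataFrames.

```python
import timeit

import numpy as np
import pandas as pd

def time_selection(n_rows, repeats=20):
    """Hypothetical helper: best-of-N time to select two columns."""
    df = pd.DataFrame({
        "A": np.arange(n_rows),
        "B": np.arange(n_rows),
        "C": np.arange(n_rows),
    })
    # Take the minimum over several runs to reduce noise.
    return min(timeit.repeat(lambda: df[["A", "C"]], number=1, repeat=repeats))

timings = {n: time_selection(n) for n in (1_000, 100_000, 1_000_000)}
for n, seconds in timings.items():
    print(f"{n:>9} rows: {seconds * 1e6:10.1f} us")
```

Exact numbers depend on the machine and the pandas version, so compare the ratios between rows rather than the absolute values.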
[X] Wrong: "Selecting columns by name is instant and does not depend on data size."
[OK] Correct: Looking up the column names is cheap, but pandas generally copies the data for every row in those columns into a new DataFrame, so the time grows with the number of rows.
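The copying behavior can be inspected directly. The sketch below checks whether the selection shares a memory buffer with the original column and then changes the original after the selection was made. Whether the buffers are shared may depend on the pandas version and its copy-on-write settings, so treat the first printed value as something to observe rather than a guarantee.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": range(5),
    "B": range(5, 10),
    "C": range(10, 15),
})
selected = df[["A", "C"]]

# Does the selection share a buffer with the original column?
shares_buffer = np.shares_memory(df["A"].to_numpy(), selected["A"].to_numpy())
print("shares memory with df:", shares_buffer)

# Change the original after selecting; the selection keeps the old value.
df.loc[0, "A"] = 999
print(df.loc[0, "A"], selected.loc[0, "A"])
```

After the assignment, `df` holds 999 in the first row of A while `selected` still holds 0, which shows the selection behaves as independent data.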
Understanding how data selection scales helps you write efficient code and explain your choices clearly in real projects.
"What if we select all columns instead of just a few? How would the time complexity change?"