DataFrame as labeled two-dimensional table in Pandas - Time & Space Complexity
We want to understand how the time needed to work with a DataFrame grows as the table gets bigger. Specifically: how does the time to create a labeled two-dimensional table, and to access its data by label, change with the table's size?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Create a DataFrame with n rows and 3 columns
n = 1000
data = {
    'A': range(n),
    'B': range(n, 2*n),
    'C': range(2*n, 3*n)
}
df = pd.DataFrame(data)

# Access a column by label
col_a = df['A']

# Access a row by label
row_10 = df.loc[10]
```
This code creates a DataFrame with labeled rows and columns, then accesses one column and one row by their labels.
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: Creating the DataFrame involves building arrays for each column with n elements.
- How many times: Each column array is created once with n elements, so operations scale with n.
- Accessing a column or row by label is a direct lookup, done once each here.
As the number of rows n grows, creating the DataFrame takes longer because it builds arrays of size n for each column.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 operations (3 columns x 10 rows) |
| 100 | About 300 operations |
| 1000 | About 3000 operations |
Pattern observation: The operations grow roughly in direct proportion to the number of rows.
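You can check this linear pattern empirically by timing construction at a few sizes. The sketch below wraps the snippet's construction in a helper (`build_frame` is a name introduced here for illustration); absolute times will vary by machine, but each tenfold increase in n should cost roughly tenfold more time:

```python
import time
import pandas as pd

def build_frame(n):
    """Build an n-row, 3-column DataFrame like the snippet above."""
    data = {
        'A': range(n),
        'B': range(n, 2 * n),
        'C': range(2 * n, 3 * n),
    }
    return pd.DataFrame(data)

# Time construction at increasing sizes; expect roughly linear growth.
for n in (10_000, 100_000, 1_000_000):
    start = time.perf_counter()
    df = build_frame(n)
    elapsed = time.perf_counter() - start
    print(f"n={n:>9,}: {elapsed:.4f}s")
```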
Time Complexity: O(n)
This means the time to create and access data grows linearly with the number of rows in the DataFrame.
[X] Wrong: "Accessing a column or row by label takes time proportional to the number of rows."
[OK] Correct: Access by label goes through the index (hash-based for general labels, arithmetic for the default integer index), so it takes roughly constant time, O(1), no matter how many rows there are.
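A quick way to see this is to repeat the same label lookup on a small and a large DataFrame. In this sketch (timings are machine-dependent and only meant to show the trend), the 1,000-lookup loop should take about the same time on 1K rows as on 1M rows:

```python
import time
import pandas as pd

small = pd.DataFrame({'A': range(1_000)})
large = pd.DataFrame({'A': range(1_000_000)})

# Column access by label is a dict-style lookup on column names: O(1).
col = large['A']

# Row access by label goes through the index; with the default integer
# index this is effectively constant time per lookup.
row = large.loc[10]

# Rough timing comparison (absolute numbers vary by machine).
for name, df in (('1K rows', small), ('1M rows', large)):
    start = time.perf_counter()
    for _ in range(1_000):
        _ = df.loc[500]
    print(f"{name}: {time.perf_counter() - start:.4f}s for 1,000 lookups")
```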
Understanding how DataFrame operations scale helps you explain your code choices clearly and shows you know how data size affects performance.
"What if we added many more columns instead of rows? How would the time complexity change?"