loc for label-based selection in Pandas - Time & Space Complexity
We want to understand how the time needed to select data using loc changes as the data size grows.
How does the selection time grow when we pick rows or columns by their labels?
Analyze the time complexity of the following code snippet.
import pandas as pd
# Create a DataFrame with n rows and 3 columns
n = 1000
df = pd.DataFrame({
    'A': range(n),
    'B': range(n, 2*n),
    'C': range(2*n, 3*n)
})
# Select rows with labels 100 through 199 (loc includes both endpoints)
subset = df.loc[100:199, ['A', 'B']]
This code creates a DataFrame and selects a slice of rows by label and specific columns using loc.
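Because `loc` slices by label, both endpoints are part of the result. The short sketch below rebuilds the same DataFrame and compares the label slice with a position slice over the same numbers; the names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

n = 1000
df = pd.DataFrame({
    "A": range(n),
    "B": range(n, 2 * n),
    "C": range(2 * n, 3 * n),
})

# Label-based slice: labels 100 and 199 are both included.
by_label = df.loc[100:199, ["A", "B"]]

# Position-based slice: the stop position 199 is excluded.
by_position = df.iloc[100:199, [0, 1]]

print(by_label.shape)       # (100, 2)
print(by_position.shape)    # (99, 2)
print(by_label.index[0], by_label.index[-1])  # 100 199
```

So the selection in the snippet holds k = 100 rows, not 99.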
Identify the loops, recursion, or array traversals that repeat.
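- Primary operation: loc first translates the row and column labels into positions, then copies the selected values into a new DataFrame. With a sorted index such as the default RangeIndex, finding the start and end positions is a cheap lookup rather than a scan of every label, so the copy dominates.
- How many times: The work scales with the number of rows selected (here 100) times the number of columns selected (2).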
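The two steps can be made visible with public Index methods. This is a rough sketch of the idea, not a description of what pandas runs internally: `slice_indexer` turns the label bounds into a positional slice, `get_indexer` does the same for the column labels, and `iloc` then performs the selection by position.

```python
import pandas as pd

n = 1000
df = pd.DataFrame({
    "A": range(n),
    "B": range(n, 2 * n),
    "C": range(2 * n, 3 * n),
})

# Step 1: translate the row label bounds into a positional slice.
row_positions = df.index.slice_indexer(100, 199)
print(row_positions)

# Step 2: translate the column labels into integer positions.
col_positions = df.columns.get_indexer(["A", "B"])
print(col_positions)

# Step 3: select by position; this is where the k rows are actually read.
by_position = df.iloc[row_positions, col_positions]

# The result matches the label-based selection.
by_label = df.loc[100:199, ["A", "B"]]
print(by_position.equals(by_label))
```

Steps 1 and 2 do not depend on how many rows fall inside the slice; only step 3 grows with the selection.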
When selecting a range of rows by label, the time grows roughly with how many rows you pick.
| Rows Selected (k) | Approx. Operations |
|---|---|
| 10 | About 10 rows x 2 columns = 20 operations |
| 100 | About 100 rows x 2 columns = 200 operations |
| 1000 | About 1000 rows x 2 columns = 2000 operations |
Pattern observation: The operations grow linearly with the number of rows selected.
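One way to check this pattern is to count the cells in the result while changing the selection size and the DataFrame size separately. The helper `cells_selected` is a hypothetical name made up for this sketch, and counting cells is only a stand-in for "operations".

```python
import pandas as pd


def cells_selected(n, k):
    """Build an n-row frame, select the first k row labels and 2 columns, return the cell count."""
    df = pd.DataFrame({
        "A": range(n),
        "B": range(n, 2 * n),
        "C": range(2 * n, 3 * n),
    })
    subset = df.loc[0:k - 1, ["A", "B"]]  # inclusive label slice: k rows
    return subset.size                    # rows x columns


# Vary the selection size k while the frame size stays fixed.
for k in (10, 100, 1000):
    print("k =", k, "cells =", cells_selected(n=10_000, k=k))

# Vary the frame size n while the selection size stays fixed.
for n in (1_000, 10_000, 100_000):
    print("n =", n, "cells =", cells_selected(n=n, k=100))
```

The first loop gives 20, 200, and 2000 cells. The second gives 200 every time, because the amount of data in the result follows k, not n.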
Time Complexity: O(k)
This means the time grows linearly with the number of rows you select, where k is the number of selected rows and the number of selected columns is held fixed.
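A simple timing experiment can make the trend concrete. The sketch below uses the standard `timeit` module; the frame size, the values of k, and the helper name `time_selection` are arbitrary choices for illustration, and the measured numbers will differ between machines and pandas versions.

```python
import timeit

import pandas as pd

n = 200_000
df = pd.DataFrame({
    "A": range(n),
    "B": range(n, 2 * n),
    "C": range(2 * n, 3 * n),
})


def time_selection(k, repeats=20):
    """Return the average seconds for one df.loc call that selects k rows and 2 columns."""
    total = timeit.timeit(lambda: df.loc[0:k - 1, ["A", "B"]], number=repeats)
    return total / repeats


timings = {k: time_selection(k) for k in (1_000, 10_000, 100_000)}
for k, seconds in timings.items():
    print(f"k={k:>7}: {seconds * 1e6:.1f} microseconds per call")
```

For small selections the fixed overhead of each `loc` call tends to dominate, so the linear growth usually becomes visible only once k is large.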
[X] Wrong: "Selecting rows by label with loc always takes constant time regardless of selection size."
[OK] Correct: Actually, selecting more rows means more data to access and copy, so time grows with the number of rows chosen.
Understanding how data selection time grows helps you write efficient code and explain your choices clearly in real projects and interviews.
"What if we changed the selection to pick a single row instead of a range? How would the time complexity change?"