Why indexing matters in pandas - Performance Analysis
We want to see how using an index in pandas affects how fast operations run.
Does having an index make searching or selecting data faster as the data grows?
Analyze the time complexity of the following code snippet.
import pandas as pd
# Create a DataFrame with 1 million rows
n = 1_000_000
df = pd.DataFrame({
    'id': range(n),
    'value': range(n)
})
# Set 'id' as index
indexed_df = df.set_index('id')
# Select a row by index label
result = indexed_df.loc[500_000]
This code creates a large DataFrame, sets an index on the 'id' column, and selects a row by that index.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Searching for a row by index label using the index structure.
- How many times: The search happens once, but the cost depends on how the index is built and how many rows there are.
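Since the cost depends on how the index is built, it can help to inspect the index before reasoning about lookups. The sketch below is illustrative only: it uses a small DataFrame and offsets the `id` labels by 100 (both assumptions, not part of the original snippet) so that a label and its row position differ. Which lookup strategy pandas uses internally may vary by version; the properties shown are simply the ones such a choice would plausibly depend on.

```python
import pandas as pd

# Small size so the sketch runs instantly; the article uses 1_000_000 rows
n = 1_000
# Assumption: labels start at 100 so that label != position
df = pd.DataFrame({"id": range(100, 100 + n), "value": range(n)})
indexed_df = df.set_index("id")

idx = indexed_df.index
print("unique labels:", idx.is_unique)
print("sorted ascending:", idx.is_monotonic_increasing)

# get_loc maps a label to its integer position in the index
position = idx.get_loc(600)
print("label 600 sits at position", position)

# Fetching by position gives the same row as fetching by label
row_by_position = indexed_df.iloc[position]
row_by_label = indexed_df.loc[600]
assert row_by_position.equals(row_by_label)
```

Here `get_loc` does the label-to-position step that `.loc` relies on, so it is a convenient way to look at the search in isolation.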
When using an index, pandas can find a row without scanning all n rows. The table below assumes a binary search over sorted index labels, which needs about log2(n) steps.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 3-4 steps |
| 100 | About 7 steps |
| 1,000,000 | About 20 steps |
Pattern observation: The number of steps grows logarithmically, not in proportion to the data. Multiplying the number of rows by 10 adds only about 3 steps.
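The step counts in the table can be checked with a plain binary search that counts its own comparisons. This is a teaching sketch in pure Python, not pandas' actual implementation; the function name `binary_search_steps` is made up for this example.

```python
def binary_search_steps(sorted_labels, target):
    """Return (position, steps) for a classic binary search; position is -1 if absent."""
    lo, hi = 0, len(sorted_labels) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_labels[mid] == target:
            return mid, steps
        if sorted_labels[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1, steps


for n in (10, 100, 1_000_000):
    labels = range(n)  # sorted labels 0..n-1, like the 'id' column
    # Searching for the last label is close to the worst case
    position, steps = binary_search_steps(labels, n - 1)
    print(f"n={n:>9,}: found at position {position} after {steps} steps")
```

A binary search over n sorted items never needs more than floor(log2(n)) + 1 comparisons, which gives the limits of 4, 7, and 20 for the three sizes in the table.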
Time Complexity: O(log n)
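This means finding a row by index label takes only about 20 steps even for a million rows. For a unique index, pandas can also use a hash table, which makes the lookup close to O(1) on average, so O(log n) is a conservative estimate.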
[X] Wrong: "Searching by index is as slow as scanning all rows one by one."
[OK] Correct: Because pandas keeps a lookup structure alongside the index labels, it can go almost directly to the right row instead of comparing the label against every row.
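To see the difference in practice, the sketch below compares a boolean-mask scan on the unindexed DataFrame with a `.loc` lookup on the indexed one. It reuses the names from the snippet above; the repeat count of 20 is an arbitrary choice, and the timings are rough and depend on the machine and the pandas version, so treat them as an indication rather than a benchmark.

```python
import timeit

import pandas as pd

n = 1_000_000
df = pd.DataFrame({"id": range(n), "value": range(n)})
indexed_df = df.set_index("id")

# Without an index on 'id': the comparison is evaluated for every row, so the work is O(n)
scan_result = df[df["id"] == 500_000]

# With 'id' as the index: a label lookup through the index
loc_result = indexed_df.loc[500_000]

# Both approaches find the same value
assert scan_result["value"].iloc[0] == loc_result["value"]

# Rough timings over 20 repetitions each
scan_time = timeit.timeit(lambda: df[df["id"] == 500_000], number=20)
loc_time = timeit.timeit(lambda: indexed_df.loc[500_000], number=20)
print(f"boolean-mask scan: {scan_time:.4f} s for 20 runs")
print(f"index .loc lookup: {loc_time:.4f} s for 20 runs")
```

Note that the scan returns a one-row DataFrame while `.loc` with a single label returns a Series, which is why the comparison reads the `value` field from each.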
Knowing how indexing speeds up data access shows you understand how to handle big data efficiently, a key skill in data science.
"What if we select a row without setting an index first? How would the time complexity change?"