Sorting MultiIndex in Pandas - Time & Space Complexity
Sorting a MultiIndex in pandas means arranging the rows based on multiple levels of labels.
We want to know how the time needed to sort grows as the data size grows.
Analyze the time complexity of the following code snippet.
import pandas as pd
# Create a MultiIndex DataFrame
index = pd.MultiIndex.from_tuples([
('a', 3), ('a', 1), ('b', 2), ('b', 1), ('a', 2)
], names=['letter', 'number'])
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]}, index=index)
# Sort the MultiIndex
sorted_df = df.sort_index()
This code creates a DataFrame with a MultiIndex and sorts it by all index levels.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Comparing and rearranging all rows based on multiple index levels.
- How many times: Each row is compared multiple times during sorting, depending on the sorting algorithm.
Sorting time grows faster than just adding more rows; it grows a bit more than linearly.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 to 40 comparisons |
| 100 | About 700 to 800 comparisons |
| 1000 | About 10,000 to 12,000 comparisons |
Pattern observation: The number of operations grows roughly like n log n, meaning it grows faster than the number of rows but not as fast as their square.
Time Complexity: O(n log n)
This means sorting takes more time as data grows, but it stays efficient enough for large data.
[X] Wrong: "Sorting a MultiIndex is just like scanning through the data once, so it should be O(n)."
[OK] Correct: Sorting requires comparing and rearranging rows multiple times, which takes more than just one pass through the data.
Understanding how sorting scales helps you explain your choices when working with complex data structures like MultiIndex in pandas.
"What if we sorted only one level of the MultiIndex instead of all levels? How would the time complexity change?"