sort_values() by multiple columns in Pandas - Time & Space Complexity
We want to understand how the time needed to sort data changes as the data grows.
Specifically, how sorting by multiple columns affects the work done.
Analyze the time complexity of the following code snippet.
import pandas as pd
df = pd.DataFrame({
'A': [3, 1, 2, 1],
'B': [4, 3, 2, 1],
'C': [10, 20, 30, 40]
})
sorted_df = df.sort_values(by=['A', 'B'])
This code sorts the DataFrame first by column 'A', then by column 'B' to break ties.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Comparing and rearranging rows based on column values.
- How many times: The sorting algorithm compares elements multiple times, roughly proportional to the number of rows times the logarithm of the number of rows.
As the number of rows grows, the sorting work grows faster than just the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 to 40 comparisons |
| 100 | About 600 to 700 comparisons |
| 1000 | About 10,000 to 12,000 comparisons |
Pattern observation: The number of operations grows a bit faster than the number of rows, roughly like n times log n.
Time Complexity: O(n log n)
This means sorting takes more time as data grows, but not as fast as checking every pair; it grows in a balanced way.
[X] Wrong: "Sorting by multiple columns multiplies the time by the number of columns."
[OK] Correct: Sorting still mainly depends on the number of rows; adding columns to sort by only changes how comparisons are done, not the overall growth rate.
Understanding how sorting scales helps you explain data processing steps clearly and shows you know how data size affects performance.
"What if we sorted by only one column instead of multiple? How would the time complexity change?"