Sorting by values in Pandas - Time & Space Complexity
When we sort data in pandas, we want to know how the time taken changes as the data grows. The question is: how does sorting time increase as we add more rows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 2, 9, 1, 7],
    'B': [3, 8, 4, 6, 0]
})
sorted_df = df.sort_values(by='A')
```
This code creates a small table and sorts it by the values in column 'A'.
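To make the operation concrete, here is a quick check of what the sort actually produces: column 'A' ends up in ascending order, and each row keeps its original index label.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 2, 9, 1, 7],
    'B': [3, 8, 4, 6, 0]
})
sorted_df = df.sort_values(by='A')

# 'A' is now ascending; the index shows where each row came from.
print(sorted_df['A'].tolist())    # [1, 2, 5, 7, 9]
print(sorted_df.index.tolist())   # [3, 1, 0, 4, 2]
```

Note that `sort_values` returns a new DataFrame by default; the original `df` is left unchanged.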
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: Comparing and rearranging rows based on column 'A' values.
- How many times: The sorting algorithm compares elements many times, roughly proportional to n × log(n), where n is the number of rows.
As the number of rows grows, the sorting work grows a bit faster than just the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 to 40 comparisons |
| 100 | About 600 to 700 comparisons |
| 1000 | About 10,000 to 12,000 comparisons |
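The table's ranges can be sanity-checked by computing n · log₂(n) directly; these are rough approximations of comparison counts, not exact measurements of any particular sorting algorithm:

```python
import math

# Approximate comparison counts for the table's input sizes.
for n in [10, 100, 1000]:
    ops = n * math.log2(n)
    print(f"n = {n}: ~{ops:.0f} comparisons")
# n = 10:   ~33
# n = 100:  ~664
# n = 1000: ~9966
```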
Pattern observation: The work grows faster than just the number of rows, but not as fast as the square of rows.
Time Complexity: O(n log n)
This means sorting takes more time as data grows, but it grows in a balanced way, not too slow or too fast.
[X] Wrong: "Sorting time grows exactly like the number of rows (linear)."
[OK] Correct: Sorting needs to compare many pairs, so it grows a bit faster than just the number of rows.
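One way to see this directly is to count comparisons while sorting. The sketch below uses Python's built-in `sorted` with a counting comparison function (an illustration of comparison sorts in general, not of pandas internals); on random data the count lands well above n but well below n²:

```python
import functools
import random

def count_comparisons(values):
    """Sort a copy of values, counting how many comparisons are made."""
    count = 0
    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)
    sorted(values, key=functools.cmp_to_key(cmp))
    return count

for n in [10, 100, 1000]:
    vals = random.sample(range(n * 10), n)
    print(f"n = {n}: {count_comparisons(vals)} comparisons")
```

For n = 100, the count is typically in the hundreds: far more than 100 (so not linear), far fewer than 10,000 (so not quadratic).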
Understanding sorting time helps you explain how data operations scale, a useful skill when working with real datasets.
"What if we sorted by multiple columns instead of one? How would the time complexity change?"