sort_values() by single column in Pandas - Time & Space Complexity
We want to understand how the time needed to sort data changes as the data grows.
How does sorting by one column get slower or faster with more rows?
Analyze the time complexity of the following code snippet.
import pandas as pd
df = pd.DataFrame({
'A': [5, 2, 9, 1, 7],
'B': [3, 8, 6, 4, 0]
})
sorted_df = df.sort_values(by='A')
This code sorts the DataFrame rows based on the values in column 'A'.
- Primary operation: Comparing and rearranging rows based on column 'A' values.
- How many times: The sorting algorithm compares elements multiple times, depending on the number of rows.
Sorting takes more time as the number of rows increases, but not in a simple doubling way.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 to 50 comparisons |
| 100 | About 500 to 700 comparisons |
| 1000 | About 10,000 to 15,000 comparisons |
Pattern observation: The number of operations grows faster than the number of rows, roughly like rows times log of rows.
Time Complexity: O(n log n)
This means sorting takes a bit more than just looking at each row once; it grows a little faster as the data grows.
[X] Wrong: "Sorting by one column is always as fast as just looking through the data once."
[OK] Correct: Sorting needs to compare and reorder many rows, which takes more steps than just reading each row once.
Knowing how sorting time grows helps you explain your code choices clearly and shows you understand how data size affects performance.
"What if we sorted by two columns instead of one? How would the time complexity change?"