Why sorting and ranking matter in Pandas - Performance Analysis
Sorting and ranking are common tasks in data science to organize data meaningfully.
We want to know how the time needed changes as data grows bigger.
Analyze the time complexity of the following code snippet.
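import pandas as pd
# Five example scores in a single-column DataFrame
data = pd.DataFrame({
    'score': [88, 92, 79, 93, 85]
})
# Return a new DataFrame ordered by score (ascending by default)
sorted_data = data.sort_values(by='score')
# Rank each score in the original frame; tied scores share the lowest rank of their group
data['rank'] = data['score'].rank(method='min')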
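This code sorts a small table by score and then assigns a rank to each score. The two steps are independent: rank() works on the original, unsorted column and does not use sorted_data.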
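To make the result concrete, the sketch below runs the same snippet and prints what each step produces. The three-element `tied` series is hypothetical data, added only to illustrate how `method='min'` treats equal scores.

```python
import pandas as pd

data = pd.DataFrame({'score': [88, 92, 79, 93, 85]})

# sort_values returns a new, ordered DataFrame; `data` itself is unchanged
sorted_data = data.sort_values(by='score')

# rank is computed on the original column and keeps the original row order
data['rank'] = data['score'].rank(method='min')

print(sorted_data['score'].tolist())  # [79, 85, 88, 92, 93]
print(data['rank'].tolist())          # [3.0, 4.0, 1.0, 5.0, 2.0]

# Hypothetical data with a tie: both 90s receive the lowest rank of their group
tied = pd.Series([90, 85, 90]).rank(method='min')
print(tied.tolist())                  # [2.0, 1.0, 2.0]
```

Ranks come back as floats, and each rank lines up with its original row rather than with the sorted order.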
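- Primary operation: Sorting the scores. sort_values orders them directly, and rank has to order them internally before it can assign ranks.
- How many times: A comparison-based sort makes on the order of n log n comparisons for n scores, so the work depends on the data size.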
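As the number of scores grows, both sorting and ranking take more time. Ranking is not a cheap extra pass over already sorted data: rank orders the values itself, so it scales like a sort. The table below estimates the work of one sort as roughly n log2 n operations.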
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 30 to 40 operations |
| 100 | About 600 to 700 operations |
| 1000 | About 10,000 operations |
Pattern observation: The operations grow faster than the input size but not as fast as the square of input size.
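The estimates in the table can be approximated with a few lines of arithmetic. This is a minimal sketch that assumes the operation count is modeled as n log2 n with no constant factor; the helper name `estimated_operations` is illustrative and not part of pandas.

```python
import math

def estimated_operations(n):
    """Rough n * log2(n) model of the comparisons in one sort (no constant factor)."""
    return n * math.log2(n)

# Input sizes taken from the table
estimates = {n: round(estimated_operations(n)) for n in (10, 100, 1000)}

for n, ops in estimates.items():
    # Show the estimate between linear growth (n) and quadratic growth (n squared)
    print(f"n={n:>5}  n={n:>5}  n*log2(n)={ops:>6}  n^2={n * n:>8}")
# estimates == {10: 33, 100: 664, 1000: 9966}
```

Each estimate sits between n and n squared, which is the pattern the observation describes.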
Time Complexity: O(n log n)
This means the time needed grows slightly faster than the number of items: doubling the data a little more than doubles the work, so sorting stays manageable even for large data.
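One way to check this on your own machine is to time both calls at increasing sizes. The sketch below is a rough measurement, not a rigorous benchmark: the sizes, the random scores from 0 to 100, and the helper name `time_sort_and_rank` are all assumptions chosen for illustration, and the numbers will vary between machines and pandas versions.

```python
import time

import numpy as np
import pandas as pd

def time_sort_and_rank(n, seed=0):
    """Return (sort_seconds, rank_seconds) for n random integer scores."""
    rng = np.random.default_rng(seed)
    data = pd.DataFrame({'score': rng.integers(0, 101, size=n)})

    start = time.perf_counter()
    data.sort_values(by='score')
    sort_seconds = time.perf_counter() - start

    start = time.perf_counter()
    data['score'].rank(method='min')
    rank_seconds = time.perf_counter() - start

    return sort_seconds, rank_seconds

# Each size is ten times larger than the previous one
timings = {n: time_sort_and_rank(n) for n in (1_000, 10_000, 100_000)}

for n, (sort_s, rank_s) in timings.items():
    print(f"n={n:>7}  sort={sort_s:.6f}s  rank={rank_s:.6f}s")
```

At small sizes, fixed overhead inside pandas tends to dominate, so the growth pattern usually becomes visible only at the larger sizes.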
[X] Wrong: "Sorting takes the same time no matter how many items there are."
[OK] Correct: Sorting compares many pairs of items, so more items mean more comparisons and longer time.
Understanding how sorting and ranking scale helps you explain your choices clearly when working with data in real projects.
"What if we used a simpler ranking method without sorting first? How would the time complexity change?"