rank() method and ranking methods in Pandas - Time & Space Complexity

Time Complexity: rank() method and ranking methods

O(n log n)

Understanding Time Complexity

We want to understand how the time needed to rank data grows as the data size grows.

How does the running time of the pandas rank() method change as the number of rows grows?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

import pandas as pd

df = pd.DataFrame({'score': [10, 20, 20, 30, 40]})
df['rank'] = df['score'].rank(method='average')

This code creates a DataFrame and ranks the 'score' column using the average ranking method.
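To see what the method argument changes, the short sketch below ranks the same scores with several of the tie-breaking methods that rank() accepts. The variable names (scores, methods, ranks) are illustrative only and not part of the lesson's snippet.

```python
import pandas as pd

scores = pd.Series([10, 20, 20, 30, 40], name="score")

# Rank the same values with each tie-breaking method; only the two tied 20s differ.
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: scores.rank(method=m) for m in methods})
ranks.insert(0, "score", scores)

print(ranks)
```

With method='average' the two tied scores of 20 share rank 2.5, the mean of positions 2 and 3. The choice of method changes how ties are labelled, not the fact that the values have to be ordered first.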

Identify Repeating Operations

Identify the loops, recursion, and array traversals that repeat.

Primary operation: Sorting the column values to determine rank order.
How many times: One sort over all n rows, which costs about n log n comparisons.
Additional steps: One linear scan over the sorted data to assign ranks and average any ties.
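The sort-then-scan structure described above can be sketched in plain Python. This is a simplified illustration of average ranking, not the actual pandas implementation, and the function name average_rank is hypothetical.

```python
def average_rank(values):
    """Rank values so that ties share the mean of their 1-based positions."""
    n = len(values)
    # Step 1: sort the row positions by value, O(n log n).
    order = sorted(range(n), key=lambda i: values[i])

    ranks = [0.0] * n
    start = 0
    # Step 2: a single linear scan over the sorted order, O(n).
    while start < n:
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) share one averaged 1-based rank.
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


print(average_rank([10, 20, 20, 30, 40]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```

The sort dominates the cost. The scan visits each row once, so it adds only O(n) on top of the O(n log n) sort.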

How Execution Grows With Input

As the number of rows increases, the time to sort and assign ranks grows.

Input Size (n)	Approx. Operations (n * log2(n))
10	About 10 * log2(10) ≈ 33 operations
100	About 100 * log2(100) ≈ 664 operations
1000	About 1000 * log2(1000) ≈ 9966 operations

Pattern observation: The operations grow faster than the number of rows. Each tenfold increase in rows raises the operation count by roughly 15 to 20 times, because sorting adds a log factor on top of the row count.
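The figures in the table come from evaluating n * log2(n), using the base-2 logarithm. A minimal check, with the helper name approx_operations chosen here purely for illustration:

```python
import math


def approx_operations(n):
    """Approximate comparison count for sorting n values: n * log2(n)."""
    return n * math.log2(n)


for n in (10, 100, 1000):
    print(n, round(approx_operations(n)))
# 10 33
# 100 664
# 1000 9966
```

These are rough operation counts rather than measured times. Constant factors and lower-order terms are ignored.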

Final Time Complexity

Time Complexity: O(n log n)

This means the time to rank grows slightly faster than linearly with the number of rows, because the values must be sorted before ranks can be assigned.
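One way to observe this growth is to time rank() on Series of increasing size. The sketch below is an informal measurement, not a rigorous benchmark. The helper time_rank, the sizes, and the random data are assumptions made for illustration, and the measured times will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def time_rank(n, repeats=3):
    """Return the best wall-clock time in seconds for ranking n random integers."""
    rng = np.random.default_rng(0)
    s = pd.Series(rng.integers(0, n, size=n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        s.rank(method="average")
        best = min(best, time.perf_counter() - start)
    return best


# Each size is ten times the previous one.
timings = {n: time_rank(n) for n in (10_000, 100_000, 1_000_000)}
for n, seconds in timings.items():
    print(f"n={n:>9,}: {seconds:.4f} s")
```

If the O(n log n) estimate holds, each tenfold increase in n should cost somewhat more than ten times as long. At small sizes, fixed overhead can hide the trend.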

Common Mistake

[X] Wrong: "Ranking is just a simple pass through the data, so it takes linear time O(n)."

[OK] Correct: Ranking has to sort the values first, and a comparison sort takes O(n log n) time, so the whole operation is O(n log n) rather than O(n).

Interview Connect

Understanding how ranking scales helps you explain performance when working with large datasets in real projects.

Self-Check

What if we used a ranking method that does not require sorting, like assigning ranks based on a pre-sorted column? How would the time complexity change?