0
0
Pandasdata~5 mins

sort_values() by multiple columns in Pandas - Time & Space Complexity

Choose your learning style9 modes available
Time Complexity: sort_values() by multiple columns
O(n log n)
Understanding Time Complexity

We want to understand how the time needed to sort data changes as the data grows.

Specifically, how sorting by multiple columns affects the work done.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

import pandas as pd

df = pd.DataFrame({
    'A': [3, 1, 2, 1],
    'B': [4, 3, 2, 1],
    'C': [10, 20, 30, 40]
})

sorted_df = df.sort_values(by=['A', 'B'])

This code sorts the DataFrame first by column 'A', then by column 'B' to break ties.

Identify Repeating Operations

Identify the loops, recursion, array traversals that repeat.

  • Primary operation: Comparing and rearranging rows based on column values.
  • How many times: The sorting algorithm compares elements multiple times, roughly proportional to the number of rows times the logarithm of the number of rows.
How Execution Grows With Input

As the number of rows grows, the sorting work grows faster than just the number of rows.

Input Size (n)Approx. Operations
10About 30 to 40 comparisons
100About 600 to 700 comparisons
1000About 10,000 to 12,000 comparisons

Pattern observation: The number of operations grows a bit faster than the number of rows, roughly like n times log n.

Final Time Complexity

Time Complexity: O(n log n)

This means sorting takes more time as data grows, but not as fast as checking every pair; it grows in a balanced way.

Common Mistake

[X] Wrong: "Sorting by multiple columns multiplies the time by the number of columns."

[OK] Correct: Sorting still mainly depends on the number of rows; adding columns to sort by only changes how comparisons are done, not the overall growth rate.

Interview Connect

Understanding how sorting scales helps you explain data processing steps clearly and shows you know how data size affects performance.

Self-Check

"What if we sorted by only one column instead of multiple? How would the time complexity change?"