Ordered categories in Pandas - Time & Space Complexity
We want to understand how the time needed to work with ordered categories changes as the data grows.
Specifically, how does sorting or comparing ordered categories scale with more data?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Create a categorical column with an explicit order
cats = pd.Categorical(["low", "medium", "high", "medium", "low"],
                      categories=["low", "medium", "high"], ordered=True)

# Create a DataFrame
df = pd.DataFrame({"priority": cats})

# Sort the DataFrame by the ordered category
sorted_df = df.sort_values(by="priority")
```
This code creates an ordered categorical column and sorts the DataFrame by that column.
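A small sketch of what the sort is actually using: with `ordered=True`, rows are ranked by the declared category order ("low" < "medium" < "high"), not by the alphabetical order of the strings, and the column also supports direct comparisons.

```python
import pandas as pd

# Sketch: the sort follows the declared category order,
# not alphabetical order of the strings.
cats = pd.Categorical(["low", "medium", "high", "medium", "low"],
                      categories=["low", "medium", "high"], ordered=True)
df = pd.DataFrame({"priority": cats})

sorted_df = df.sort_values(by="priority")
print(sorted_df["priority"].tolist())
# ['low', 'low', 'medium', 'medium', 'high']

# Ordered categoricals also support direct comparisons:
print((df["priority"] > "low").tolist())
# [False, True, True, True, False]
```

Note that "high" sorts last even though it comes first alphabetically; that is the category order at work.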
Identify the repeated work: the loops, recursion, and array traversals that run more than once.
- Primary operation: Sorting the DataFrame rows based on the ordered category column.
- How many times: The sorting algorithm compares elements multiple times, roughly proportional to n log n, where n is the number of rows.
As the number of rows increases, the sorting work grows faster than just adding rows.
| Input Size (n) | Approx. Comparisons (n log₂ n) |
|---|---|
| 10 | About 30 comparisons |
| 100 | About 700 comparisons |
| 1000 | About 10,000 comparisons |
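The table's rounded counts can be reproduced by evaluating n · log₂ n directly (a sketch; the base of the logarithm only changes a constant factor):

```python
import math

# Approximate comparison counts for a comparison-based sort: n * log2(n)
for n in [10, 100, 1000]:
    print(n, round(n * math.log2(n)))
# 10 -> 33, 100 -> 664, 1000 -> 9966 (close to the table's rounded values)
```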
Pattern observation: The number of operations grows faster than the number of rows, roughly like n times log n.
Time Complexity: O(n log n)
This means that sorting ordered categories takes more time as data grows, but far less than an approach that compares every pair of rows, which would be O(n²).
[X] Wrong: "Sorting ordered categories is just as fast as scanning the data once."
[OK] Correct: Sorting needs multiple comparisons and rearrangements, which take more time than just looking at each item once.
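The contrast can be seen on the same column: a single-pass operation such as `max` touches each row once, which is O(n), while sorting must compare and rearrange rows, which is O(n log n). A minimal sketch:

```python
import pandas as pd

cats = pd.Categorical(["low", "medium", "high", "medium", "low"],
                      categories=["low", "medium", "high"], ordered=True)
df = pd.DataFrame({"priority": cats})

# One pass, O(n): find the highest priority without rearranging anything
print(df["priority"].max())  # high

# O(n log n): sorting must compare and move rows around
print(df.sort_values(by="priority")["priority"].tolist())
```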
Understanding how sorting ordered categories scales helps you explain data processing steps clearly and shows you know how data size affects performance.
"What if the categories were unordered? How would sorting by that column affect time complexity?"