Category codes and labels in Pandas - Time & Space Complexity
We want to understand how the time needed to get category codes and labels changes as the data grows.
How does the work grow when we have more data rows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Create a categorical column
cats = pd.Categorical(['apple', 'banana', 'apple', 'orange', 'banana'])

# Get the integer codes for categories
codes = cats.codes

# Get the category labels
labels = cats.categories
```
This code builds a categorical array from five string values, then reads its integer codes (one per row) and its category labels (one per unique value).
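To make the two results concrete, here is what the snippet yields for the five sample values. This is a small sketch; it assumes pandas infers the categories from the data and sorts them, which is what it does for orderable values such as strings. The names codes_list and labels_list are introduced here only for illustration.

```python
import pandas as pd

cats = pd.Categorical(['apple', 'banana', 'apple', 'orange', 'banana'])

# One small integer per row, pointing into the categories index
codes_list = cats.codes.tolist()

# The unique labels, stored only once
labels_list = list(cats.categories)

print(codes_list)   # [0, 1, 0, 2, 1]
print(labels_list)  # ['apple', 'banana', 'orange']
```

Each code is the position of that row's label in the categories, so row 3 ('orange') gets code 2.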
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Assigning a code to each data row by matching its value to a category. This happens when pd.Categorical(...) builds the column, not when cats.codes is read.
- How many times: Once for each row in the data (n times).
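The per-row work can be pictured with a hand-written version of the same idea. This is only an illustrative sketch: assign_codes is a hypothetical helper, not pandas' actual implementation, which does this work in compiled code.

```python
def assign_codes(values):
    """Hypothetical helper: map each value to an integer code in one pass."""
    # Sorted unique labels, mirroring how pandas orders inferred categories
    categories = sorted(set(values))
    # Label -> code lookup table, built once
    lookup = {label: code for code, label in enumerate(categories)}
    # One dictionary lookup per row: this loop runs n times
    codes = [lookup[value] for value in values]
    return codes, categories


codes, categories = assign_codes(['apple', 'banana', 'apple', 'orange', 'banana'])
print(codes)       # [0, 1, 0, 2, 1]
print(categories)  # ['apple', 'banana', 'orange']
```

The list comprehension is the part that repeats once per row; everything else depends only on the number of unique labels.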
As the number of rows grows, the work to assign codes grows in roughly the same proportion: ten times more rows means about ten times more code assignments.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 code assignments |
| 100 | About 100 code assignments |
| 1000 | About 1000 code assignments |
Pattern observation: The work grows directly with the number of rows.
Time Complexity: O(n)
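This means the time to assign codes while building the categorical grows in a straight line as the data gets bigger. Once the categorical exists, reading cats.codes or cats.categories only returns arrays that are already stored, so that step does not loop over the rows again.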
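One way to check the linear pattern is to time the construction at a few sizes. The sketch below is an informal measurement, not a benchmark: time_build is a hypothetical helper, the sizes are arbitrary, and the absolute numbers depend on the machine and the pandas version.

```python
import time

import pandas as pd


def time_build(n):
    """Hypothetical helper: seconds taken to build a Categorical with n rows."""
    fruits = ['apple', 'banana', 'orange']
    values = [fruits[i % 3] for i in range(n)]
    start = time.perf_counter()
    cats = pd.Categorical(values)  # the per-row code assignment happens here
    elapsed = time.perf_counter() - start
    return elapsed, cats


for n in (1_000, 10_000, 100_000):
    elapsed, cats = time_build(n)
    print(f"n={n:>7,}: {elapsed:.5f} s for {len(cats.codes)} codes")
```

If the growth is linear, each tenfold increase in n should increase the measured time by roughly a factor of ten, although small inputs are often dominated by fixed overhead.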
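[X] Wrong: "Assigning category codes is instant no matter how big the data is."
[OK] Correct: Even though the set of categories is fixed, building the categorical must look at each row to assign its code, so creation time grows with data size. Only reading cats.codes from an already built categorical is cheap, because the codes are stored.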
Understanding how category codes work helps you explain data processing speed clearly, a useful skill in real projects and interviews.
"What if we had many more unique categories? How would that affect the time complexity of getting codes?"
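A quick experiment can help explore this question. The sketch below keeps the number of rows fixed and varies the number of unique categories; build_with_k is a hypothetical helper and the sizes are arbitrary. It only reports the array lengths and the codes dtype, so any conclusion about timing is left to your own measurements.

```python
import pandas as pd


def build_with_k(n, k):
    """Hypothetical helper: n rows cycling through k distinct labels."""
    values = [f"cat_{i % k:05d}" for i in range(n)]
    return pd.Categorical(values)


for k in (3, 100, 1_000):
    cats = build_with_k(10_000, k)
    # codes has one entry per row; categories has one entry per unique label
    print(k, len(cats.codes), len(cats.categories), cats.codes.dtype)
```

The codes array always has n entries regardless of k, while the categories grow with k, which is a useful starting point for reasoning about both time and space.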