Why categorical type matters in Pandas - Performance Analysis
We want to see how using the categorical type in pandas affects the time it takes to count values in a column.
How does changing the data type change the work pandas must do?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Create a large DataFrame with repeated categories
n = 100000
categories = ['apple', 'banana', 'cherry']
data = pd.DataFrame({
    # n // 3 repetitions of 3 labels gives 99,999 rows (n rounded down to a multiple of 3)
    'fruit': categories * (n // 3)
})

# Convert to categorical type
cat_data = data['fruit'].astype('category')

# Count the occurrences of each category
counts = cat_data.value_counts()
```
This code builds a DataFrame with a single column of repeated fruit names, converts that column to the categorical type, and counts how many times each fruit appears.
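To make the conversion step concrete, here is a minimal sketch of what a categorical column holds after `astype('category')`: the distinct labels stored once, plus one small integer code per row. The variable name `fruit` and the 12-row size are chosen only for illustration.

```python
import pandas as pd

# A tiny column so the internals are easy to read.
fruit = pd.Series(['apple', 'banana', 'cherry'] * 4).astype('category')

# The distinct labels are stored once...
print(list(fruit.cat.categories))   # ['apple', 'banana', 'cherry']

# ...and every row holds an integer code that points at one of them.
print(fruit.cat.codes.tolist())     # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
print(fruit.cat.codes.dtype)        # int8
```

There is still one code per row, so any operation that counts rows has n values to read.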
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Counting occurrences by scanning every item in the column.
- How many times: Once over all n rows. The conversion with astype('category') is also a single pass over the n rows.
Counting needs to look at each item once, so the work grows with the number of rows.
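The single traversal can be sketched by counting over the integer codes by hand. This is an illustration of the idea, not a claim about how pandas implements `value_counts` internally; `np.bincount` is used here as a stand-in for "visit each code once and bump a tally", and the names `codes` and `manual_counts` are hypothetical.

```python
import numpy as np
import pandas as pd

fruit = pd.Series(['apple', 'banana', 'cherry'] * 1000).astype('category')

# One pass over the n integer codes, with one tally slot per category.
# Assumes no missing values: a missing value has code -1, which bincount rejects.
codes = fruit.cat.codes.to_numpy()
tallies = np.bincount(codes, minlength=len(fruit.cat.categories))

manual_counts = dict(zip(fruit.cat.categories, tallies.tolist()))
print(manual_counts)  # {'apple': 1000, 'banana': 1000, 'cherry': 1000}

# The hand-rolled tally agrees with pandas.
print(manual_counts == fruit.value_counts().to_dict())  # True
```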
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 checks |
| 100 | 100 checks |
| 1000 | 1000 checks |
Pattern observation: The work grows directly with the number of rows.
Time Complexity: O(n)
This means the time to count grows in a straight line as the data size grows.
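One way to check the linear growth empirically is to time `value_counts` at a few sizes. The sketch below assumes a hypothetical helper `time_value_counts` and arbitrary sizes; absolute numbers depend on the machine and pandas version, so no expected output is shown. If the O(n) analysis holds, each tenfold increase in rows should cost roughly ten times as long once n is large enough to dominate fixed overhead.

```python
import time
import pandas as pd

def time_value_counts(n, repeats=5):
    """Best-of-`repeats` wall time in seconds for value_counts on about n categorical rows."""
    fruit = pd.Series(['apple', 'banana', 'cherry'] * (n // 3)).astype('category')
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fruit.value_counts()
        best = min(best, time.perf_counter() - start)
    return best

sizes = [30_000, 300_000, 3_000_000]
timings = {n: time_value_counts(n) for n in sizes}

for n, seconds in timings.items():
    print(f"n={n:>9,}: {seconds * 1e3:.3f} ms")
```

Taking the best of several runs reduces noise from other processes; it does not remove it.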
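[X] Wrong: "Using categorical type makes counting instant or constant time."
[OK] Correct: Even with categories, pandas must still look at every item once to count them all. The categorical type lowers the cost of handling each row, because it works with small integer codes instead of strings, but it does not reduce the number of rows visited.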
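To see the difference between a cheaper pass and no pass at all, the same column can be counted as plain strings and as a categorical. This is a rough sketch with a hypothetical `best_time` helper; timings vary by machine and pandas version, so treat the printed ratio as indicative only.

```python
import time
import pandas as pd

def best_time(func, repeats=5):
    """Best-of-`repeats` wall time in seconds for calling func()."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

n = 300_000
as_strings = pd.Series(['apple', 'banana', 'cherry'] * (n // 3))
as_category = as_strings.astype('category')

string_seconds = best_time(as_strings.value_counts)
category_seconds = best_time(as_category.value_counts)

print(f"strings:     {string_seconds * 1e3:.3f} ms")
print(f"categorical: {category_seconds * 1e3:.3f} ms")
```

Both calls return the same counts. Whatever gap shows up between them is a difference in cost per row; neither version skips rows.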
Understanding how data types affect operation speed helps you write faster code and explain your choices clearly.
"What if we had many more unique categories instead of just a few? How would the time complexity change?"