Converting to categorical in Pandas - Time & Space Complexity
We want to understand how long it takes to convert a column in a DataFrame to a categorical type.
How does the time needed change when the data size grows?
Analyze the time complexity of the following code snippet.
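import pandas as pd

# Build a 5,000-row column that contains only three distinct strings.
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red'] * 1000
})

# Store a categorical copy of the column under a new name.
df['color_cat'] = df['color'].astype('category')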
This code creates a DataFrame with 5,000 rows drawn from three color names, then stores a categorical copy of the 'color' column in a new 'color_cat' column. The original 'color' column is left unchanged.
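To see what the conversion produces, and why it matters for space as well as time, the sketch below inspects the result. It rebuilds the same DataFrame so it can run on its own. The exact byte counts depend on the pandas version and the string dtype in use, so treat the printed numbers as indicative only.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red'] * 1000
})
df['color_cat'] = df['color'].astype('category')

# The distinct values are stored once, as the categories...
categories = list(df['color_cat'].cat.categories)
print(categories)

# ...and each row holds only a small integer code pointing into them.
first_codes = df['color_cat'].cat.codes.head().tolist()
print(first_codes)

# deep=True also counts the bytes used by the string data itself.
original_bytes = df['color'].memory_usage(deep=True)
category_bytes = df['color_cat'].memory_usage(deep=True)
print(original_bytes, category_bytes)
```

With only three distinct values, the categorical column should need far less memory than the string column, because each row stores a one-byte code instead of a full string.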
Identify the loops, recursion, and array traversals that repeat.
- Primary operation: Scanning each value in the column, looking it up among the categories seen so far, and assigning its integer code.
- How many times: Once for each row in the column (n times).
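The repeating operation described above can be pictured as a single pass with a dictionary lookup per row. The function below is a conceptual model only: `assign_codes` is a hypothetical helper written for this illustration, not the routine pandas actually runs, which is implemented in compiled code.

```python
def assign_codes(values):
    """Single-pass, hash-based code assignment (conceptual model only)."""
    code_of = {}   # maps each distinct value to its integer code
    codes = []     # one integer code per input row
    for value in values:                 # runs n times
        if value not in code_of:         # average O(1) dictionary lookup
            code_of[value] = len(code_of)
        codes.append(code_of[value])
    categories = list(code_of)           # distinct values in first-seen order
    return codes, categories


codes, categories = assign_codes(['red', 'blue', 'green', 'blue', 'red'])
print(codes)       # [0, 1, 2, 1, 0]
print(categories)  # ['red', 'blue', 'green']
```

This model numbers categories in the order they first appear. pandas typically orders sortable categories instead, so the codes it reports for the same data can differ even though the per-row work is the same.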
As the number of rows increases, the time to convert grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks and assignments |
| 100 | About 100 checks and assignments |
| 1000 | About 1000 checks and assignments |
Pattern observation: Doubling the input roughly doubles the work needed.
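One way to check this pattern is to time the conversion at a few sizes. The sketch below is a rough measurement, not a benchmark: `time_conversion` is a helper introduced here for illustration, the sizes are arbitrary, and results will vary with hardware and pandas version.

```python
import time

import pandas as pd


def time_conversion(n_rows, repeats=5):
    """Return the best-of-`repeats` time in seconds to convert n_rows values."""
    base = ['red', 'blue', 'green', 'blue', 'red']
    # Repeat the five-value pattern and trim it to exactly n_rows rows.
    values = (base * (n_rows // len(base) + 1))[:n_rows]
    column = pd.Series(values)
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        column.astype('category')
        best = min(best, time.perf_counter() - start)
    return best


sizes = [10_000, 20_000, 40_000, 80_000]
timings = {n: time_conversion(n) for n in sizes}
for n, seconds in timings.items():
    print(f"{n:>7} rows: {seconds * 1e3:.3f} ms")
```

If the growth is linear, each doubling of the row count should roughly double the time. At small sizes, fixed per-call overhead can hide the trend, which is why the sketch starts at 10,000 rows.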
Time Complexity: O(n)
This means the time to convert grows linearly with the number of rows in the column, provided the number of distinct values stays small relative to n.
[X] Wrong: "Converting to categorical is instant no matter the data size."
[OK] Correct: The operation must look at each value to assign categories, so it takes longer with more data.
Understanding how data type conversions scale helps you write efficient data processing code in real projects.
"What if the column already has only a few unique values? How would that affect the time complexity?"