Categorical Data Type Optimization in pandas: Time & Space Complexity
When working with large DataFrames, converting text (object) columns to the categorical dtype can reduce memory use and speed up operations such as grouping and comparison.
We want to know how the time to convert grows as the data size grows.
Analyze the time complexity of the following code snippet.
import pandas as pd

def optimize_categorical(df, col):
    # Convert the column to the categorical dtype in place on the DataFrame
    df[col] = df[col].astype('category')
    return df

# Example usage:
# df = optimize_categorical(df, 'city')
This code changes a column's data type to categorical to save memory and speed up some operations.
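To make the memory effect concrete, here is a minimal sketch (the `city` column and its values are made up for illustration) that measures a column's memory footprint before and after conversion:

```python
import pandas as pd

# Illustrative DataFrame: a text column with many repeated values
df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Quito', 'Lima', 'Oslo'] * 1000})

before = df['city'].memory_usage(deep=True)
df['city'] = df['city'].astype('category')
after = df['city'].memory_usage(deep=True)

# Repeated strings compress well: each value becomes a small integer code
print(before > after)
```

Because the column has only three distinct values, pandas stores three strings once plus an integer code per row, which is far smaller than one Python string object per row.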
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Scanning the entire column to find unique values and assign codes.
- How many times: Once over all rows in the column.
The time to convert grows roughly in direct proportion to the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks and assignments |
| 100 | About 100 checks and assignments |
| 1000 | About 1000 checks and assignments |
Pattern observation: Doubling the rows roughly doubles the work done.
Time Complexity: O(n)
This means the time to convert grows linearly with the number of rows in the column.
[X] Wrong: "Converting to categorical is instant no matter the data size."
[OK] Correct: The process scans all rows to find unique values, so bigger data takes more time.
Understanding how data type changes affect speed helps you work efficiently with big data in real projects.
"What if the column already has very few unique values? How would that affect the time complexity?"