Memory savings with categoricals in Pandas - Time & Space Complexity
We want to understand how using categorical data affects the time it takes to work with data in pandas.
Specifically, how does changing data type to categorical impact the speed of operations?
Analyze the time complexity of the following code snippet.
import pandas as pd
# Create a large DataFrame with repeated string values
n = 1000000
values = ['apple', 'banana', 'cherry', 'date']
df = pd.DataFrame({'fruit': [values[i % 4] for i in range(n)]})
# Convert the 'fruit' column to categorical type
df['fruit_cat'] = df['fruit'].astype('category')
This code creates a large list of repeated fruit names and converts the column to a categorical type.
- Primary operation: Creating the list of fruit names by repeating 4 values n times.
- How many times: The list is built with n = 1,000,000 items.
- Conversion operation: Mapping each string to a category code during the type conversion.
- How many times: Each of the n items is processed once to assign a category code.
As the number of rows n grows, the time to create the list and convert to categorical grows roughly in a straight line.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 string assignments and 10 category mappings |
| 100 | About 100 string assignments and 100 category mappings |
| 1000 | About 1000 string assignments and 1000 category mappings |
Pattern observation: The work grows directly with the number of rows, so doubling rows doubles work.
Time Complexity: O(n)
This means the time to convert to categorical grows linearly with the number of rows in the data.
[X] Wrong: "Converting to categorical makes operations instantly faster regardless of data size."
[OK] Correct: The conversion itself takes time proportional to data size because each item is processed once. The speedup happens later in some operations, not during conversion.
Understanding how data type changes affect processing time helps you write efficient data code and explain your choices clearly.
What if we changed the number of unique values from 4 to 1000? How would the time complexity of converting to categorical change?