Adding and removing categories in Pandas - Time & Space Complexity
We want to understand how the time needed changes when we add or remove categories in a pandas categorical column.
How does the work grow as the number of categories or data size grows?
Analyze the time complexity of the following code snippet.
import pandas as pd
# Create a categorical column
cat = pd.Categorical(["a", "b", "a", "c"])
# Add a new category
cat = cat.add_categories(["d"])
# Remove a category
cat = cat.remove_categories(["b"])
This code creates a categorical column, then adds a new category, and finally removes an existing category.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Updating the internal categories list and mapping data to new categories.
- How many times: The operations scan through the categories and sometimes the data values once each.
When adding or removing categories, pandas updates the list of categories and may check data values to adjust mappings.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 categories, 1000 data points | Operations scale with categories and data size, roughly scanning categories and data once. |
| 100 categories, 10,000 data points | Operations increase roughly 10 times, scanning more categories and data. |
| 1000 categories, 100,000 data points | Operations increase roughly 100 times, scanning all categories and data. |
Pattern observation: The work grows roughly linearly with the number of categories and data points.
Time Complexity: O(n + k)
This means the time grows linearly with the number of data points (n) plus the number of categories (k).
[X] Wrong: "Adding or removing categories is instant and does not depend on data size."
[OK] Correct: Pandas must update category mappings and sometimes check all data points, so time depends on both categories and data size.
Understanding how operations scale with data size and categories helps you explain performance in real data tasks clearly and confidently.
"What if we add multiple categories at once instead of one by one? How would the time complexity change?"