0
0
Pandasdata~5 mins

Memory savings with categoricals in Pandas - Time & Space Complexity

Choose your learning style9 modes available
Time Complexity: Memory savings with categoricals
O(n)
Understanding Time Complexity

We want to understand how using categorical data affects the time it takes to work with data in pandas.

Specifically, how does changing data type to categorical impact the speed of operations?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

import pandas as pd

# Create a large DataFrame with repeated string values
n = 1000000
values = ['apple', 'banana', 'cherry', 'date']
df = pd.DataFrame({'fruit': [values[i % 4] for i in range(n)]})

# Convert the 'fruit' column to categorical type
df['fruit_cat'] = df['fruit'].astype('category')

This code creates a large list of repeated fruit names and converts the column to a categorical type.

Identify Repeating Operations
  • Primary operation: Creating the list of fruit names by repeating 4 values n times.
  • How many times: The list is built with n = 1,000,000 items.
  • Conversion operation: Mapping each string to a category code during the type conversion.
  • How many times: Each of the n items is processed once to assign a category code.
How Execution Grows With Input

As the number of rows n grows, the time to create the list and convert to categorical grows roughly in a straight line.

Input Size (n)Approx. Operations
10About 10 string assignments and 10 category mappings
100About 100 string assignments and 100 category mappings
1000About 1000 string assignments and 1000 category mappings

Pattern observation: The work grows directly with the number of rows, so doubling rows doubles work.

Final Time Complexity

Time Complexity: O(n)

This means the time to convert to categorical grows linearly with the number of rows in the data.

Common Mistake

[X] Wrong: "Converting to categorical makes operations instantly faster regardless of data size."

[OK] Correct: The conversion itself takes time proportional to data size because each item is processed once. The speedup happens later in some operations, not during conversion.

Interview Connect

Understanding how data type changes affect processing time helps you write efficient data code and explain your choices clearly.

Self-Check

What if we changed the number of unique values from 4 to 1000? How would the time complexity of converting to categorical change?