Categorical Data Type Optimization in pandas: Time & Space Complexity
When working with large DataFrames, converting text (object) columns to the categorical dtype can reduce memory use and speed up operations such as grouping and comparison.
We want to know how the time to convert grows as the data size grows.
Analyze the time complexity of the following code snippet.
import pandas as pd

def optimize_categorical(df, col):
    # Convert the column to the categorical dtype in place on the DataFrame
    df[col] = df[col].astype('category')
    return df

# Example usage:
# df = optimize_categorical(df, 'city')
This code changes a column's data type to categorical to save memory and speed up some operations.
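To make the memory effect concrete, here is a minimal sketch (the `city` column and its values are made up for illustration) that measures a column's memory footprint before and after conversion:

```python
import pandas as pd

# Illustrative DataFrame: a text column with many repeated values
df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Quito', 'Lima', 'Oslo'] * 1000})

before = df['city'].memory_usage(deep=True)
df['city'] = df['city'].astype('category')
after = df['city'].memory_usage(deep=True)

# Repeated strings compress well: each value becomes a small integer code
print(before > after)
```

Because the column has only three distinct values, pandas stores three strings once plus an integer code per row, which is far smaller than one Python string object per row.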
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Scanning the entire column to find unique values and assign codes.
- How many times: Once over all rows in the column.
The time to convert grows roughly in direct proportion to the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks and assignments |
| 100 | About 100 checks and assignments |
| 1000 | About 1000 checks and assignments |
Pattern observation: Doubling the rows roughly doubles the work done.
Time Complexity: O(n)
This means the time to convert grows linearly with the number of rows in the column.
[X] Wrong: "Converting to categorical is instant no matter the data size."
[OK] Correct: The process scans all rows to find unique values, so bigger data takes more time.
Understanding how data type changes affect speed helps you work efficiently with big data in real projects.
"What if the column already has very few unique values? How would that affect the time complexity?"