One-Hot Encoding in Python Data Analysis - Time & Space Complexity
We want to understand how the time needed to do one-hot encoding changes as the data grows.
How does the work increase when we have more data or more categories?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

def one_hot_encode(df, column):
    categories = df[column].unique()
    for category in categories:
        # One new 0/1 column per category: compare every row to this category.
        df[f'{column}_{category}'] = (df[column] == category).astype(int)
    return df
```
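A quick sanity check on a tiny DataFrame makes the behavior concrete (the `color` column and its values are just an illustration; the function is repeated so the snippet runs on its own):

```python
import pandas as pd

def one_hot_encode(df, column):
    # Same function as above, repeated so this example is self-contained.
    categories = df[column].unique()
    for category in categories:
        df[f'{column}_{category}'] = (df[column] == category).astype(int)
    return df

# 4 rows, 3 unique categories (k = 3).
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})
encoded = one_hot_encode(df, 'color')
print(encoded.columns.tolist())
# The original column plus one new 0/1 column per unique category.
```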
This code creates new columns for each unique category and marks rows with 1 or 0 depending on the category.
- Primary operation: Loop over each unique category and then compare each row to that category.
- How many times: For each category, it checks all rows once.
As the number of rows (n) and the number of unique categories (k) grow, the total work is the product of the two.
| Input Size (n rows) | Approx. Operations (with k categories) |
|---|---|
| 10 rows, 3 categories | About 30 checks |
| 100 rows, 3 categories | About 300 checks |
| 1000 rows, 3 categories | About 3000 checks |
Pattern observation: The work grows roughly by multiplying the number of rows by the number of categories.
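The counts in the table can be reproduced directly: each category triggers one equality check per row, so the total is n * k. A small sketch (assuming n rows cycling evenly through k made-up category labels):

```python
import pandas as pd

def count_comparisons(n_rows, k_categories):
    # Build n rows cycling through k synthetic category labels.
    df = pd.DataFrame({'cat': [f'c{i % k_categories}' for i in range(n_rows)]})
    comparisons = 0
    for category in df['cat'].unique():
        comparisons += len(df)  # the == comparison touches every row once
    return comparisons

for n in (10, 100, 1000):
    print(n, 'rows, 3 categories ->', count_comparisons(n, 3), 'checks')
```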
Time Complexity: O(n * k)
This means the time grows proportionally to the number of rows times the number of unique categories. Space scales the same way: each of the k new columns stores n integers, so the extra memory is also O(n * k).
[X] Wrong: "One-hot encoding time depends only on the number of rows."
[OK] Correct: The number of unique categories also affects time because we create a new column and check each row for each category.
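You can see the category count at work by encoding the same number of rows at two different cardinalities (a sketch using pandas' built-in `get_dummies`, which does the same job as the function above):

```python
import pandas as pd

low_k = pd.Series(['a', 'b'] * 50)                 # 100 rows, 2 categories
high_k = pd.Series([f'c{i}' for i in range(100)])  # 100 rows, 100 categories

# Same row count, very different amounts of work and output width.
print(pd.get_dummies(low_k).shape)
print(pd.get_dummies(high_k).shape)
```

The row count is identical, yet the second encoding produces 50 times as many columns, and 50 times as many row-level checks were needed to fill them.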
Understanding how one-hot encoding scales helps you explain data preparation steps clearly and shows you can think about efficiency in real projects.
"What if we used a sparse matrix representation instead of full columns? How would the time complexity change?"
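One way to explore that question is pandas' own `sparse=True` option on `get_dummies`. Building the encoding still takes roughly O(n * k) time, but sparse storage keeps only the nonzero entries, about one per row, so memory drops from O(n * k) toward O(n). The sizes printed below are illustrative, not exact:

```python
import pandas as pd

# 1000 rows spread over 100 categories: dense storage needs n * k cells,
# while sparse storage keeps only the n nonzero entries (one per row).
colors = pd.Series([f'c{i % 100}' for i in range(1000)])

dense = pd.get_dummies(colors)
sparse = pd.get_dummies(colors, sparse=True)

print('dense bytes: ', dense.memory_usage(deep=True).sum())
print('sparse bytes:', sparse.memory_usage(deep=True).sum())
```

With only a handful of categories the index overhead can make sparse storage larger than dense, so the trade-off pays off mainly at high cardinality.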