One-Hot Encoding in Python Data Analysis - Time & Space Complexity
We want to understand how the time needed to do one-hot encoding changes as the data grows.
How does the work increase when we have more data or more categories?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

def one_hot_encode(df, column):
    categories = df[column].unique()
    for category in categories:
        # One new 0/1 column per category: compare every row to this category.
        df[f'{column}_{category}'] = (df[column] == category).astype(int)
    return df
```
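A quick sanity check on a tiny DataFrame makes the behavior concrete (the `color` column and its values are just an illustration; the function is repeated so the snippet runs on its own):

```python
import pandas as pd

def one_hot_encode(df, column):
    # Same function as above, repeated so this example is self-contained.
    categories = df[column].unique()
    for category in categories:
        df[f'{column}_{category}'] = (df[column] == category).astype(int)
    return df

# 4 rows, 3 unique categories (k = 3).
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})
encoded = one_hot_encode(df, 'color')
print(encoded.columns.tolist())
# The original column plus one new 0/1 column per unique category.
```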
This code creates new columns for each unique category and marks rows with 1 or 0 depending on the category.
- Primary operation: Loop over each unique category and then compare each row to that category.
- How many times: For each category, it checks all rows once.
As the number of rows (n) and the number of unique categories (k) grow, the total work is the product of the two.
| Input Size (n rows) | Approx. Operations (with k categories) |
|---|---|
| 10 rows, 3 categories | About 30 checks |
| 100 rows, 3 categories | About 300 checks |
| 1000 rows, 3 categories | About 3000 checks |
Pattern observation: The work grows roughly by multiplying the number of rows by the number of categories.
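The counts in the table can be reproduced directly: each category triggers one equality check per row, so the total is n * k. A small sketch (assuming n rows cycling evenly through k made-up category labels):

```python
import pandas as pd

def count_comparisons(n_rows, k_categories):
    # Build n rows cycling through k synthetic category labels.
    df = pd.DataFrame({'cat': [f'c{i % k_categories}' for i in range(n_rows)]})
    comparisons = 0
    for category in df['cat'].unique():
        comparisons += len(df)  # the == comparison touches every row once
    return comparisons

for n in (10, 100, 1000):
    print(n, 'rows, 3 categories ->', count_comparisons(n, 3), 'checks')
```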
Time Complexity: O(n * k)
This means the time grows proportionally to the number of rows times the number of unique categories. Space scales the same way: each of the k new columns stores n integers, so the extra memory is also O(n * k).
[X] Wrong: "One-hot encoding time depends only on the number of rows."
[OK] Correct: The number of unique categories also affects time because we create a new column and check each row for each category.
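You can see the category count at work by encoding the same number of rows at two different cardinalities (a sketch using pandas' built-in `get_dummies`, which does the same job as the function above):

```python
import pandas as pd

low_k = pd.Series(['a', 'b'] * 50)                 # 100 rows, 2 categories
high_k = pd.Series([f'c{i}' for i in range(100)])  # 100 rows, 100 categories

# Same row count, very different amounts of work and output width.
print(pd.get_dummies(low_k).shape)
print(pd.get_dummies(high_k).shape)
```

The row count is identical, yet the second encoding produces 50 times as many columns, and 50 times as many row-level checks were needed to fill them.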
Understanding how one-hot encoding scales helps you explain data preparation steps clearly and shows you can think about efficiency in real projects.
"What if we used a sparse matrix representation instead of full columns? How would the time complexity change?"
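One way to explore that question is pandas' own `sparse=True` option on `get_dummies`. Building the encoding still takes roughly O(n * k) time, but sparse storage keeps only the nonzero entries, about one per row, so memory drops from O(n * k) toward O(n). The sizes printed below are illustrative, not exact:

```python
import pandas as pd

# 1000 rows spread over 100 categories: dense storage needs n * k cells,
# while sparse storage keeps only the n nonzero entries (one per row).
colors = pd.Series([f'c{i % 100}' for i in range(1000)])

dense = pd.get_dummies(colors)
sparse = pd.get_dummies(colors, sparse=True)

print('dense bytes: ', dense.memory_usage(deep=True).sum())
print('sparse bytes:', sparse.memory_usage(deep=True).sum())
```

With only a handful of categories the index overhead can make sparse storage larger than dense, so the trade-off pays off mainly at high cardinality.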