Encoding categorical variables in Data Analysis Python - Time & Space Complexity

Time Complexity of encoding categorical variables: O(n x k)

Understanding Time Complexity

When we convert categories into numbers, we want to know how long it takes as data grows.

How does the time needed change when we have more data or more categories?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

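from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# 5,000 rows drawn from 3 unique colors
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red'] * 1000
})

# sparse_output=False returns a dense NumPy array (requires scikit-learn 1.2+)
encoder = OneHotEncoder(sparse_output=False)
# Learn the categories, then build one 0/1 column per category
encoded = encoder.fit_transform(data[['color']])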

This code turns a single column of colors into one column per unique color, where each row holds a 1 in the column for its own color and 0s everywhere else.

Identify Repeating Operations

Identify the loops, recursion, and array traversals that repeat.

  • Primary operation: Filling in a 1 or 0 for each category in each row of the output.
  • How many times: n x k times, once per data row for each of the k unique categories.
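
The repeating operation can be sketched as two nested loops. This is a naive illustration written for this lesson, not how scikit-learn implements `OneHotEncoder` internally; the function name `one_hot_dense` and the `cell_writes` counter are hypothetical.

```python
def one_hot_dense(values):
    """Naive dense one-hot encoding that also counts cell writes."""
    categories = sorted(set(values))      # the k unique categories
    cell_writes = 0
    rows = []
    for value in values:                  # n iterations, one per data row
        row = []
        for category in categories:       # k iterations per row
            row.append(1 if value == category else 0)
            cell_writes += 1
        rows.append(row)
    return categories, rows, cell_writes


colors = ["red", "blue", "green", "blue", "red"]
categories, rows, cell_writes = one_hot_dense(colors)
```

With 5 rows and 3 categories, the inner statement runs 5 x 3 = 15 times, which is the n x k pattern described in the list above.
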
How Execution Grows With Input

As the number of rows or categories grows, the work grows too.

Input Size (n rows)    Approx. Operations
10                     10 x number of categories
100                    100 x number of categories
1000                   1000 x number of categories

Pattern observation: For a fixed number of categories, the work grows linearly with the number of rows; for a fixed number of rows, it grows linearly with the number of categories.
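
The table can be reproduced by counting the cells of a dense one-hot matrix for each input size. The sketch below assumes k = 3 categories, as in the color example, and builds the matrix with NumPy identity-matrix indexing rather than scikit-learn; `cell_counts` is a name introduced here.

```python
import numpy as np

k = 3  # assumed number of unique categories, matching the color example
cell_counts = {}

for n in (10, 100, 1000):
    codes = np.arange(n) % k              # an integer category code for each row
    dense = np.eye(k, dtype=int)[codes]   # pick one identity-matrix row per code
    cell_counts[n] = dense.size           # total cells in the n x k output
```

Each tenfold increase in n gives a tenfold increase in cells (30, 300, 3000), which matches the linear growth in the table.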

Final Time Complexity

Time Complexity: O(n x k)

Here n is the number of data rows and k is the number of unique categories, so the time grows in proportion to both.

Common Mistake

[X] Wrong: "Encoding time depends only on the number of rows."

[OK] Correct: With dense output, each row gets one cell per category (a single 1 and k - 1 zeros), so more categories mean more work even when the row count stays the same.

Interview Connect

Understanding how encoding scales helps you explain data preparation steps clearly and shows you think about efficiency.

Self-Check

"What if we used label encoding instead of one-hot encoding? How would the time complexity change?"