Label encoding in Data Analysis Python - Time & Space Complexity
We want to know how the time needed to convert categories into numbers changes as the data grows.
How does the work increase when we have more items to encode?
Analyze the time complexity of the following code snippet.
```python
from sklearn.preprocessing import LabelEncoder


def encode_labels(data):
    # Learn the distinct categories and map each item to an integer code.
    encoder = LabelEncoder()
    encoded = encoder.fit_transform(data)
    return encoded


sample_data = ['cat', 'dog', 'bird', 'cat', 'dog']
encoded_result = encode_labels(sample_data)
```
This code changes a list of categories into numbers using label encoding.
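To make the work involved easier to see, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation; `encode_labels_sketch` is a hypothetical helper, and the only behavior it borrows from `LabelEncoder` is that the distinct classes are kept in sorted order.

```python
def encode_labels_sketch(data):
    """Pure-Python illustration of label encoding (hypothetical helper)."""
    # Collect the distinct categories and sort them, mirroring the sorted
    # order that LabelEncoder exposes in `classes_`.
    classes = sorted(set(data))
    # Map each category to its position in the sorted list.
    to_code = {label: code for code, label in enumerate(classes)}
    # One pass over the input: each item costs one dictionary lookup.
    return classes, [to_code[label] for label in data]


classes, codes = encode_labels_sketch(['cat', 'dog', 'bird', 'cat', 'dog'])
print(classes)  # ['bird', 'cat', 'dog']
print(codes)    # [1, 2, 0, 1, 2]
```

Each category gets the index of its position in the sorted class list, so repeated values such as `'cat'` always receive the same number.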
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Scanning the list of categories to assign numbers.
- How many times: Once over all items in the list.
As the list gets longer, the time to encode grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks and assignments |
| 100 | About 100 checks and assignments |
| 1000 | About 1000 checks and assignments |
Pattern observation: Doubling the input roughly doubles the work.
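The pattern in the table can be checked by counting steps directly. The sketch below assumes a simple dictionary-based encoder (an illustration, not scikit-learn's internals); `count_encoding_steps` is a hypothetical helper that counts one step per item encoded.

```python
def count_encoding_steps(data):
    """Count per-item lookups in a dictionary-based encoder (illustrative)."""
    # Build the category-to-code mapping from the sorted distinct labels.
    to_code = {label: code for code, label in enumerate(sorted(set(data)))}
    steps = 0
    encoded = []
    for label in data:
        encoded.append(to_code[label])  # one lookup and one assignment
        steps += 1
    return steps


categories = ['cat', 'dog', 'bird']
for n in (10, 100, 1000):
    # Repeat the three categories until the list has n items.
    data = [categories[i % len(categories)] for i in range(n)]
    print(n, count_encoding_steps(data))
# 10 10
# 100 100
# 1000 1000
```

The step count matches the input size exactly here because the number of distinct categories stays fixed at three while the list grows.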
Time Complexity: O(n)
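This means the time to encode grows in a straight line with the number of items. The estimate treats the number of distinct categories as small. The encoder also sorts the distinct labels, so when nearly every item is a new category, that sort adds a logarithmic factor on top of the linear scan.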
[X] Wrong: "Label encoding takes the same time no matter how many items there are."
[OK] Correct: The encoder must look at every item at least once, so more items mean more work.
Understanding how encoding scales helps you explain data preparation steps clearly and shows you know how data size affects processing time.
"What if we used one-hot encoding instead of label encoding? How would the time complexity change?"