Challenge - 5 Problems
One-Hot Encoding Master
❓ Predict Output
Difficulty: intermediate
Output of one-hot encoding a small text corpus
What is the output of the following code that one-hot encodes a list of words?
NLP
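from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Each row is one sample with a single categorical feature (the word).
words = np.array([['cat'], ['dog'], ['cat'], ['bird']])

# sparse_output=False requests a dense array (scikit-learn >= 1.2; older releases used sparse=False).
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(words)
print(encoded)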
💡 Hint
Remember that OneHotEncoder orders its columns by the sorted unique categories, which is alphabetical order for these words.
Explanation
The unique words in sorted order are ['bird', 'cat', 'dog'], so 'bird' is [1,0,0], 'cat' is [0,1,0], and 'dog' is [0,0,1]. The input words are ['cat', 'dog', 'cat', 'bird'], so the printed rows are [0. 1. 0.], [0. 0. 1.], [0. 1. 0.], and [1. 0. 0.], which matches option A.
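The column order described above can be checked directly on the fitted encoder. The following is a minimal sketch that relies on the encoder's default sparse output and converts it with `.toarray()`, so it does not depend on how the dense-output parameter is spelled in a given scikit-learn release.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

words = np.array([['cat'], ['dog'], ['cat'], ['bird']])

# The default output is a SciPy sparse matrix; .toarray() makes it dense.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(words).toarray()

# categories_ holds one array per input feature, in column order.
categories = encoder.categories_[0].tolist()
print(categories)  # ['bird', 'cat', 'dog']
print(encoded)
```

Each row has its single 1 in the column of the matching category, so both 'cat' rows light up the middle column.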
🧠 Conceptual
Difficulty: intermediate
Understanding one-hot encoding vocabulary size
If you one-hot encode a text corpus with 10,000 unique words, what will be the size of each one-hot vector?
💡 Hint
One-hot encoding creates a vector with one position for each unique word.
Explanation
One-hot encoding creates a vector whose length equals the vocabulary size, so each vector here has 10,000 entries. Exactly one position is 1, marking the word, and all others are 0.
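To make the vector length concrete, here is a small sketch that builds a one-hot vector by hand with NumPy. The vocabulary (`word0` to `word9999`) and the `one_hot` helper are hypothetical and exist only for illustration.

```python
import numpy as np

vocab_size = 10_000
# Hypothetical vocabulary: word0, word1, ..., word9999.
vocab = [f"word{i}" for i in range(vocab_size)]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length vocab_size with a single 1 at the word's index."""
    vec = np.zeros(vocab_size, dtype=np.int8)
    vec[index[word]] = 1
    return vec

vec = one_hot("word42")
print(len(vec), int(vec.sum()))  # 10000 1
```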
❓ Hyperparameter
Difficulty: advanced
Choosing one-hot encoding parameters for text data
Which parameter of sklearn's OneHotEncoder controls whether the output is a sparse matrix or a dense array?
💡 Hint
This parameter chooses the output format: a memory-saving sparse matrix or a plain dense array.
Explanation
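The 'sparse' parameter, when set to True, returns a sparse matrix, which saves memory for large vocabularies. When set to False, it returns a dense NumPy array. In scikit-learn 1.2 the parameter was renamed to 'sparse_output', and the old name 'sparse' was removed in version 1.4.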
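The memory difference can be illustrated by comparing how many values each format stores. This sketch uses a hypothetical 1,000-word vocabulary and relies on the encoder's default sparse output, converting to a dense array with `.toarray()` for comparison.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical corpus: 1,000 distinct words, one per row.
words = np.array([[f"word{i}"] for i in range(1000)])

encoder = OneHotEncoder()
sparse_encoded = encoder.fit_transform(words)  # SciPy sparse matrix by default
dense_encoded = sparse_encoded.toarray()       # equivalent dense NumPy array

stored_values = sparse_encoded.nnz   # only the non-zero entries are stored
dense_values = dense_encoded.size    # every cell is stored, zeros included
print(issparse(sparse_encoded), stored_values, dense_values)
```

The sparse matrix keeps one stored value per row, while the dense array allocates a cell for every row and column combination.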
❓ Metrics
Difficulty: advanced
Evaluating one-hot encoded text input for a classification model
You trained a classifier on one-hot encoded text data. Which metric best measures how well the model predicts the correct class labels?
💡 Hint
Think about classification performance metrics.
Explanation
Accuracy measures the fraction of correct predictions in classification tasks. Mean Squared Error is for regression, Silhouette Score for clustering, and Perplexity for language models.
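As a quick illustration of accuracy, the sketch below scores a handful of made-up labels and predictions (the data is invented for this example) both by hand and with scikit-learn's `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and model predictions for five documents.
y_true = ["pos", "neg", "pos", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

# Accuracy is the number of correct predictions divided by the total.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

Four of the five predictions match, so both calculations give 0.8.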
🔧 Debug
Difficulty: expert
Debugging one-hot encoding with unseen words during inference
You trained a OneHotEncoder on a training set and saved it. At inference, you try to transform new text containing words not seen during training. What error will sklearn's OneHotEncoder raise by default?
💡 Hint
Check how OneHotEncoder handles unknown categories by default.
Explanation
By default (handle_unknown='error'), OneHotEncoder raises a ValueError if transform sees categories that were not present during fit. To avoid this, set handle_unknown='ignore', which encodes each unknown category as an all-zero row.
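Both behaviours can be reproduced in a few lines. This is a minimal sketch with a made-up training vocabulary and an unseen word, 'fish'; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['cat'], ['dog'], ['bird']])
new = np.array([['cat'], ['fish']])  # 'fish' was never seen during fit

# Default behaviour: unknown categories raise a ValueError at transform time.
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# With handle_unknown='ignore', unknown categories become all-zero rows.
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(train)
encoded_new = lenient.transform(new).toarray()
print(encoded_new)
```

The 'cat' row keeps its usual encoding, while the 'fish' row contains only zeros, so downstream code should be prepared for rows with no active column.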