Recall & Review
beginner
What is one-hot encoding in machine learning?
One-hot encoding turns categories into numbers by creating one new column per category. Each row has a 1 in the column for its own category and a 0 in every other column.
beginner
Why do we use one-hot encoding instead of just numbers for categories?
Plain integer codes such as 1, 2, 3 can lead the model to treat some categories as bigger or better than others. One-hot encoding represents every category the same way, with no implied order.
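To make the ordering problem concrete, here is a minimal sketch in plain Python. The integer codes and the `hamming` helper are illustrative assumptions, not part of any library.

```python
# Hypothetical integer codes for three colors (the order is arbitrary).
codes = {"Red": 1, "Blue": 2, "Green": 3}

# One-hot vectors for the same three colors.
one_hot = {"Red": [1, 0, 0], "Blue": [0, 1, 0], "Green": [0, 0, 1]}

def hamming(a, b):
    """Count the positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

# With integer codes, Red looks twice as far from Green as from Blue.
gap_red_blue = abs(codes["Red"] - codes["Blue"])
gap_red_green = abs(codes["Red"] - codes["Green"])

# With one-hot vectors, every pair of distinct colors is equally far apart.
one_hot_gaps = {
    hamming(one_hot[a], one_hot[b])
    for a in one_hot
    for b in one_hot
    if a != b
}

print(gap_red_blue, gap_red_green, one_hot_gaps)
```

The integer codes give unequal gaps between colors, while the set of one-hot gaps holds a single value, so no color is treated as closer to or larger than another.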
beginner
How does one-hot encoding handle a category called 'Red' in a color feature with options Red, Blue, Green?
It creates three columns: Red, Blue, Green. For 'Red', the encoded row is [1, 0, 0].
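The Red example can be sketched in a few lines of plain Python. The `one_hot` helper below is a hypothetical name used for illustration; it assumes the column order Red, Blue, Green given in the question.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

# Column order taken from the question: Red, Blue, Green.
categories = ["Red", "Blue", "Green"]

red_row = one_hot("Red", categories)
blue_row = one_hot("Blue", categories)
green_row = one_hot("Green", categories)

print(red_row, blue_row, green_row)
```

Library encoders may order the columns differently (for example alphabetically), so the position of the 1 depends on the column order in use.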
intermediate
What is a potential downside of one-hot encoding when there are many categories?
It can create many new columns, making the data large, sparse (mostly zeros), and slow to work with. This growth in the number of features is linked to the 'curse of dimensionality'.
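A rough sketch of how quickly the column count grows, using pandas. The `user_id` feature and its 500 distinct values are made up for illustration.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 500 rows, each with a distinct ID.
user_ids = pd.Series([f"user_{i}" for i in range(500)], name="user_id")

# dtype=int asks for 1/0 values instead of booleans.
encoded = pd.get_dummies(user_ids, dtype=int)

n_rows, n_cols = encoded.shape

# Share of cells that hold a 1; every other cell is a 0.
nonzero_fraction = encoded.to_numpy().sum() / (n_rows * n_cols)

print(encoded.shape, nonzero_fraction)
```

One input column becomes 500 output columns here, and only one cell per row is non-zero.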
beginner
Can one-hot encoding be used for numerical features?
No, one-hot encoding is for categorical features. Numerical features are usually used as they are or rescaled instead.
What does one-hot encoding do to a categorical feature?
A. Sorts categories alphabetically
B. Creates a new column for each category with 1 or 0 values
C. Replaces categories with random numbers
D. Removes categories from the data
One-hot encoding creates a new column for each category and marks 1 where the category is present, 0 otherwise.
Why is one-hot encoding preferred over assigning numbers like 1, 2, 3 to categories?
A. Because numbers can imply order or size which may mislead the model
B. Because numbers take more memory
C. Because numbers are harder to compute
D. Because numbers are not allowed in machine learning
Assigning numbers can make the model think some categories are bigger or better than others, which is not true for unordered categories.
If a feature has 5 categories, how many columns will one-hot encoding create?
A. 5
B. 1
C. 10
D. 0
One-hot encoding creates one column per category, so 5 categories mean 5 columns.
What problem can arise if a categorical feature has hundreds of categories and you use one-hot encoding?
A. Data becomes too small
B. Model runs faster
C. Data becomes very large and sparse, slowing down the model
D. Categories get merged automatically
Many categories create many columns, making data large and sparse, which can slow down training.
Which type of data is one-hot encoding used for?
A. Numerical continuous data
B. Image data
C. Text data without categories
D. Categorical data
One-hot encoding is specifically for categorical data to convert categories into numbers.
Explain in your own words what one-hot encoding is and why it is useful in machine learning.
Think about how categories are turned into numbers without implying order.
Describe a situation where one-hot encoding might cause problems and how you might handle it.
Consider what happens if you have hundreds of categories.
Practice
1. What does one-hot encoding do in machine learning?
easy
A. It converts categorical labels into binary columns with 1s and 0s.
B. It normalizes numerical data to a 0-1 range.
C. It reduces the number of features by combining categories.
D. It fills missing values with the most frequent category.
Solution
Step 1: Understand the purpose of one-hot encoding
One-hot encoding transforms categorical data into a format that machine learning models can use by creating separate binary columns for each category.
Step 2: Compare options with this definition
Only option A ("It converts categorical labels into binary columns with 1s and 0s.") describes this process correctly; the others describe different preprocessing steps.
Final Answer:
It converts categorical labels into binary columns with 1s and 0s. -> Option A
Quick Check:
One-hot encoding = binary columns [OK]
Hint: One-hot means one column per category with 1 or 0 [OK]
Common Mistakes:
Confusing one-hot encoding with normalization
Thinking it reduces features instead of expanding
Mixing it up with missing value imputation
2. Which of the following is the correct way to apply one-hot encoding using pandas in Python?
easy
A. data.encode_onehot('color')
B. data.one_hot_encode('color')
C. pd.onehot(data['color'])
D. pd.get_dummies(data['color'])
Solution
Step 1: Recall pandas function for one-hot encoding
The pandas library uses the function get_dummies() to perform one-hot encoding on a column.
Step 2: Match the correct syntax
Only pd.get_dummies(data['color']) uses the correct function and syntax; other options are invalid pandas methods.
Final Answer:
pd.get_dummies(data['color']) -> Option D
Quick Check:
pandas one-hot = get_dummies() [OK]
Hint: Use pd.get_dummies() for one-hot encoding in pandas [OK]
Common Mistakes:
Using non-existent pandas methods
Trying to call one-hot encoding directly on DataFrame without get_dummies
Hint: get_dummies creates one column per category with 1/0 [OK]
Common Mistakes:
Expecting numeric labels instead of binary columns
Thinking get_dummies returns a list
Assuming get_dummies needs a DataFrame, not Series
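A short sketch of `pd.get_dummies` applied to a single column (a Series). The `data` frame and its values are made up for illustration; `dtype=int` is passed because recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical example data with one categorical column.
data = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# get_dummies accepts a Series and returns a DataFrame,
# with one column per category found in the data.
encoded = pd.get_dummies(data["color"], dtype=int)

print(type(encoded).__name__)
print(encoded)
```

The result is a DataFrame of binary columns, not a list and not a single column of numeric labels.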
4. You wrote this code to one-hot encode a column but get an error:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoder.fit(['red', 'blue', 'green'])
What is the error, and how do you fix it?
medium
A. Error: OneHotEncoder requires numeric input; convert colors to numbers first.
B. Error: input must be 2D array; fix by reshaping input to [['red'], ['blue'], ['green']].
C. Error: OneHotEncoder is deprecated; use pd.get_dummies instead.
D. No error; code runs fine as is.
Solution
Step 1: Identify input shape requirement for OneHotEncoder
sklearn's OneHotEncoder expects a 2D array (like a list of lists), not a 1D list.
Step 2: Fix input shape
Reshape the input to [['red'], ['blue'], ['green']] to make it 2D and avoid the error.
Final Answer:
Error: input must be 2D array; fix by reshaping input to [['red'], ['blue'], ['green']]. -> Option B
Quick Check:
OneHotEncoder input = 2D array [OK]
Hint: OneHotEncoder needs 2D input, reshape 1D list to list of lists [OK]
Common Mistakes:
Passing 1D list instead of 2D array
Thinking OneHotEncoder only works with numbers
Ignoring sklearn input shape requirements
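A sketch of the fix, assuming scikit-learn and NumPy are installed. It first shows that the 1D list is rejected, then fits the same values as a single-column 2D array. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = ["red", "blue", "green"]

# The 1D list from the question is rejected by scikit-learn's input checks.
raised_error = False
try:
    OneHotEncoder().fit(colors)
except ValueError:
    raised_error = True

# Reshape to one column with three rows: [['red'], ['blue'], ['green']].
colors_2d = np.array(colors).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(colors_2d).toarray()

print(raised_error)
print(encoder.categories_[0])
print(encoded)
```

The encoder stores the learned categories in `categories_`, and the output columns follow that order.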
5. You have a dataset with a column 'fruit' containing ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']. You want to one-hot encode it but also keep track of the original order and avoid creating extra columns for unseen fruits later. Which approach is best?
hard
A. Use sklearn's OneHotEncoder with handle_unknown='ignore' and fit on training data only.
B. Use pd.get_dummies on the entire dataset including test data.
C. Manually create columns for each fruit and fill 1 or 0 by checking each row.
D. Convert fruits to numbers using label encoding before one-hot encoding.
Solution
Step 1: Understand the need to handle unseen categories
An encoder fitted on training data can raise an error on test data that contains categories it has never seen, unless unknown categories are handled explicitly.
Step 2: Choose method that fits training data and ignores unknowns
sklearn's OneHotEncoder with handle_unknown='ignore' fits on training data and safely encodes test data without errors.
Step 3: Avoid pd.get_dummies on combined data to prevent data leakage
Using pd.get_dummies on the combined data leaks test information into training, and using it separately on training and test data may create inconsistent columns.
Final Answer:
Use sklearn's OneHotEncoder with handle_unknown='ignore' and fit on training data only. -> Option A
Quick Check:
OneHotEncoder with ignore unknown = best practice [OK]
Hint: Fit encoder on train, ignore unknown categories in test [OK]
Common Mistakes:
Using pd.get_dummies on combined train and test data
Not handling unknown categories causing errors
Label encoding before one-hot causing wrong model input
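A sketch of the recommended approach, assuming scikit-learn and NumPy are installed. The training values come from the question; the test values, including the unseen fruit 'kiwi', are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training column from the question, shaped as one 2D column.
train = np.array(
    ["apple", "banana", "apple", "orange", "banana", "apple"]
).reshape(-1, 1)

# Hypothetical test column that contains an unseen fruit.
test = np.array(["banana", "kiwi"]).reshape(-1, 1)

# Fit on training data only; unknown categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

train_encoded = encoder.transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_[0])
print(test_encoded)
```

The test output keeps the same three columns learned from the training data, rows stay in their original order, and the unseen fruit does not add a column.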