
One-hot encoding in ML Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this one-hot encoding code?
Consider this Python code using pandas to one-hot encode a categorical column. What is the printed output?
ML Python
import pandas as pd

data = {'color': ['red', 'blue', 'green', 'blue']}
df = pd.DataFrame(data)
dummies = pd.get_dummies(df['color'], dtype=int)  # dtype=int gives 0/1; pandas 2.0+ defaults to True/False
print(dummies)
A
   red  blue  green
0    1     0      0
1    0     1      0
2    0     0      1
3    0     1      0
B
   blue  green  red
0     0      0    1
1     1      0    0
2     0      1    0
3     1      0    0
C
   green  red  blue
0      0    1     0
1      0    0     1
2      1    0     0
3      0    0     1
D
   red  green  blue
0    1      0     0
1    0      0     1
2    0      1     0
3    0      0     1
💡 Hint
Look at the order of columns created by pd.get_dummies and how the rows correspond to the original data.
🧠 Conceptual
intermediate
Why use one-hot encoding for categorical data?
Which of the following is the main reason to use one-hot encoding on categorical features before training a machine learning model?
A. To convert categories into numbers so the model can process them without assuming order.
B. To reduce the number of features and simplify the model.
C. To normalize the data between 0 and 1 for better training.
D. To combine multiple categories into a single numeric value.
💡 Hint
Think about how models interpret numeric inputs and what happens if categories are just assigned numbers.
Hyperparameter
advanced
Choosing the right approach for high-cardinality categorical features
You have a categorical feature with 10,000 unique values. Which encoding approach is best for avoiding excessive memory use?
A. Use standard one-hot encoding, creating 10,000 binary columns.
B. Use label encoding, assigning integers to categories.
C. Use one-hot encoding but drop one category to reduce columns by one.
D. Use target encoding to replace categories with average target values.
💡 Hint
Think about memory and model performance when many categories exist.
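To make the memory side of this hint concrete, the sketch below compares the size of a dense one-hot matrix with the size of plain integer category codes. The row count `n_rows` and the randomly generated `codes` are assumptions made only for illustration; the 10,000 categories come from the question.

```python
import numpy as np

n_rows = 1_000          # hypothetical number of samples
n_categories = 10_000   # cardinality taken from the question

# Hypothetical integer category codes in [0, n_categories), one per row.
rng = np.random.default_rng(0)
codes = rng.integers(0, n_categories, size=n_rows)

# Dense one-hot matrix: one uint8 cell per (row, category) pair.
dense = np.zeros((n_rows, n_categories), dtype=np.uint8)
dense[np.arange(n_rows), codes] = 1

dense_bytes = dense.nbytes                    # n_rows * n_categories bytes
codes_bytes = codes.astype(np.int32).nbytes   # 4 bytes per row

print(dense_bytes, codes_bytes)
```

The dense matrix grows with rows times categories, while a single numeric column per feature grows only with rows.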
Metrics
advanced
Effect of one-hot encoding on model accuracy
You train two models on the same dataset: Model A uses label encoding for a categorical feature, Model B uses one-hot encoding. Model B shows better accuracy. Why?
A. Label encoding always causes models to overfit, reducing accuracy.
B. One-hot encoding reduces the number of features, making training faster and more accurate.
C. One-hot encoding prevents the model from assuming an order in categories, improving accuracy.
D. Label encoding converts categories to strings, which models cannot process.
💡 Hint
Consider how models interpret numeric values from label encoding.
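As a small illustration of this hint, the sketch below measures how far apart categories look under each encoding. The three color names and the integer assignment are hypothetical; any arbitrary assignment would show the same effect.

```python
import numpy as np

categories = ["red", "green", "blue"]

# Label encoding: each category becomes an arbitrary integer (red=0, green=1, blue=2).
label_codes = {cat: i for i, cat in enumerate(categories)}

# One-hot encoding: each category becomes its own unit vector.
identity = np.eye(len(categories))
one_hot = {cat: identity[i] for i, cat in enumerate(categories)}

# With integer codes, "red" looks twice as far from "blue" as from "green".
label_gap_red_green = abs(label_codes["red"] - label_codes["green"])
label_gap_red_blue = abs(label_codes["red"] - label_codes["blue"])

# With one-hot vectors, every pair of distinct categories is equally far apart.
onehot_gap_red_green = np.linalg.norm(one_hot["red"] - one_hot["green"])
onehot_gap_red_blue = np.linalg.norm(one_hot["red"] - one_hot["blue"])

print(label_gap_red_green, label_gap_red_blue)
print(onehot_gap_red_green, onehot_gap_red_blue)
```

A model that uses numeric magnitude or distance will read the unequal gaps in the first line as meaningful, even though they come only from the arbitrary integer assignment.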
🔧 Debug
expert
Why does this one-hot encoding code raise an error?
What error does this code raise and why?
import numpy as np

categories = ['cat', 'dog', 'bird']
values = ['cat', 'dog', 'fish']
one_hot = np.zeros((len(values), len(categories)))
for i, val in enumerate(values):
    idx = categories.index(val)
    one_hot[i, idx] = 1
print(one_hot)
A. ValueError: 'fish' is not in list because 'fish' is not in categories.
B. IndexError: index out of range because 'fish' index is too large.
C. TypeError: unsupported operand type(s) because of wrong data types.
D. No error, prints a one-hot encoded numpy array.
💡 Hint
Check if all values exist in the categories list before indexing.
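One way to act on this hint is to look each value up in a dictionary and decide explicitly what to do with values that are missing. The sketch below leaves such rows as all zeros and records them; that policy, and the names `index_of` and `unknown`, are assumptions introduced here rather than part of the original code.

```python
import numpy as np

categories = ["cat", "dog", "bird"]
values = ["cat", "dog", "fish"]

# Map each known category to its column index once.
index_of = {cat: i for i, cat in enumerate(categories)}

one_hot = np.zeros((len(values), len(categories)), dtype=int)
unknown = []  # values that do not appear in `categories`

for row, val in enumerate(values):
    col = index_of.get(val)  # None when the value is not a known category
    if col is None:
        unknown.append(val)  # leave this row as all zeros
        continue
    one_hot[row, col] = 1

print(one_hot)
print("unknown values:", unknown)
```

Depending on the task, raising a clear error for unknown values may be preferable to silently encoding them as zeros.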

Practice

1. What does one-hot encoding do in machine learning?
easy
A. It converts categorical labels into binary columns with 1s and 0s.
B. It normalizes numerical data to a 0-1 range.
C. It reduces the number of features by combining categories.
D. It fills missing values with the most frequent category.

Solution

  1. Step 1: Understand the purpose of one-hot encoding

    One-hot encoding transforms categorical data into a format that machine learning models can use by creating separate binary columns for each category.
  2. Step 2: Compare options with this definition

    Only option A, "It converts categorical labels into binary columns with 1s and 0s.", describes this process correctly; the others describe different preprocessing steps.
  3. Final Answer:

    It converts categorical labels into binary columns with 1s and 0s. -> Option A
  4. Quick Check:

    One-hot encoding = binary columns [OK]
Hint: One-hot means one column per category, holding a 1 or a 0.
Common Mistakes:
  • Confusing one-hot encoding with normalization
  • Thinking it reduces features instead of expanding
  • Mixing it up with missing value imputation
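The definition in this solution can be sketched without any library. The example below is a minimal illustration using a hypothetical list of color labels; it is not how pandas or scikit-learn implement encoding internally.

```python
labels = ["red", "blue", "green", "blue"]

# One column per distinct category, in sorted order.
categories = sorted(set(labels))

# Each row has a 1 in the column matching its label and 0 elsewhere.
encoded = [[1 if label == cat else 0 for cat in categories] for label in labels]

print(categories)
for label, row in zip(labels, encoded):
    print(label, row)
```

Note that one input column became three output columns: one-hot encoding expands the feature set rather than shrinking it.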
2. Which of the following is the correct way to apply one-hot encoding using pandas in Python?
easy
A. data.encode_onehot('color')
B. data.one_hot_encode('color')
C. pd.onehot(data['color'])
D. pd.get_dummies(data['color'])

Solution

  1. Step 1: Recall pandas function for one-hot encoding

    The pandas library uses the function get_dummies() to perform one-hot encoding on a column.
  2. Step 2: Match the correct syntax

    Only pd.get_dummies(data['color']) uses the correct function and syntax; other options are invalid pandas methods.
  3. Final Answer:

    pd.get_dummies(data['color']) -> Option D
  4. Quick Check:

    pandas one-hot = get_dummies() [OK]
Hint: Use pd.get_dummies() for one-hot encoding in pandas.
Common Mistakes:
  • Using non-existent pandas methods
  • Trying to call one-hot encoding directly on DataFrame without get_dummies
  • Confusing method names
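pd.get_dummies also accepts a whole DataFrame. The sketch below assumes a hypothetical frame with one categorical column and one numeric column, and uses the `columns` and `dtype` parameters to encode only `color` as 0/1 integers.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
data = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size": [1, 2, 3, 4],
})

# Encode only 'color'; 'size' is passed through unchanged.
# dtype=int requests 0/1 integers (pandas 2.0+ returns booleans by default).
encoded = pd.get_dummies(data, columns=["color"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

When `columns` is given, the new column names are prefixed with the original column name, which helps keep track of where each indicator came from.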
3. Given the code:
import pandas as pd
colors = ['red', 'blue', 'green', 'blue']
df = pd.DataFrame({'color': colors})
encoded = pd.get_dummies(df['color'], dtype=int)  # dtype=int gives 0/1; pandas 2.0+ defaults to True/False
print(encoded)

What is the printed output?
medium
A. A list of encoded numbers like [0,1,2,1].
B. An error because get_dummies requires a DataFrame, not a Series.
C. A DataFrame with columns 'red', 'blue', 'green' containing 1s and 0s for each row.
D. A DataFrame with a single column showing the original colors.

Solution

  1. Step 1: Understand pd.get_dummies on a Series

    Applying pd.get_dummies on a Series creates a DataFrame with one column per unique category, filled with 1s and 0s indicating presence.
  2. Step 2: Predict the output for given colors

    Since the colors are 'red', 'blue', 'green', 'blue', the output has one column per unique color. pandas sorts the column labels, so they appear as 'blue', 'green', 'red'; each row has a 1 under its own color and 0s elsewhere.
  3. Final Answer:

    A DataFrame with columns 'red', 'blue', 'green' containing 1s and 0s for each row. -> Option C
  4. Quick Check:

    get_dummies output = binary columns DataFrame [OK]
Hint: get_dummies creates one column per category with 1/0.
Common Mistakes:
  • Expecting numeric labels instead of binary columns
  • Thinking get_dummies returns a list
  • Assuming get_dummies needs a DataFrame, not Series
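The claims in this solution can be checked directly. The sketch below passes `dtype=int`, which is an addition here so that the result holds 0/1 integers regardless of pandas version, then inspects the result type, the column order, and the row sums.

```python
import pandas as pd

colors = ["red", "blue", "green", "blue"]
df = pd.DataFrame({"color": colors})

encoded = pd.get_dummies(df["color"], dtype=int)

# A Series goes in, a DataFrame comes out.
print(type(encoded).__name__)

# Columns come out in sorted order, not order of first appearance.
print(encoded.columns.tolist())

# Each row has exactly one 1, so every row sums to 1.
print(encoded.sum(axis=1).tolist())
```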
4. You wrote this code to one-hot encode a column but get an error:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoder.fit(['red', 'blue', 'green'])

What is the error, and how do you fix it?
medium
A. Error: OneHotEncoder requires numeric input; convert colors to numbers first.
B. Error: input must be 2D array; fix by reshaping input to [['red'], ['blue'], ['green']].
C. Error: OneHotEncoder is deprecated; use pd.get_dummies instead.
D. No error; code runs fine as is.

Solution

  1. Step 1: Identify input shape requirement for OneHotEncoder

    sklearn's OneHotEncoder expects a 2D array of shape (n_samples, n_features), such as a list of lists. Passing a 1D list raises a ValueError.
  2. Step 2: Fix input shape

    Reshape the input to [['red'], ['blue'], ['green']] to make it 2D and avoid the error.
  3. Final Answer:

    Error: input must be 2D array; fix by reshaping input to [['red'], ['blue'], ['green']]. -> Option B
  4. Quick Check:

    OneHotEncoder input = 2D array [OK]
Hint: OneHotEncoder needs 2D input; reshape a 1D list into a list of lists.
Common Mistakes:
  • Passing 1D list instead of 2D array
  • Thinking OneHotEncoder only works with numbers
  • Ignoring sklearn input shape requirements
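The fix can be sketched as follows. Converting to a NumPy array and calling `reshape(-1, 1)` is one common way to get the 2D shape; writing the nested list by hand works equally well. The call to `.toarray()` is used because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = ["red", "blue", "green"]

# reshape(-1, 1) turns the 1D list into a column: 3 samples, 1 feature.
X = np.array(colors).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(X).toarray()  # convert the sparse result to a dense array

print(X.shape)
print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)
```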
5. You have a dataset with a column 'fruit' containing ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']. You want to one-hot encode it but also keep track of the original order and avoid creating extra columns for unseen fruits later. Which approach is best?
hard
A. Use sklearn's OneHotEncoder with handle_unknown='ignore' and fit on training data only.
B. Use pd.get_dummies on the entire dataset including test data.
C. Manually create columns for each fruit and fill 1 or 0 by checking each row.
D. Convert fruits to numbers using label encoding before one-hot encoding.

Solution

  1. Step 1: Understand the need to handle unseen categories

    An encoder fitted on training data can raise an error when test data contains a category it never saw, unless unknown categories are handled explicitly.
  2. Step 2: Choose method that fits training data and ignores unknowns

    sklearn's OneHotEncoder with handle_unknown='ignore' learns its columns from the training data only and encodes any unseen test category as an all-zero row instead of raising an error.
  3. Step 3: Avoid pd.get_dummies on combined data to prevent data leakage

    Running pd.get_dummies on the combined data lets test-set categories shape the training features, which leaks test information; running it separately on each split can produce mismatched columns.
  4. Final Answer:

    Use sklearn's OneHotEncoder with handle_unknown='ignore' and fit on training data only. -> Option A
  5. Quick Check:

    OneHotEncoder with ignore unknown = best practice [OK]
Hint: Fit the encoder on train; ignore unknown categories in test.
Common Mistakes:
  • Using pd.get_dummies on combined train and test data
  • Not handling unknown categories causing errors
  • Label encoding before one-hot causing wrong model input
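The recommended approach can be sketched as below. The training values are the fruits from the question; the test values, including the unseen 'kiwi', are hypothetical and added only to show what happens to an unknown category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array(["apple", "banana", "apple", "orange", "banana", "apple"]).reshape(-1, 1)
# Hypothetical test data containing a fruit ('kiwi') never seen during fit.
test = np.array(["banana", "kiwi"]).reshape(-1, 1)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # categories are learned from the training data only

train_encoded = encoder.transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_[0])
print(test_encoded)
```

The test output keeps the same three columns as the training output, and the 'kiwi' row is all zeros, so no extra column is created for the unseen fruit.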