
Encoding categorical variables in Data Analysis Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of One-Hot Encoding with pandas
What is the output DataFrame after applying one-hot encoding to the 'Color' column using pandas get_dummies?
Data Analysis Python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})
# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default
df_encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
print(df_encoded)
A
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
B
   Blue  Green  Red
0     0      0    1
1     1      0    0
2     0      1    0
3     1      0    0
C
   Color_Blue  Color_Green  Color_Red
0           1            0          0
1           0            1          0
2           0            0          1
3           0            1          0
D
   Color_Blue  Color_Green  Color_Red
0           0            1          0
1           1            0          0
2           0            0          1
3           1            0          0
💡 Hint
Remember that get_dummies creates a new column for each category with 1 where the row matches that category.
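As a small illustration of the hint, the sketch below applies get_dummies to a different, made-up column (the Size data is hypothetical and not part of the challenge). It assumes dtype=int is passed so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical example data, unrelated to the challenge above.
sizes = pd.DataFrame({"Size": ["M", "S", "M", "L"]})

# dtype=int forces 0/1 indicators instead of the boolean default in pandas 2.0+.
encoded = pd.get_dummies(sizes, columns=["Size"], dtype=int)

# New columns are named "<column>_<category>", with categories in sorted order.
print(list(encoded.columns))  # ['Size_L', 'Size_M', 'Size_S']

# Every row matches exactly one category, so each row sums to 1.
print(encoded.sum(axis=1).tolist())
```

The original Size column is dropped and replaced by one indicator column per distinct value.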
Data Output
intermediate
Label Encoding Result
What array is printed after label encoding the fruits list with sklearn's LabelEncoder?
Data Analysis Python
from sklearn.preprocessing import LabelEncoder

fruits = ['apple', 'banana', 'apple', 'orange', 'banana']
encoder = LabelEncoder()
encoded = encoder.fit_transform(fruits)
print(encoded)
A
[0 1 0 2 1]
B
[1 2 1 3 2]
C
[2 1 2 0 1]
D
[0 0 1 2 1]
💡 Hint
LabelEncoder assigns integers starting from 0 to the categories in sorted order, which is alphabetical for lowercase strings.
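To see how the mapping is built, the sketch below fits a LabelEncoder on a made-up list (the cities data is hypothetical) and inspects the learned classes. The mapping dictionary is only an illustration helper, not part of the sklearn API.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical example data, unrelated to the challenge above.
cities = ["Paris", "Lima", "Tokyo", "Lima"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the distinct labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)

# inverse_transform recovers the original labels from the integer codes.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the codes come from sorting, they carry no meaning beyond identifying the category.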
🔧 Debug
advanced
Error in One-Hot Encoding with Unknown Categories
What error will this code raise when transforming new data with unseen categories using sklearn's OneHotEncoder?
Data Analysis Python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='error')
encoder.fit([['cat'], ['dog']])
encoder.transform([['cat'], ['bird']])
A
KeyError: 'bird'
B
ValueError: Found unknown categories ['bird'] in column 0 during transform
C
TypeError: unhashable type: 'list'
D
No error, output is a sparse matrix
💡 Hint
By default, OneHotEncoder raises an error if it sees categories not seen during fit.
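For contrast with the default behavior, the sketch below shows the alternative setting handle_unknown='ignore' on made-up data (the size labels are hypothetical). Under that setting, a category not seen during fit is encoded as a row of zeros instead of stopping the transform.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data, unrelated to the challenge above.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["small"], ["large"]])

# categories_ lists the learned categories per column, in sorted order.
print(encoder.categories_)

# "huge" was never seen during fit, so its row contains only zeros.
rows = encoder.transform([["small"], ["huge"]]).toarray()
print(rows)
```

Whether silently zeroing unseen categories is acceptable depends on the downstream model, so this is a design choice rather than a universal fix.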
🚀 Application
advanced
Choosing Encoding for High Cardinality Feature
You have a categorical feature with 10,000 unique values. Which encoding method is best to reduce memory and avoid too many columns?
A
Label encoding
B
Frequency encoding
C
Binary encoding
D
One-hot encoding
💡 Hint
One-hot encoding creates one column per category, which is large here.
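To put numbers on the column counts, the sketch below builds a made-up feature with 10,000 distinct values and compares how many columns several encodings would need. The binary encoding here is hand-rolled with NumPy bit operations as an illustration; it is an assumption about how such an encoder could work, not the implementation of any particular library.

```python
import math

import numpy as np
import pandas as pd

n_categories = 10_000
# Hypothetical identifiers standing in for a high-cardinality feature.
values = pd.Series([f"id_{i}" for i in range(n_categories)])

# One-hot encoding needs one column per distinct category.
one_hot_cols = values.nunique()

# Binary encoding: write each integer code in base 2, one column per bit.
codes = values.astype("category").cat.codes.to_numpy().astype(np.int64)
n_bits = math.ceil(math.log2(n_categories))
bits = (codes[:, None] >> np.arange(n_bits)) & 1

# Frequency encoding: a single column holding each category's relative frequency.
freq = values.map(values.value_counts(normalize=True))

print("one-hot columns:", one_hot_cols)
print("binary columns:", bits.shape[1])
print("frequency columns:", 1)
```

Label encoding would also produce a single integer column; the trade-off between these options lies in how much information each column layout preserves for the model.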
🧠 Conceptual
expert
Effect of Label Encoding on Tree-Based Models
Why can label encoding categorical variables be problematic for linear models but usually acceptable for tree-based models?
A
Linear models handle missing values better than tree models when using label encoding.
B
Tree models require numeric labels, linear models do not.
C
Label encoding creates dummy variables that confuse linear models but not tree models.
D
Linear models assume numeric order in labels, which can mislead them; tree models split on values without assuming order.
💡 Hint
Think about how models interpret numeric values of categories.
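One way to explore the hint is a toy experiment. The sketch below uses made-up data in which three label-encoded categories (codes 0, 1, 2) have target values that do not rise or fall with the code, then fits a linear model and a decision tree on the raw codes. The data and the choice of regressors are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: three categories encoded as 0, 1, 2.
# The middle category has a high target; the outer two have a low one.
codes = np.array([0, 1, 2] * 20).reshape(-1, 1)
target = np.array([10.0, 50.0, 10.0] * 20)

# The linear model must fit a single slope across the integer codes.
linear = LinearRegression().fit(codes, target)

# The tree can place thresholds between codes and treat each group separately.
tree = DecisionTreeRegressor(random_state=0).fit(codes, target)

linear_r2 = linear.score(codes, target)
tree_r2 = tree.score(codes, target)
print("linear R^2:", round(linear_r2, 3))
print("tree R^2:", round(tree_r2, 3))
```

Scores are computed on the training data only, so they show how well each model can represent the pattern, not how well it generalizes.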