Which of the following is the main advantage of using target encoding for categorical variables compared to one-hot encoding?
Think about how target encoding transforms categories compared to one-hot encoding.
Target encoding replaces each category with a single number derived from the target variable, so a categorical variable stays one column no matter how many categories it has. One-hot encoding creates one binary column per category, so dimensionality grows with cardinality.
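The difference in column count can be checked with a small sketch. The `city` column and its values are hypothetical toy data, and the encoding shown is the plain, unsmoothed category mean.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with four distinct levels.
df = pd.DataFrame({
    "city": ["Paris", "Lima", "Oslo", "Cairo", "Lima", "Paris"],
    "target": [1, 0, 1, 0, 1, 1],
})

# One-hot encoding: one binary column per distinct category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Plain (unsmoothed) target encoding: a single numeric column holding
# the mean target of each row's category.
target_encoded = df.groupby("city")["target"].transform("mean")

print(one_hot.shape[1])                    # number of one-hot columns
print(target_encoded.to_frame().shape[1])  # number of target-encoded columns
```

With four distinct cities, one-hot encoding yields four columns while target encoding yields one. Note that this sketch computes the means on all rows, which is only acceptable for illustrating shape, not for model evaluation.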
Given the following data and target encoding with smoothing, what is the encoded value for category 'B'?
import pandas as pd

data = pd.DataFrame({'category': ['A', 'B', 'B', 'C', 'A', 'B'],
                     'target': [1, 0, 1, 0, 1, 1]})

# Global mean of target
global_mean = data['target'].mean()

# Calculate category mean and count
category_stats = data.groupby('category')['target'].agg(['mean', 'count'])

# Smoothing factor
alpha = 2

# Calculate smoothed mean
category_stats['smoothed'] = (
    (category_stats['mean'] * category_stats['count'] + global_mean * alpha)
    / (category_stats['count'] + alpha)
)

encoded_value_B = category_stats.loc['B', 'smoothed']
print(round(encoded_value_B, 3))
Calculate the category mean and count for 'B', then apply the smoothing formula.
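Category 'B' has target values [0, 1, 1], so mean = 0.667 and count = 3. The global mean is 4/6 = 0.667. Smoothing: (0.667*3 + 0.667*2)/(3+2) = (2 + 1.333)/5 = 0.667. Because the mean of 'B' equals the global mean, smoothing leaves it unchanged, and the printed value is 0.667.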
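The smoothing step used in the snippet can be written as a formula. The symbols are notation introduced here for illustration: $n_c$ is the count of category $c$, $\bar{y}_c$ its mean target, $\bar{y}$ the global mean, and $\alpha$ the smoothing factor.

```latex
\[
  \text{smoothed}_c = \frac{n_c \,\bar{y}_c + \alpha \,\bar{y}}{n_c + \alpha}
\]
% For category B: n_B = 3, \bar{y}_B = 2/3, \bar{y} = 2/3, \alpha = 2
\[
  \text{smoothed}_B = \frac{3 \cdot \tfrac{2}{3} + 2 \cdot \tfrac{2}{3}}{3 + 2}
                    = \frac{2 + \tfrac{4}{3}}{5}
                    = \frac{2}{3} \approx 0.667
\]
```

A larger $\alpha$ pulls every category toward the global mean, which matters most for categories with few rows.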
For which type of model is target encoding most beneficial compared to one-hot encoding?
Consider how different models interpret numeric vs categorical features.
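Linear models benefit from target encoding because it converts each category into a single numeric value that relates directly to the target. Some tree implementations (for example LightGBM and CatBoost) can handle categorical features natively, and neural networks often use one-hot encoding or learned embeddings.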
What is a common risk when using target encoding without proper cross-validation, and how does it affect evaluation metrics?
Think about what happens if the target information leaks into the features during training.
Target encoding uses the target variable to encode features. If the encoding is computed on the whole dataset before splitting, target information from the validation or test rows leaks into the features, so evaluation metrics are inflated and overstate performance on unseen data.
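One way to avoid this leak is to split first and fit the encoding on the training rows only. The following is a minimal sketch under stated assumptions: the data is a hypothetical toy set, the split is a fixed positional one for determinism, and unseen categories fall back to the training global mean.

```python
import pandas as pd

# Hypothetical toy data.
data = pd.DataFrame({
    "category": ["A", "B", "B", "C", "A", "B", "C", "A"],
    "target":   [1,   0,   1,   0,   1,   1,   1,   0],
})

# Split first (a fixed positional split keeps the sketch deterministic).
train = data.iloc[:6]
test = data.iloc[6:]

# Fit the encoding on the training rows only.
train_global_mean = train["target"].mean()
train_means = train.groupby("category")["target"].mean()

# Apply the training-derived mapping to the held-out rows; categories
# unseen during training fall back to the training global mean.
test_encoded = test["category"].map(train_means).fillna(train_global_mean)
print(test_encoded.tolist())
```

The held-out rows are encoded without their own targets ever being read. Inside the training set itself, the same idea is usually applied fold by fold (out-of-fold encoding), which this sketch does not show.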
Consider this code snippet for target encoding. What error, if any, will it raise when run?
import pandas as pd

data = pd.DataFrame({'cat': ['x', 'y', 'x', 'z'], 'target': [1, 0, 1, 0]})
means = data.groupby('cat')['target'].mean()

# Incorrect: trying to map means to original data without converting to dict
encoded = data['cat'].map(means)
print(encoded)
Check whether a pandas Series can be passed directly to map as the mapping.
A pandas Series can be passed directly to map; each value is looked up in the Series index. No error occurs, and encoded prints the mapped means (1.0, 0.0, 1.0, 0.0).
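A short sketch can confirm this behaviour. It reuses the snippet's data; the extra category `"w"` is a hypothetical value added to show what happens to categories missing from the mapping.

```python
import pandas as pd

data = pd.DataFrame({"cat": ["x", "y", "x", "z"], "target": [1, 0, 1, 0]})
means = data.groupby("cat")["target"].mean()

# Mapping with the Series and with an equivalent dict gives the same result.
encoded_from_series = data["cat"].map(means)
encoded_from_dict = data["cat"].map(means.to_dict())
print(encoded_from_series.equals(encoded_from_dict))

# A category absent from the mapping (hypothetical value "w") becomes NaN.
new_values = pd.Series(["x", "w"])
print(new_values.map(means).isna().tolist())
```

Unmapped categories come back as NaN rather than raising, so in practice they need an explicit fallback such as the global mean.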