Bird
Raised Fist0
ML Pythonml~20 mins

Target encoding in ML Python - Practice Problems & Coding Challenges

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Challenge - 5 Problems
🎖️
Target Encoding Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual
intermediate
2:00remaining
Why use target encoding instead of one-hot encoding?

Which of the following is the main advantage of using target encoding for categorical variables compared to one-hot encoding?

ATarget encoding always prevents overfitting better than one-hot encoding.
BTarget encoding converts categorical variables into multiple binary columns.
CTarget encoding reduces dimensionality by converting categories into a single numeric value based on the target variable.
DTarget encoding does not require the target variable during training.
Attempts:
2 left
💡 Hint

Think about how target encoding transforms categories compared to one-hot encoding.

Predict Output
intermediate
2:00remaining
Output of target encoding with smoothing

Given the following data and target encoding with smoothing, what is the encoded value for category 'B'?

ML Python
import pandas as pd

data = pd.DataFrame({'category': ['A', 'B', 'B', 'C', 'A', 'B'], 'target': [1, 0, 1, 0, 1, 1]})

# Global mean of target
global_mean = data['target'].mean()

# Calculate category mean and count
category_stats = data.groupby('category')['target'].agg(['mean', 'count'])

# Smoothing factor
alpha = 2

# Calculate smoothed mean
category_stats['smoothed'] = (category_stats['mean'] * category_stats['count'] + global_mean * alpha) / (category_stats['count'] + alpha)

encoded_value_B = category_stats.loc['B', 'smoothed']
print(round(encoded_value_B, 3))
A0.5
B0.667
C0.75
D0.4
Attempts:
2 left
💡 Hint

Calculate the category mean and count for 'B', then apply the smoothing formula.

Model Choice
advanced
2:00remaining
Choosing target encoding for model type

For which type of model is target encoding most beneficial compared to one-hot encoding?

ANeural networks that require one-hot encoded inputs.
BTree-based models like random forests that handle categorical variables natively.
CK-nearest neighbors models that rely on distance metrics.
DLinear regression models that assume numeric input features.
Attempts:
2 left
💡 Hint

Consider how different models interpret numeric vs categorical features.

Metrics
advanced
2:00remaining
Effect of target encoding on model evaluation metrics

What is a common risk when using target encoding without proper cross-validation, and how does it affect evaluation metrics?

AData leakage causing overly optimistic metrics like accuracy or RMSE on validation data.
BUnderfitting leading to poor training accuracy but good validation accuracy.
CIncreased model bias causing metrics to be consistently low on both train and test sets.
DNo effect on metrics because target encoding is independent of the target variable.
Attempts:
2 left
💡 Hint

Think about what happens if the target information leaks into the features during training.

🔧 Debug
expert
2:00remaining
Debugging target encoding implementation error

Consider this code snippet for target encoding. What error will it raise when run?

ML Python
import pandas as pd

data = pd.DataFrame({'cat': ['x', 'y', 'x', 'z'], 'target': [1, 0, 1, 0]})

means = data.groupby('cat')['target'].mean()

# Incorrect: trying to map means to original data without converting to dict
encoded = data['cat'].map(means)

print(encoded)
ANo error; prints encoded values correctly.
BKeyError because 'cat' values are missing in means index.
CTypeError because map expects a dict but gets a Series.
DValueError due to mismatched index during mapping.
Attempts:
2 left
💡 Hint

Check if pandas Series can be used directly with map for mapping values.

Practice

(1/5)
1. What is the main purpose of target encoding in machine learning?
easy
A. Remove missing values from the dataset
B. Normalize numerical features to a 0-1 scale
C. Create new categorical features by combining existing ones
D. Convert categorical variables into numbers using the average target value

Solution

  1. Step 1: Understand what target encoding does

    Target encoding replaces categories with the average value of the target variable for each category.
  2. Step 2: Compare with other options

    Normalization scales numbers, missing value removal cleans data, and feature creation combines categories, none of which describe target encoding.
  3. Final Answer:

    Convert categorical variables into numbers using the average target value -> Option D
  4. Quick Check:

    Target encoding = average target per category [OK]
Hint: Target encoding uses target averages to convert categories [OK]
Common Mistakes:
  • Confusing target encoding with normalization
  • Thinking target encoding creates new categories
  • Assuming target encoding removes missing data
2. Which of the following Python code snippets correctly applies target encoding using pandas for a training dataset train_df with categorical column cat_col and target target?
easy
A. mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target)
B. train_df['cat_encoded'] = train_df['cat_col'].astype('category').cat.codes
C. train_df['cat_encoded'] = train_df['target'].mean()
D. train_df['cat_encoded'] = train_df['cat_col'].apply(lambda x: len(x))

Solution

  1. Step 1: Identify correct target encoding method

    Target encoding maps each category to the mean target value for that category, done by grouping and mapping.
  2. Step 2: Check code correctness

    mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target) groups by category, calculates mean target, then maps it back correctly. Other options do not compute mean target per category.
  3. Final Answer:

    mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target) -> Option A
  4. Quick Check:

    Group by category and map mean target [OK]
Hint: Group by category and map mean target for encoding [OK]
Common Mistakes:
  • Using category codes instead of target mean
  • Assigning overall mean target to all rows
  • Mapping category length instead of target mean
3. Given the following code, what will be the output of print(test_df['cat_encoded'].tolist())?
import pandas as pd
train_df = pd.DataFrame({'cat_col': ['A', 'B', 'A', 'C'], 'target': [1, 0, 1, 0]})
mean_target = train_df.groupby('cat_col')['target'].mean()
test_df = pd.DataFrame({'cat_col': ['A', 'B', 'C', 'D']})
test_df['cat_encoded'] = test_df['cat_col'].map(mean_target).fillna(0.5)
print(test_df['cat_encoded'].tolist())
medium
A. [1.0, 0.0, 0.0, 0.5]
B. [1.0, 0.0, 0.0, 0.0]
C. [1.0, 0.0, 0.0, NaN]
D. [0.5, 0.5, 0.5, 0.5]

Solution

  1. Step 1: Calculate mean target per category from training data

    'A' has targets [1,1] mean=1.0, 'B' has [0] mean=0.0, 'C' has [0] mean=0.0.
  2. Step 2: Map test categories and fill missing

    Test categories 'A','B','C' map to 1.0,0.0,0.0 respectively. 'D' is missing, so fillna(0.5) sets it to 0.5.
  3. Final Answer:

    [1.0, 0.0, 0.0, 0.5] -> Option A
  4. Quick Check:

    Map known means, fill unknown with 0.5 [OK]
Hint: Fill missing categories with default value after mapping [OK]
Common Mistakes:
  • Not filling missing categories, resulting in NaN
  • Using overall mean instead of per-category mean
  • Miscomputing mean target values
4. You applied target encoding on your training data and then directly applied the same encoding on test data using the training means. However, your model shows signs of overfitting. What is the most likely mistake?
medium
A. You replaced missing values with zero instead of the mean
B. You did not normalize the target variable before encoding
C. You used target encoding on the entire dataset before splitting into train and test
D. You used one-hot encoding instead of target encoding

Solution

  1. Step 1: Understand overfitting cause in target encoding

    Overfitting often happens if target encoding uses information from the test set or entire data before splitting.
  2. Step 2: Identify mistake in data leakage

    Encoding before splitting leaks target info from test data into training, causing overfitting. Other options do not explain this leakage.
  3. Final Answer:

    You used target encoding on the entire dataset before splitting into train and test -> Option C
  4. Quick Check:

    Encoding before split causes data leakage [OK]
Hint: Always fit encoding only on training data to avoid leakage [OK]
Common Mistakes:
  • Encoding before train-test split causing leakage
  • Confusing normalization with encoding
  • Ignoring missing value handling
5. You have a categorical feature with many rare categories in your training data. How can you apply target encoding to reduce overfitting caused by these rare categories?
hard
A. Use one-hot encoding instead of target encoding for rare categories
B. Use smoothing by combining category mean with overall mean weighted by category frequency
C. Apply target encoding only on the most frequent category and ignore others
D. Replace rare categories with a fixed constant before encoding

Solution

  1. Step 1: Understand overfitting from rare categories

    Rare categories have few samples, so their target mean can be noisy and cause overfitting.
  2. Step 2: Apply smoothing to reduce noise

    Smoothing blends the category mean with the overall mean, weighted by how many samples the category has, reducing noise for rare categories.
  3. Final Answer:

    Use smoothing by combining category mean with overall mean weighted by category frequency -> Option B
  4. Quick Check:

    Smoothing balances rare category means with global mean [OK]
Hint: Smooth rare categories by mixing with overall mean [OK]
Common Mistakes:
  • Ignoring rare categories causing noisy means
  • Replacing rare categories with constants losing info
  • Using one-hot encoding which increases dimensionality