Target encoding is a way to turn categories into numbers using the target values. The main goal is to improve model predictions. So, the key metrics to check are model accuracy, mean squared error (MSE) for regression, or accuracy, precision, recall, and F1-score for classification. These metrics tell us if the encoding helps the model learn patterns well.
Target encoding in ML Python - Model Metrics & Evaluation
Start learning this pattern below
Jump into concepts and practice - no test required
For classification tasks using target encoding, a confusion matrix helps us see how well the model predicts each class.
Actual \ Predicted | Positive | Negative
-------------------|----------|---------
Positive | 80 | 20
Negative | 10 | 90
Here, True Positives (TP) = 80, False Positives (FP) = 10, True Negatives (TN) = 90, False Negatives (FN) = 20.
Target encoding can cause overfitting if not done carefully. High precision means the model is good at predicting positives correctly without many false alarms. High recall means it finds most of the positive cases.
For example, in fraud detection, recall is more important because missing fraud is costly. If target encoding leaks information from the target, recall might look very high but the model fails on new data.
Good: Balanced precision and recall, stable accuracy on training and test sets, and low error rates. This means target encoding helped the model learn useful patterns without overfitting.
Bad: Very high accuracy on training but low on test, or very high precision but low recall. This suggests target leakage or overfitting caused by improper target encoding.
- Data leakage: Using target values from the same data row to encode can leak information and inflate metrics.
- Overfitting: Target encoding can cause the model to memorize training data, leading to poor generalization.
- Accuracy paradox: High accuracy might hide poor recall or precision, especially with imbalanced classes.
- Ignoring validation: Not using proper cross-validation when applying target encoding can give misleading metrics.
Your model uses target encoding and shows 98% accuracy but only 12% recall on fraud cases. Is it good for production?
Answer: No. The low recall means the model misses most fraud cases, which is dangerous. The high accuracy is misleading because fraud is rare. You should improve recall before using this model.
Practice
target encoding in machine learning?Solution
Step 1: Understand what target encoding does
Target encoding replaces categories with the average value of the target variable for each category.Step 2: Compare with other options
Normalization scales numbers, missing value removal cleans data, and feature creation combines categories, none of which describe target encoding.Final Answer:
Convert categorical variables into numbers using the average target value -> Option DQuick Check:
Target encoding = average target per category [OK]
- Confusing target encoding with normalization
- Thinking target encoding creates new categories
- Assuming target encoding removes missing data
train_df with categorical column cat_col and target target?Solution
Step 1: Identify correct target encoding method
Target encoding maps each category to the mean target value for that category, done by grouping and mapping.Step 2: Check code correctness
mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target) groups by category, calculates mean target, then maps it back correctly. Other options do not compute mean target per category.Final Answer:
mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target) -> Option AQuick Check:
Group by category and map mean target [OK]
- Using category codes instead of target mean
- Assigning overall mean target to all rows
- Mapping category length instead of target mean
print(test_df['cat_encoded'].tolist())?
import pandas as pd
train_df = pd.DataFrame({'cat_col': ['A', 'B', 'A', 'C'], 'target': [1, 0, 1, 0]})
mean_target = train_df.groupby('cat_col')['target'].mean()
test_df = pd.DataFrame({'cat_col': ['A', 'B', 'C', 'D']})
test_df['cat_encoded'] = test_df['cat_col'].map(mean_target).fillna(0.5)
print(test_df['cat_encoded'].tolist())Solution
Step 1: Calculate mean target per category from training data
'A' has targets [1,1] mean=1.0, 'B' has [0] mean=0.0, 'C' has [0] mean=0.0.Step 2: Map test categories and fill missing
Test categories 'A','B','C' map to 1.0,0.0,0.0 respectively. 'D' is missing, so fillna(0.5) sets it to 0.5.Final Answer:
[1.0, 0.0, 0.0, 0.5] -> Option AQuick Check:
Map known means, fill unknown with 0.5 [OK]
- Not filling missing categories, resulting in NaN
- Using overall mean instead of per-category mean
- Miscomputing mean target values
Solution
Step 1: Understand overfitting cause in target encoding
Overfitting often happens if target encoding uses information from the test set or entire data before splitting.Step 2: Identify mistake in data leakage
Encoding before splitting leaks target info from test data into training, causing overfitting. Other options do not explain this leakage.Final Answer:
You used target encoding on the entire dataset before splitting into train and test -> Option CQuick Check:
Encoding before split causes data leakage [OK]
- Encoding before train-test split causing leakage
- Confusing normalization with encoding
- Ignoring missing value handling
Solution
Step 1: Understand overfitting from rare categories
Rare categories have few samples, so their target mean can be noisy and cause overfitting.Step 2: Apply smoothing to reduce noise
Smoothing blends the category mean with the overall mean, weighted by how many samples the category has, reducing noise for rare categories.Final Answer:
Use smoothing by combining category mean with overall mean weighted by category frequency -> Option BQuick Check:
Smoothing balances rare category means with global mean [OK]
- Ignoring rare categories causing noisy means
- Replacing rare categories with constants losing info
- Using one-hot encoding which increases dimensionality
