What if your model could understand categories by their real impact instead of just random numbers?
Why Target encoding in ML Python? - Purpose & Use Cases
Start learning this pattern below
Jump into concepts and practice - no test required
Imagine you have a big table of customer data with categories like 'City' or 'Product Type'. You want to use this data to predict if a customer will buy something. But your computer only understands numbers, not words.
You try to convert these categories into numbers by hand, maybe by assigning 1 to 'New York', 2 to 'London', and so on.
This manual numbering is slow and tricky. It treats categories as if they have order or size, which they don't. Also, if a new city appears later, you have to stop and add it manually. This can cause mistakes and confuse your prediction model.
Target encoding smartly replaces each category with the average outcome (target) for that category. For example, if customers from 'New York' buy 70% of the time, 'New York' becomes 0.7. This way, the model gets meaningful numbers that relate directly to what you want to predict.
city_map = {'New York': 1, 'London': 2, 'Paris': 3}
data['city_num'] = data['city'].map(city_map)mean_target = data.groupby('city')['target'].mean() data['city_enc'] = data['city'].map(mean_target)
Target encoding lets your model learn from categories in a way that captures their true relationship with the goal, improving predictions without complex manual work.
In online shopping, target encoding can turn product categories into numbers that show how likely each product type is to be bought, helping recommenders suggest better items.
Manual category numbering is slow and can mislead models.
Target encoding uses the average target value per category for smarter numbers.
This improves model accuracy and handles new categories gracefully.
Practice
target encoding in machine learning?Solution
Step 1: Understand what target encoding does
Target encoding replaces categories with the average value of the target variable for each category.Step 2: Compare with other options
Normalization scales numbers, missing value removal cleans data, and feature creation combines categories, none of which describe target encoding.Final Answer:
Convert categorical variables into numbers using the average target value -> Option DQuick Check:
Target encoding = average target per category [OK]
- Confusing target encoding with normalization
- Thinking target encoding creates new categories
- Assuming target encoding removes missing data
train_df with categorical column cat_col and target target?Solution
Step 1: Identify correct target encoding method
Target encoding maps each category to the mean target value for that category, done by grouping and mapping.Step 2: Check code correctness
mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target) groups by category, calculates mean target, then maps it back correctly. Other options do not compute mean target per category.Final Answer:
mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target) -> Option AQuick Check:
Group by category and map mean target [OK]
- Using category codes instead of target mean
- Assigning overall mean target to all rows
- Mapping category length instead of target mean
print(test_df['cat_encoded'].tolist())?
import pandas as pd
train_df = pd.DataFrame({'cat_col': ['A', 'B', 'A', 'C'], 'target': [1, 0, 1, 0]})
mean_target = train_df.groupby('cat_col')['target'].mean()
test_df = pd.DataFrame({'cat_col': ['A', 'B', 'C', 'D']})
test_df['cat_encoded'] = test_df['cat_col'].map(mean_target).fillna(0.5)
print(test_df['cat_encoded'].tolist())Solution
Step 1: Calculate mean target per category from training data
'A' has targets [1,1] mean=1.0, 'B' has [0] mean=0.0, 'C' has [0] mean=0.0.Step 2: Map test categories and fill missing
Test categories 'A','B','C' map to 1.0,0.0,0.0 respectively. 'D' is missing, so fillna(0.5) sets it to 0.5.Final Answer:
[1.0, 0.0, 0.0, 0.5] -> Option AQuick Check:
Map known means, fill unknown with 0.5 [OK]
- Not filling missing categories, resulting in NaN
- Using overall mean instead of per-category mean
- Miscomputing mean target values
Solution
Step 1: Understand overfitting cause in target encoding
Overfitting often happens if target encoding uses information from the test set or entire data before splitting.Step 2: Identify mistake in data leakage
Encoding before splitting leaks target info from test data into training, causing overfitting. Other options do not explain this leakage.Final Answer:
You used target encoding on the entire dataset before splitting into train and test -> Option CQuick Check:
Encoding before split causes data leakage [OK]
- Encoding before train-test split causing leakage
- Confusing normalization with encoding
- Ignoring missing value handling
Solution
Step 1: Understand overfitting from rare categories
Rare categories have few samples, so their target mean can be noisy and cause overfitting.Step 2: Apply smoothing to reduce noise
Smoothing blends the category mean with the overall mean, weighted by how many samples the category has, reducing noise for rare categories.Final Answer:
Use smoothing by combining category mean with overall mean weighted by category frequency -> Option BQuick Check:
Smoothing balances rare category means with global mean [OK]
- Ignoring rare categories causing noisy means
- Replacing rare categories with constants losing info
- Using one-hot encoding which increases dimensionality
