Target encoding helps turn categories into numbers by using the average of the target values. This makes it easier for models to understand and use categorical data.
Target encoding in ML Python
Start learning this pattern below
Jump into concepts and practice - no test required
from category_encoders import TargetEncoder encoder = TargetEncoder(cols=['category_column']) encoded_data = encoder.fit_transform(X, y)
You need to install the category_encoders package first using pip install category_encoders.
cols specifies which columns to encode. X is your features, and y is the target variable.
from category_encoders import TargetEncoder encoder = TargetEncoder(cols=['color']) encoded_X = encoder.fit_transform(X, y)
encoder = TargetEncoder(cols=['city', 'brand']) encoded_X = encoder.fit_transform(X, y)
encoded_X = encoder.transform(new_X)
This example shows how to use target encoding on a simple dataset with a 'color' category and a binary target. We encode the 'color' column, train a logistic regression model, and check accuracy.
import pandas as pd from category_encoders import TargetEncoder from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score # Sample data with categorical feature and binary target data = pd.DataFrame({ 'color': ['red', 'blue', 'green', 'blue', 'red', 'green', 'red', 'blue'], 'target': [1, 0, 1, 0, 1, 0, 1, 0] }) X = data[['color']] y = data['target'] # Split data X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) # Create and fit target encoder encoder = TargetEncoder(cols=['color']) X_train_encoded = encoder.fit_transform(X_train, y_train) X_test_encoded = encoder.transform(X_test) # Train a simple model model = LogisticRegression() model.fit(X_train_encoded, y_train) # Predict and evaluate preds = model.predict(X_test_encoded) acc = accuracy_score(y_test, preds) print(f"Encoded training data:\n{X_train_encoded}\n") print(f"Predictions: {preds}") print(f"Accuracy: {acc:.2f}")
Target encoding can cause overfitting if not done carefully. Use techniques like cross-validation or smoothing.
Always fit the encoder only on training data, then transform test data to avoid data leakage.
Target encoding works best with categorical features that have a meaningful relationship with the target.
Target encoding converts categories into numbers using the average target value.
It helps models use categorical data without creating many new columns.
Be careful to avoid overfitting by fitting only on training data.
Practice
target encoding in machine learning?Solution
Step 1: Understand what target encoding does
Target encoding replaces categories with the average value of the target variable for each category.Step 2: Compare with other options
Normalization scales numbers, missing value removal cleans data, and feature creation combines categories, none of which describe target encoding.Final Answer:
Convert categorical variables into numbers using the average target value -> Option DQuick Check:
Target encoding = average target per category [OK]
- Confusing target encoding with normalization
- Thinking target encoding creates new categories
- Assuming target encoding removes missing data
train_df with categorical column cat_col and target target?Solution
Step 1: Identify correct target encoding method
Target encoding maps each category to the mean target value for that category, done by grouping and mapping.Step 2: Check code correctness
mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target) groups by category, calculates mean target, then maps it back correctly. Other options do not compute mean target per category.Final Answer:
mean_target = train_df.groupby('cat_col')['target'].mean(); train_df['cat_encoded'] = train_df['cat_col'].map(mean_target) -> Option AQuick Check:
Group by category and map mean target [OK]
- Using category codes instead of target mean
- Assigning overall mean target to all rows
- Mapping category length instead of target mean
print(test_df['cat_encoded'].tolist())?
import pandas as pd
train_df = pd.DataFrame({'cat_col': ['A', 'B', 'A', 'C'], 'target': [1, 0, 1, 0]})
mean_target = train_df.groupby('cat_col')['target'].mean()
test_df = pd.DataFrame({'cat_col': ['A', 'B', 'C', 'D']})
test_df['cat_encoded'] = test_df['cat_col'].map(mean_target).fillna(0.5)
print(test_df['cat_encoded'].tolist())Solution
Step 1: Calculate mean target per category from training data
'A' has targets [1,1] mean=1.0, 'B' has [0] mean=0.0, 'C' has [0] mean=0.0.Step 2: Map test categories and fill missing
Test categories 'A','B','C' map to 1.0,0.0,0.0 respectively. 'D' is missing, so fillna(0.5) sets it to 0.5.Final Answer:
[1.0, 0.0, 0.0, 0.5] -> Option AQuick Check:
Map known means, fill unknown with 0.5 [OK]
- Not filling missing categories, resulting in NaN
- Using overall mean instead of per-category mean
- Miscomputing mean target values
Solution
Step 1: Understand overfitting cause in target encoding
Overfitting often happens if target encoding uses information from the test set or entire data before splitting.Step 2: Identify mistake in data leakage
Encoding before splitting leaks target info from test data into training, causing overfitting. Other options do not explain this leakage.Final Answer:
You used target encoding on the entire dataset before splitting into train and test -> Option CQuick Check:
Encoding before split causes data leakage [OK]
- Encoding before train-test split causing leakage
- Confusing normalization with encoding
- Ignoring missing value handling
Solution
Step 1: Understand overfitting from rare categories
Rare categories have few samples, so their target mean can be noisy and cause overfitting.Step 2: Apply smoothing to reduce noise
Smoothing blends the category mean with the overall mean, weighted by how many samples the category has, reducing noise for rare categories.Final Answer:
Use smoothing by combining category mean with overall mean weighted by category frequency -> Option BQuick Check:
Smoothing balances rare category means with global mean [OK]
- Ignoring rare categories causing noisy means
- Replacing rare categories with constants losing info
- Using one-hot encoding which increases dimensionality
