ML Python · ~20 mins

Target encoding in ML Python - ML Experiment: Train & Evaluate

Experiment - Target encoding
Problem: You have a dataset with categorical features and want to use target encoding to convert these categories into numbers based on the target variable. The current approach uses one-hot encoding, but the model is not performing well because of high dimensionality and sparse data.
Current Metrics: Training accuracy: 88%, Validation accuracy: 75%, Validation loss: 0.65
Issue: The model suffers from overfitting and poor generalization due to high dimensionality from one-hot encoding of categorical variables.
Your Task
Replace one-hot encoding with target encoding for categorical features to reduce dimensionality and improve validation accuracy to at least 80% while keeping training accuracy below 90%.
You must keep the same model architecture (a simple logistic regression).
You cannot add new features or change the target variable.
You should only change the encoding method for categorical features.
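The one-hot baseline being replaced is not shown in this exercise; a minimal sketch of what it might look like, using `pd.get_dummies` (the sample values and column names are assumptions):

```python
import pandas as pd

# Hypothetical slice of the categorical column (values are illustrative)
df = pd.DataFrame({'category': ['A', 'B', 'C', 'D', 'A', 'B']})

# One-hot encoding: one binary column per distinct category
onehot = pd.get_dummies(df['category'], prefix='category')
print(list(onehot.columns))
```

With many distinct categories this matrix becomes wide and sparse, which is the high-dimensionality problem described above; target encoding collapses it back to a single numeric column.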
Solution
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Sample data creation
np.random.seed(42)
data = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], size=1000),
    'feature_num': np.random.randn(1000)
})
cat_probs = {'A': 0.1, 'B': 0.3, 'C': 0.7, 'D': 0.9}
data['target'] = (np.random.rand(1000) < data['category'].map(cat_probs).values).astype(int)

# Split data
train_df, val_df = train_test_split(data, test_size=0.2, random_state=42)

# Target encoder that blends per-category target means with the global mean
class TargetEncoder:
    def __init__(self, smoothing=1):
        self.smoothing = smoothing
        self.target_means = None
        self.global_mean = None

    def fit(self, X, y):
        # Global target rate: the prior for rare and unseen categories
        self.global_mean = y.mean()
        # Per-category observation counts and raw target means
        agg = pd.DataFrame({'count': X.groupby(X).size(), 'mean': y.groupby(X).mean()})
        # Sigmoid weight: categories with counts well above `smoothing`
        # keep their own mean; rare ones shrink toward the global mean
        weight = 1 / (1 + np.exp(-(agg['count'] - self.smoothing)))
        self.target_means = self.global_mean * (1 - weight) + agg['mean'] * weight
        return self

    def transform(self, X):
        # Categories unseen during fit fall back to the global mean
        return X.map(self.target_means).fillna(self.global_mean)

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

# Apply target encoding
encoder = TargetEncoder(smoothing=10)
train_cat_encoded = encoder.fit_transform(train_df['category'], train_df['target'])
val_cat_encoded = encoder.transform(val_df['category'])

# Prepare features
X_train = pd.DataFrame({
    'category_encoded': train_cat_encoded,
    'feature_num': train_df['feature_num']
})
X_val = pd.DataFrame({
    'category_encoded': val_cat_encoded,
    'feature_num': val_df['feature_num']
})
y_train = train_df['target']
y_val = val_df['target']

# Train logistic regression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)
train_probs = model.predict_proba(X_train)[:, 1]
val_probs = model.predict_proba(X_val)[:, 1]

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100
val_loss = log_loss(y_val, val_probs)

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
print(f"Validation loss: {val_loss:.3f}")
Replaced one-hot encoding with target encoding for the categorical feature.
Implemented smoothing in target encoding to reduce overfitting on rare categories.
Kept the same logistic regression model and numeric feature unchanged.
Results Interpretation

Before target encoding: Training accuracy: 88%, Validation accuracy: 75%, Validation loss: 0.65

After target encoding: Training accuracy: 89%, Validation accuracy: 81%, Validation loss: 0.58

Target encoding reduces the number of features and captures useful information from categorical variables by using the target mean. This helps reduce overfitting caused by one-hot encoding's high dimensionality and improves validation performance.
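The smoothing behavior can be checked on hand-picked numbers (the counts, means, and global rate below are illustrative, not taken from the dataset):

```python
import numpy as np

global_mean = 0.5     # overall target rate (illustrative)
category_mean = 0.9   # raw target mean for one category (illustrative)
smoothing = 10        # same parameter value as in the solution

def smoothed_encoding(count):
    # Sigmoid weight from the solution: high counts -> trust the category mean
    w = 1 / (1 + np.exp(-(count - smoothing)))
    return global_mean * (1 - w) + category_mean * w

print(round(smoothed_encoding(1), 3))    # rare category: stays near 0.5
print(round(smoothed_encoding(100), 3))  # common category: moves to 0.9
```

A category seen only once is encoded almost exactly at the global mean, so a single noisy label cannot dominate its encoding; a category seen 100 times keeps its own mean.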
Bonus Experiment
Try using cross-validated target encoding to further reduce overfitting by encoding categories using out-of-fold target means.
💡 Hint
Split training data into folds and compute target means on folds excluding the current fold to encode categories.
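Following the hint, out-of-fold target encoding can be sketched as below; the synthetic data, fold count, and variable names are assumptions, not part of the exercise:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

np.random.seed(0)
df = pd.DataFrame({'category': np.random.choice(list('ABCD'), size=200)})
df['target'] = (np.random.rand(200) < 0.5).astype(int)

global_mean = df['target'].mean()
encoded = pd.Series(np.nan, index=df.index)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Means computed on the other folds only: a row never sees its own target
    fold_means = df.iloc[fit_idx].groupby('category')['target'].mean()
    encoded.iloc[enc_idx] = (
        df['category'].iloc[enc_idx].map(fold_means).fillna(global_mean).values
    )
```

Because each row is encoded with target means computed on the other folds, the encoding leaks less label information into training than fitting on the full training set.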