ML Python · ~20 mins

Target encoding in ML Python - ML Experiment: Train & Evaluate

Experiment - Target encoding
Problem: You have a dataset with categorical features and want to use target encoding to convert these categories into numbers based on the target variable. The current approach uses one-hot encoding, but the model is not performing well because of high dimensionality and sparse data.
Current Metrics: Training accuracy: 88%, Validation accuracy: 75%, Validation loss: 0.65
Issue: The model suffers from overfitting and poor generalization due to high dimensionality from one-hot encoding of categorical variables.
Your Task
Replace one-hot encoding with target encoding for categorical features to reduce dimensionality and improve validation accuracy to at least 80% while keeping training accuracy below 90%.
You must keep the same model architecture (a simple logistic regression).
You cannot add new features or change the target variable.
You should only change the encoding method for categorical features.
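The one-hot baseline being replaced is not shown in this exercise; a minimal sketch of what it might look like, using `pd.get_dummies` (the sample values and column names are assumptions):

```python
import pandas as pd

# Hypothetical slice of the categorical column (values are illustrative)
df = pd.DataFrame({'category': ['A', 'B', 'C', 'D', 'A', 'B']})

# One-hot encoding: one binary column per distinct category
onehot = pd.get_dummies(df['category'], prefix='category')
print(list(onehot.columns))
```

With many distinct categories this matrix becomes wide and sparse, which is the high-dimensionality problem described above; target encoding collapses it back to a single numeric column.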
Solution
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Sample data creation
np.random.seed(42)
data = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C', 'D'], size=1000),
    'feature_num': np.random.randn(1000)
})
cat_probs = {'A': 0.1, 'B': 0.3, 'C': 0.7, 'D': 0.9}
data['target'] = (np.random.rand(1000) < data['category'].map(cat_probs).values).astype(int)

# Split data
train_df, val_df = train_test_split(data, test_size=0.2, random_state=42)

# Target encoder that blends per-category target means with the global mean
class TargetEncoder:
    def __init__(self, smoothing=1):
        self.smoothing = smoothing
        self.target_means = None
        self.global_mean = None

    def fit(self, X, y):
        # Global target rate: the prior for rare and unseen categories
        self.global_mean = y.mean()
        # Per-category observation counts and raw target means
        agg = pd.DataFrame({'count': X.groupby(X).size(), 'mean': y.groupby(X).mean()})
        # Sigmoid weight: categories with counts well above `smoothing`
        # keep their own mean; rare ones shrink toward the global mean
        weight = 1 / (1 + np.exp(-(agg['count'] - self.smoothing)))
        self.target_means = self.global_mean * (1 - weight) + agg['mean'] * weight
        return self

    def transform(self, X):
        # Categories unseen during fit fall back to the global mean
        return X.map(self.target_means).fillna(self.global_mean)

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

# Apply target encoding
encoder = TargetEncoder(smoothing=10)
train_cat_encoded = encoder.fit_transform(train_df['category'], train_df['target'])
val_cat_encoded = encoder.transform(val_df['category'])

# Prepare features
X_train = pd.DataFrame({
    'category_encoded': train_cat_encoded,
    'feature_num': train_df['feature_num']
})
X_val = pd.DataFrame({
    'category_encoded': val_cat_encoded,
    'feature_num': val_df['feature_num']
})
y_train = train_df['target']
y_val = val_df['target']

# Train logistic regression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)
train_probs = model.predict_proba(X_train)[:, 1]
val_probs = model.predict_proba(X_val)[:, 1]

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100
val_loss = log_loss(y_val, val_probs)

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
print(f"Validation loss: {val_loss:.3f}")
Replaced one-hot encoding with target encoding for the categorical feature.
Implemented smoothing in target encoding to reduce overfitting on rare categories.
Kept the same logistic regression model and numeric feature unchanged.
Results Interpretation

Before target encoding: Training accuracy: 88%, Validation accuracy: 75%, Validation loss: 0.65

After target encoding: Training accuracy: 89%, Validation accuracy: 81%, Validation loss: 0.58

Target encoding reduces the number of features and captures useful information from categorical variables by using the target mean. This helps reduce overfitting caused by one-hot encoding's high dimensionality and improves validation performance.
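The smoothing behavior can be checked on hand-picked numbers (the counts, means, and global rate below are illustrative, not taken from the dataset):

```python
import numpy as np

global_mean = 0.5     # overall target rate (illustrative)
category_mean = 0.9   # raw target mean for one category (illustrative)
smoothing = 10        # same parameter value as in the solution

def smoothed_encoding(count):
    # Sigmoid weight from the solution: high counts -> trust the category mean
    w = 1 / (1 + np.exp(-(count - smoothing)))
    return global_mean * (1 - w) + category_mean * w

print(round(smoothed_encoding(1), 3))    # rare category: stays near 0.5
print(round(smoothed_encoding(100), 3))  # common category: moves to 0.9
```

A category seen only once is encoded almost exactly at the global mean, so a single noisy label cannot dominate its encoding; a category seen 100 times keeps its own mean.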
Bonus Experiment
Try using cross-validated target encoding to further reduce overfitting by encoding categories using out-of-fold target means.
💡 Hint
Split training data into folds and compute target means on folds excluding the current fold to encode categories.
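Following the hint, out-of-fold target encoding can be sketched as below; the synthetic data, fold count, and variable names are assumptions, not part of the exercise:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

np.random.seed(0)
df = pd.DataFrame({'category': np.random.choice(list('ABCD'), size=200)})
df['target'] = (np.random.rand(200) < 0.5).astype(int)

global_mean = df['target'].mean()
encoded = pd.Series(np.nan, index=df.index)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Means computed on the other folds only: a row never sees its own target
    fold_means = df.iloc[fit_idx].groupby('category')['target'].mean()
    encoded.iloc[enc_idx] = (
        df['category'].iloc[enc_idx].map(fold_means).fillna(global_mean).values
    )
```

Because each row is encoded with target means computed on the other folds, the encoding leaks less label information into training than fitting on the full training set.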