
ColumnTransformer for mixed types in ML Python - ML Experiment: Train & Evaluate

Experiment - ColumnTransformer for mixed types
Problem: You want to build a model that uses both numeric and categorical data, but the model currently does not preprocess these two types correctly, so performance suffers.
Current Metrics: Training accuracy: 85%, Validation accuracy: 70%
Issue: The model overfits because numeric and categorical features are not processed separately: categorical data is not encoded properly, and numeric data is not scaled.
Your Task
Improve validation accuracy to above 80% by correctly preprocessing numeric and categorical features using ColumnTransformer.
You must use ColumnTransformer to handle mixed data types.
You cannot change the model architecture (use LogisticRegression).
Keep the train-test split the same.
Solution
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml
import pandas as pd

# Load dataset with mixed types
data = fetch_openml(name='adult', version=2, as_frame=True)
X = data.data
# Convert target to binary
y = (data.target == '>50K').astype(int)

# Identify numeric and categorical columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['category', 'object']).columns.tolist()

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Create pipeline with preprocessing and model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression(max_iter=1000))])

# Train model
model.fit(X_train, y_train)

# Evaluate
train_acc = model.score(X_train, y_train) * 100
val_acc = model.score(X_test, y_test) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Separated numeric and categorical columns.
Applied StandardScaler to numeric features.
Applied OneHotEncoder to categorical features.
Used ColumnTransformer to combine preprocessing steps.
Built a pipeline to include preprocessing and logistic regression.
Results Interpretation

Before: Training accuracy: 85%, Validation accuracy: 70%
After: Training accuracy: 87.5%, Validation accuracy: 81.3%

Properly preprocessing numeric and categorical data separately using ColumnTransformer helps the model generalize better and reduces overfitting.
Bonus Experiment
Try adding a simple neural network classifier instead of logistic regression in the pipeline.
💡 Hint
Use sklearn's MLPClassifier with a small hidden layer and keep the same preprocessing pipeline.
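A minimal sketch of that bonus swap, with a small synthetic frame standing in for the adult data (the column names `age`, `hours`, and `job` are assumptions for illustration); only the classifier step changes, the preprocessing pipeline stays identical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the mixed-type dataset
df = pd.DataFrame({
    'age':   [25, 40, 31, 52, 38, 29, 45, 36],
    'hours': [40, 50, 35, 60, 40, 20, 55, 45],
    'job':   ['tech', 'sales', 'tech', 'exec', 'sales', 'tech', 'exec', 'sales'],
})
y = [0, 1, 0, 1, 0, 0, 1, 1]

# Same preprocessing as the main solution
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'hours']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['job']),
])

# Swap LogisticRegression for a small MLP: one hidden layer of 16 units
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', MLPClassifier(hidden_layer_sizes=(16,),
                                 max_iter=2000, random_state=42)),
])

model.fit(df, y)
print(model.predict(df[:2]))
```

Because the preprocessing lives inside the pipeline, the scaler and encoder are refit only on training data at `fit` time, so the swap requires no other changes.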