
Custom transformers in ML Python - ML Experiment: Train & Evaluate

Experiment - Custom transformers
Problem: You want to preprocess data for a machine learning model by creating a custom transformer that scales numerical features and encodes categorical features. The current pipeline uses separate steps and is neither reusable nor clean.
Current Metrics: The pipeline runs but is hard to maintain and reuse. There is no accuracy metric yet, as preprocessing is manual and separate.
Issue: The current preprocessing is not modular or reusable. Scaling and encoding happen outside a single transformer, making the pipeline complex and error-prone.
Your Task
Create a custom transformer class that combines scaling of numerical features and one-hot encoding of categorical features into one reusable transformer. Use it in a pipeline and verify it transforms data correctly.
Use scikit-learn's TransformerMixin and BaseEstimator to build the custom transformer.
Do not use existing combined transformers like ColumnTransformer directly for this task.
The transformer must implement fit and transform methods.
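Before writing the full transformer, it helps to see the bare shape scikit-learn expects. A minimal sketch of the fit/transform contract (`SkeletonTransformer` is an illustrative name, and the pass-through `transform` is a placeholder):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class SkeletonTransformer(BaseEstimator, TransformerMixin):
    """Minimal shape of a scikit-learn custom transformer."""

    def fit(self, X, y=None):
        # Learn any state from X here (column names, statistics, ...).
        # fit must return self so the transformer chains inside a Pipeline.
        return self

    def transform(self, X):
        # Apply the learned transformation; here just a pass-through placeholder.
        return X
```

Inheriting from TransformerMixin also gives you `fit_transform` for free, built from these two methods.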
Solution
ML Python
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.num_cols = None
        self.cat_cols = None
        self.scaler = StandardScaler()
        # sparse_output requires scikit-learn >= 1.2; older versions used sparse=False
        self.encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

    def fit(self, X, y=None):
        # Identify numerical and categorical columns
        self.num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
        self.cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
        # Fit scaler on numerical columns
        if self.num_cols:
            self.scaler.fit(X[self.num_cols])
        # Fit encoder on categorical columns
        if self.cat_cols:
            self.encoder.fit(X[self.cat_cols])
        return self

    def transform(self, X):
        X_num = self.scaler.transform(X[self.num_cols]) if self.num_cols else np.empty((len(X), 0))
        X_cat = self.encoder.transform(X[self.cat_cols]) if self.cat_cols else np.empty((len(X), 0))
        # Combine numerical and categorical features
        return np.hstack([X_num, X_cat])

# Example usage:
if __name__ == '__main__':
    data = pd.DataFrame({
        'age': [25, 32, 47, 51],
        'income': [50000, 64000, 120000, 110000],
        'city': ['New York', 'Paris', 'Paris', 'London'],
        'gender': ['M', 'F', 'F', 'M']
    })

    transformer = CustomTransformer()
    transformer.fit(data)
    transformed_data = transformer.transform(data)

    print('Original data:')
    print(data)
    print('\nTransformed data:')
    print(transformed_data)
Created a CustomTransformer class inheriting from BaseEstimator and TransformerMixin.
Inside the transformer, identified numerical and categorical columns automatically.
Used StandardScaler to scale numerical columns and OneHotEncoder to encode categorical columns.
Implemented fit method to fit scaler and encoder on training data.
Implemented transform method to apply scaling and encoding and combine results.
Provided example usage showing original and transformed data.
Results Interpretation

Before: Separate scaling and encoding steps outside a single transformer, making the pipeline complex and hard to reuse.

After: A single custom transformer handles both scaling and encoding, producing a clean numeric array ready for modeling.

Custom transformers let you bundle multiple preprocessing steps into one reusable unit. This makes your machine learning pipelines cleaner, easier to maintain, and less error-prone.
Bonus Experiment
Extend the custom transformer to handle missing values by imputing the mean for numerical columns and the most frequent category for categorical columns before scaling and encoding.
💡 Hint
Use SimpleImputer from sklearn.impute inside your custom transformer to fill missing values during fit and transform.