Custom transformers in ML Python - ML Experiment: Train & Evaluate

Experiment - Custom transformers
Problem: You want to preprocess data for a machine learning model by creating a custom transformer that scales numerical features and encodes categorical features. The current pipeline uses separate steps but is not reusable or clean.
Current Metrics: Pipeline runs but is hard to maintain and reuse. No accuracy metric yet as preprocessing is manual and separate.
Issue: The current preprocessing is not modular or reusable. It mixes scaling and encoding outside a single transformer, making the pipeline complex and error-prone.
Your Task
Create a custom transformer class that combines scaling of numerical features and one-hot encoding of categorical features into one reusable transformer. Use it in a pipeline and verify it transforms data correctly.
Use scikit-learn's TransformerMixin and BaseEstimator to build the custom transformer.
Do not use existing combined transformers like ColumnTransformer directly for this task.
The transformer must implement fit and transform methods.
Solution
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.num_cols = None
        self.cat_cols = None
        self.scaler = StandardScaler()
        # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
        self.encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

    def fit(self, X, y=None):
        # Identify numerical and categorical columns
        self.num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
        self.cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
        # Fit scaler on numerical columns
        if self.num_cols:
            self.scaler.fit(X[self.num_cols])
        # Fit encoder on categorical columns
        if self.cat_cols:
            self.encoder.fit(X[self.cat_cols])
        return self

    def transform(self, X):
        X_num = self.scaler.transform(X[self.num_cols]) if self.num_cols else np.empty((len(X), 0))
        X_cat = self.encoder.transform(X[self.cat_cols]) if self.cat_cols else np.empty((len(X), 0))
        # Combine numerical and categorical features
        return np.hstack([X_num, X_cat])

# Example usage:
if __name__ == '__main__':
    data = pd.DataFrame({
        'age': [25, 32, 47, 51],
        'income': [50000, 64000, 120000, 110000],
        'city': ['New York', 'Paris', 'Paris', 'London'],
        'gender': ['M', 'F', 'F', 'M']
    })

    transformer = CustomTransformer()
    transformer.fit(data)
    transformed_data = transformer.transform(data)

    print('Original data:')
    print(data)
    print('\nTransformed data:')
    print(transformed_data)
Created a CustomTransformer class inheriting from BaseEstimator and TransformerMixin.
Inside the transformer, identified numerical and categorical columns automatically.
Used StandardScaler to scale numerical columns and OneHotEncoder to encode categorical columns.
Implemented fit method to fit scaler and encoder on training data.
Implemented transform method to apply scaling and encoding and combine results.
Provided example usage showing original and transformed data.
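The task also asks for the transformer to be used in a pipeline, while the example usage calls fit and transform directly. A minimal sketch of that step follows. It assumes a compact variant of the transformer (the name ScaleAndEncode is hypothetical), made-up labels y, and LogisticRegression as a stand-in model; none of these come from the original solution. It also assumes the data has at least one numerical and one categorical column. The encoder output is made dense with .toarray() so the sketch does not depend on the sparse/sparse_output argument.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ScaleAndEncode(BaseEstimator, TransformerMixin):
    """Hypothetical compact variant of CustomTransformer for this sketch."""

    def fit(self, X, y=None):
        # Learned state gets a trailing underscore (scikit-learn convention).
        self.num_cols_ = X.select_dtypes(include=[np.number]).columns.tolist()
        self.cat_cols_ = X.select_dtypes(exclude=[np.number]).columns.tolist()
        self.scaler_ = StandardScaler().fit(X[self.num_cols_])
        self.encoder_ = OneHotEncoder(handle_unknown="ignore").fit(X[self.cat_cols_])
        return self

    def transform(self, X):
        X_num = self.scaler_.transform(X[self.num_cols_])
        # OneHotEncoder returns a sparse matrix by default; densify before stacking.
        X_cat = self.encoder_.transform(X[self.cat_cols_]).toarray()
        return np.hstack([X_num, X_cat])


data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [50000, 64000, 120000, 110000],
    "city": ["New York", "Paris", "Paris", "London"],
    "gender": ["M", "F", "F", "M"],
})
y = [0, 0, 1, 1]  # made-up labels, for illustration only

pipe = Pipeline([
    ("prep", ScaleAndEncode()),
    ("model", LogisticRegression()),
])
pipe.fit(data, y)

features = pipe.named_steps["prep"].transform(data)
print(features.shape)  # 2 scaled columns + 3 city columns + 2 gender columns
print(pipe.predict(data).shape)
```

Because the preprocessing lives inside the pipeline, calling pipe.predict on new rows reuses the scaling statistics and categories learned during pipe.fit.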
Results Interpretation

Before: Separate scaling and encoding steps outside a single transformer, making the pipeline complex and hard to reuse.

After: A single custom transformer handles both scaling and encoding, producing a clean numeric array ready for modeling.

Custom transformers let you bundle multiple preprocessing steps into one reusable unit. This makes your machine learning pipelines cleaner, easier to maintain, and less error-prone.
Bonus Experiment
Extend the custom transformer to handle missing values by imputing the mean for numerical columns and the most frequent category for categorical columns before scaling and encoding.
💡 Hint
Use SimpleImputer from sklearn.impute inside your custom transformer to fill missing values during fit and transform.
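
One possible shape for this extension is sketched below; it is not the only way to solve the bonus task. The class name ImputingTransformer, the helper _cat, and the tiny data frame are hypothetical. The sketch assumes missing values are marked with np.nan and that the data has at least one numerical and one categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ImputingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical name: impute first, then scale and encode."""

    def _cat(self, X):
        # Cast to object dtype so np.nan is the missing-value marker.
        return X[self.cat_cols_].astype(object)

    def fit(self, X, y=None):
        self.num_cols_ = X.select_dtypes(include=[np.number]).columns.tolist()
        self.cat_cols_ = X.select_dtypes(exclude=[np.number]).columns.tolist()
        # The imputers learn their fill values from the training data only.
        self.num_imputer_ = SimpleImputer(strategy="mean").fit(X[self.num_cols_])
        self.cat_imputer_ = SimpleImputer(strategy="most_frequent").fit(self._cat(X))
        # Scaler and encoder are fitted on the already-imputed columns.
        self.scaler_ = StandardScaler().fit(
            self.num_imputer_.transform(X[self.num_cols_])
        )
        self.encoder_ = OneHotEncoder(handle_unknown="ignore").fit(
            self.cat_imputer_.transform(self._cat(X))
        )
        return self

    def transform(self, X):
        X_num = self.scaler_.transform(self.num_imputer_.transform(X[self.num_cols_]))
        X_cat = self.encoder_.transform(self.cat_imputer_.transform(self._cat(X)))
        return np.hstack([X_num, X_cat.toarray()])


train = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0],
    "city": ["Paris", np.nan, "Paris", "London"],
})
out = ImputingTransformer().fit_transform(train)
print(out.shape)
print(out[1])  # the row whose age and city were both missing
```

The second row receives the mean age (which scales to 0) and the most frequent city, so no NaN reaches the scaler or the encoder.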

Practice

1. What is the main purpose of creating a custom transformer in machine learning pipelines?
easy
A. To train a machine learning model directly
B. To define a reusable data processing step with fit and transform methods
C. To visualize data distributions
D. To store the final predictions of a model

Solution

  1. Step 1: Understand the role of transformers

    Transformers process data by learning parameters in fit and applying changes in transform.
  2. Step 2: Identify the purpose of custom transformers

    Custom transformers let you create your own data processing steps reusable in pipelines.
  3. Final Answer:

    To define a reusable data processing step with fit and transform methods -> Option B
  4. Quick Check:

    Custom transformer = reusable data step [OK]
Hint: Custom transformers handle data prep, not model training [OK]
Common Mistakes:
  • Confusing transformers with models
  • Thinking transformers visualize data
  • Assuming transformers store predictions
2. Which of the following is the correct way to start defining a custom transformer class in Python using scikit-learn?
easy
A. class MyTransformer(Pipeline):
B. class MyTransformer(Model):
C. class MyTransformer(BaseEstimator, TransformerMixin):
D. def MyTransformer():

Solution

  1. Step 1: Recall inheritance for custom transformers

    Custom transformers inherit from BaseEstimator, which provides get_params and set_params, and from TransformerMixin, which provides fit_transform. You still write fit and transform yourself.
  2. Step 2: Match correct class definition syntax

    class MyTransformer(BaseEstimator, TransformerMixin): correctly shows class inheritance from BaseEstimator and TransformerMixin.
  3. Final Answer:

    class MyTransformer(BaseEstimator, TransformerMixin): -> Option C
  4. Quick Check:

    Inheritance from BaseEstimator and TransformerMixin = class MyTransformer(BaseEstimator, TransformerMixin): [OK]
Hint: Custom transformers inherit BaseEstimator and TransformerMixin [OK]
Common Mistakes:
  • Using Model or Pipeline as base classes
  • Defining transformer as a function
  • Missing inheritance entirely
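
As a small illustration of what each base class contributes, here is a sketch with a hypothetical Clip transformer (the class and its upper parameter are invented for this example). fit and transform are written by hand; get_params and fit_transform are inherited.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Clip(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that caps values at an upper limit."""

    def __init__(self, upper=10):
        self.upper = upper  # store constructor arguments unchanged

    def fit(self, X, y=None):
        return self  # nothing to learn; written by hand, not inherited

    def transform(self, X):
        return np.minimum(X, self.upper)  # also written by hand


clip = Clip(upper=3)
print(clip.get_params())                     # inherited from BaseEstimator
print(clip.fit_transform(np.array([1, 5])))  # inherited from TransformerMixin
```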
3. Given this custom transformer code snippet, what will print(transformed_data) output?
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class AddConstant(BaseEstimator, TransformerMixin):
    def __init__(self, constant=1):
        self.constant = constant
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X + self.constant

X = np.array([[1, 2], [3, 4]])
transformer = AddConstant(constant=5)
transformed_data = transformer.fit_transform(X)
print(transformed_data)
medium
A. [[6 7] [8 9]]
B. [[1 2] [3 4]]
C. [[5 5] [5 5]]
D. Error: fit_transform method not defined

Solution

  1. Step 1: Understand transform method behavior

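    The transform method adds the constant (5) to every element in X. fit_transform is inherited from TransformerMixin and calls fit, then transform, so the call does not raise an error.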
  2. Step 2: Calculate transformed data

    Original X is [[1,2],[3,4]]. Adding 5 gives [[6,7],[8,9]].
  3. Final Answer:

    [[6 7] [8 9]] -> Option A
  4. Quick Check:

    Adding constant 5 to X = [[6 7] [8 9]] [OK]
Hint: transform adds constant to all elements [OK]
Common Mistakes:
  • Thinking fit_transform is missing
  • Forgetting to add constant
  • Confusing output with original data
4. What is wrong with this custom transformer code?
from sklearn.base import BaseEstimator, TransformerMixin

class MultiplyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, factor=2):
        self.factor = factor
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X * self.factor

transformer = MultiplyTransformer(factor=3)
result = transformer.transform([1, 2, 3])
print(result)
medium
A. transform method should convert input to numpy array before multiplying
B. fit method is missing a return statement
C. factor should be a list, not an int
D. Class should inherit from Pipeline, not BaseEstimator

Solution

  1. Step 1: Check input type handling in transform

    The input is a Python list, and multiplying a list by an int repeats the list ([1, 2, 3] * 3 gives [1, 2, 3, 1, 2, 3, 1, 2, 3]) instead of multiplying element-wise.
  2. Step 2: Fix transform to convert input to numpy array

    Converting input to numpy array allows element-wise multiplication as intended.
  3. Final Answer:

    transform method should convert input to numpy array before multiplying -> Option A
  4. Quick Check:

    List * int repeats list, need numpy array for element-wise multiply [OK]
Hint: Use numpy arrays for element-wise math in transform [OK]
Common Mistakes:
  • Assuming list * int does element-wise multiply
  • Thinking fit lacks a return statement (it returns self)
  • Wrong base class inheritance
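
A minimal sketch of the fix described in the solution, using np.asarray as one way to do the conversion:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MultiplyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, factor=2):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # np.asarray turns lists into arrays, so * multiplies element-wise.
        return np.asarray(X) * self.factor


repeated = [1, 2, 3] * 3
scaled = MultiplyTransformer(factor=3).transform([1, 2, 3])
print(repeated)  # [1, 2, 3, 1, 2, 3, 1, 2, 3] (list repetition)
print(scaled)    # [3 6 9] (element-wise multiplication)
```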
5. You want to create a custom transformer that replaces missing values in a dataset with the median of each column, then scales the data by dividing by the max value per column. Which approach correctly combines these steps in one transformer?
hard
A. In fit, replace missing values; in transform, compute medians and max values
B. Use two separate transformers instead of one custom transformer
C. Only implement transform method to do all steps without fit
D. In fit, compute medians and max values; in transform, replace missing with medians and divide by max values

Solution

  1. Step 1: Understand fit and transform roles

    fit calculates statistics (median, max) from training data; transform applies these to new data.
  2. Step 2: Apply correct sequence in methods

    Option D computes the medians and max values in fit, then replaces missing values and divides by the max values in transform. The statistics learned from the training data are therefore reused unchanged on new data.
  3. Final Answer:

    In fit, compute medians and max values; in transform, replace missing with medians and divide by max values -> Option D
  4. Quick Check:

    fit learns stats, transform applies them [OK]
Hint: fit learns stats; transform applies them to data [OK]
Common Mistakes:
  • Doing data replacement in fit instead of transform
  • Skipping fit method
  • Using separate transformers unnecessarily
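
The approach in option D could be sketched as follows. The class name MedianFillMaxScale and the sample arrays are hypothetical, and the sketch assumes missing values are np.nan and that no column has a maximum of zero.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MedianFillMaxScale(BaseEstimator, TransformerMixin):
    """Hypothetical name for the transformer described in question 5."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Statistics come from the training data and ignore missing entries.
        self.medians_ = np.nanmedian(X, axis=0)
        self.maxes_ = np.nanmax(X, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Replace each NaN with the stored median of its column, then scale.
        filled = np.where(np.isnan(X), self.medians_, X)
        return filled / self.maxes_


train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 40.0]])
new = np.array([[np.nan, 80.0]])

scaler = MedianFillMaxScale().fit(train)
train_out = scaler.transform(train)
new_out = scaler.transform(new)  # uses the training median and max, not its own
print(train_out)
print(new_out)
```

The new row is filled with the training median (2.0) and divided by the training maxima (3.0 and 40.0), which is why its second value can exceed 1.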