How to Use ColumnTransformer in sklearn with Python

Use ColumnTransformer from sklearn.compose to apply different transformations to specific columns of your data. Define a list of tuples with a name, transformer, and column(s), then fit and transform your data with it.

Syntax

The ColumnTransformer constructor takes a list of tuples, each with three parts:

Name: a string to identify the transformer.
Transformer: an sklearn transformer like StandardScaler() or OneHotEncoder().
Columns: the column names or indices to apply the transformer to.

Example syntax:

python

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

column_transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ]
)
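
Columns can also be selected by integer position, which is the form to use when the input is a NumPy array rather than a DataFrame. The following is a minimal sketch under that assumption; the array values, the transformer names 'std' and 'minmax', and the choice of MinMaxScaler for the last column are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical numeric array with three columns and no column names
X = np.array([
    [25.0, 50000.0, 1.0],
    [32.0, 64000.0, 3.0],
    [47.0, 120000.0, 2.0],
])

by_index = ColumnTransformer(
    transformers=[
        ('std', StandardScaler(), [0, 1]),   # integer positions instead of names
        ('minmax', MinMaxScaler(), [2]),     # rescales column 2 to the range [0, 1]
    ]
)

X_out = by_index.fit_transform(X)
print(X_out.shape)  # (3, 3)
```

The output columns follow the order of the transformers list, not the original column order.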

Example

This example shows how to preprocess numeric columns with scaling and categorical columns with one-hot encoding using ColumnTransformer. It fits and transforms a sample dataset.

python

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample data
data = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [50000, 64000, 120000],
    'gender': ['male', 'female', 'female'],
    'city': ['NY', 'LA', 'NY']
})

# Define ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ]
)

# Fit and transform data
transformed_data = preprocessor.fit_transform(data)

# Show transformed data as array
print(transformed_data)

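Output

[[-1.0533 -0.9258  0.  1.  0.  1.]
 [-0.2906 -0.4629  1.  0.  1.  0.]
 [ 1.3439  1.3887  1.  0.  0.  1.]]

Values are rounded to four decimals here. The columns are, in order: scaled age, scaled income, gender=female, gender=male, city=LA, city=NY. OneHotEncoder sorts the categories of each column alphabetically.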

Common Pitfalls

A column name or index that does not match the data raises an error or, worse, silently transforms the wrong column.
String column names only work when the input is a pandas DataFrame; for NumPy arrays, select columns by integer index.
OneHotEncoder defaults to handle_unknown='error', so a category that was not seen during fit raises an error at transform time unless you set handle_unknown='ignore'.
Calling transform before fit or fit_transform raises a NotFittedError.

Example of a common mistake and fix:

python

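import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same sample data as in the example above
data = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [50000, 64000, 120000],
    'gender': ['male', 'female', 'female'],
    'city': ['NY', 'LA', 'NY']
})

# Wrong: 'wrong_column' does not exist in the DataFrame
try:
    ct_wrong = ColumnTransformer(
        transformers=[('cat', OneHotEncoder(), ['wrong_column'])]
    )
    ct_wrong.fit_transform(data)
except ValueError as e:
    print(f'Error: {e}')

# Right: use a column name that exists in the DataFrame
ct_right = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), ['gender'])]
)
print(ct_right.fit_transform(data))

Output

Error: A given column is not a column of the dataframe
[[0. 1.]
 [1. 0.]
 [1. 0.]]

The exact wording of the error message may differ between scikit-learn versions. The two output columns are gender=female and gender=male, in that order.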
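
The handle_unknown pitfall only shows up when the data passed to transform contains a category that was absent during fit. The sketch below illustrates that case with a hypothetical train/new split; the frames train and new, the names strict and lenient, and the 'SF' value are made up for illustration. sparse_threshold=0 is set only so the small result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'city': ['NY', 'LA', 'NY']})
new = pd.DataFrame({'city': ['LA', 'SF']})  # 'SF' never appeared during fit

# Default encoder (handle_unknown='error'): the unseen category raises
strict = ColumnTransformer([('cat', OneHotEncoder(), ['city'])])
strict.fit(train)
strict_failed = False
try:
    strict.transform(new)
except ValueError as e:
    strict_failed = True
    print(f'Error: {e}')

# handle_unknown='ignore': the unseen category is encoded as all zeros
lenient = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])],
    sparse_threshold=0,  # always return a dense array
)
lenient.fit(train)
encoded = lenient.transform(new)
print(encoded)  # the second row is all zeros for the unseen 'SF'
```

An all-zero row means the model downstream sees "none of the known categories", which is usually preferable to a failure at prediction time.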

Quick Reference

Parameter	Description
transformers	List of tuples (name, transformer, columns) defining transformations
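remainder	'drop' (default) drops columns not listed in transformers; 'passthrough' keeps them unchanged; an estimator can also be passed to transform them
sparse_threshold	Density threshold (default 0.3): if any transformer returns sparse output, the combined result is sparse when its overall density is below this value; 0 always returns a dense array
n_jobs	Number of parallel jobs to run (default None)
transformer_weights	Dict mapping transformer names to multiplicative weights applied to each transformer's output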
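
To make the effect of remainder concrete, here is a small sketch that fits the same transformer twice, once with the default and once with remainder='passthrough'. The 'visits' column and the variable names dropped_out and kept_out are hypothetical and added only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [50000, 64000, 120000],
    'visits': [3, 1, 7],  # hypothetical extra column, not listed in transformers
})

# Default remainder='drop': only the scaled 'age' column survives
dropped = ColumnTransformer([('num', StandardScaler(), ['age'])])
dropped_out = dropped.fit_transform(data)
print(dropped_out.shape)  # (3, 1)

# remainder='passthrough': unlisted columns are appended unchanged
kept = ColumnTransformer(
    [('num', StandardScaler(), ['age'])],
    remainder='passthrough',
)
kept_out = kept.fit_transform(data)
print(kept_out.shape)  # (3, 3)
```

The passed-through columns are placed after the transformed ones, so here the output order is scaled age, income, visits.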

Key Takeaways

Use ColumnTransformer to apply different preprocessing steps to different columns in one step.

Always specify correct column names or indices matching your data.

Fit the ColumnTransformer before transforming data to avoid errors.

For categorical data, set handle_unknown='ignore' in OneHotEncoder to handle new categories safely.

Use the remainder parameter to control what happens to columns not listed in transformers.