
How to Use ColumnTransformer in sklearn with Python

Use ColumnTransformer from sklearn.compose to apply different transformations to different columns of your data in a single step. Pass it a list of (name, transformer, columns) tuples, then call fit_transform on your data.

Syntax

The ColumnTransformer constructor takes a list of tuples, each with three parts:

  • Name: a string to identify the transformer.
  • Transformer: an sklearn transformer like StandardScaler() or OneHotEncoder().
  • Columns: the column names or indices to apply the transformer to.

Example syntax:

python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

column_transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ]
)
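The columns entry does not have to be a list of names. As a minimal sketch, assuming a plain NumPy array with hypothetical values (two numeric columns followed by one category code), the same transformer can select columns by integer position. The variable names and the `sparse_threshold=0` setting are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical data: columns 0 and 1 are numeric, column 2 is a category code
X = np.array([
    [25, 50000, 0],
    [32, 64000, 1],
    [47, 120000, 1],
])

by_position = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), [0, 1]),  # select by integer index
        ('cat', OneHotEncoder(), [2]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_out = by_position.fit_transform(X)
print(X_out.shape)  # (3, 4): two scaled columns plus two one-hot columns
```

String column names only work when the input is a pandas DataFrame, so integer indices are the usual choice for arrays.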

Example

This example shows how to preprocess numeric columns with scaling and categorical columns with one-hot encoding using ColumnTransformer. It fits and transforms a sample dataset.

python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample data
data = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [50000, 64000, 120000],
    'gender': ['male', 'female', 'female'],
    'city': ['NY', 'LA', 'NY']
})

# Define ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ]
)

# Fit and transform data
transformed_data = preprocessor.fit_transform(data)

# Show transformed data as array
print(transformed_data)
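Output (values rounded to four decimals)
[[-1.0533 -0.9258  0.      1.      0.      1.    ]
 [-0.2906 -0.4629  1.      0.      1.      0.    ]
 [ 1.3439  1.3887  1.      0.      0.      1.    ]]

The columns are scaled age, scaled income, then the one-hot columns for gender (female, male) and city (LA, NY). OneHotEncoder orders the categories alphabetically, which is why the first row (male, NY) is encoded as 0, 1, 0, 1.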

Common Pitfalls

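  • A column name that is not in the DataFrame raises a ValueError at fit time; a wrong integer index can silently transform the wrong column.
  • String column names only work with pandas DataFrames. For NumPy arrays, select columns by integer index, slice, or boolean mask, and pass the same kind of input to transform that you used for fit.
  • OneHotEncoder defaults to handle_unknown='error', so transform raises a ValueError when it meets a category that was not seen during fit. Setting handle_unknown='ignore' encodes such categories as all zeros instead.
  • Calling transform before fit or fit_transform raises a NotFittedError.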

Example of a common mistake and fix:

python
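import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same sample data as in the example above
data = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [50000, 64000, 120000],
    'gender': ['male', 'female', 'female'],
    'city': ['NY', 'LA', 'NY']
})

# Wrong: 'wrong_column' does not exist in the DataFrame
try:
    ct_wrong = ColumnTransformer(
        transformers=[('cat', OneHotEncoder(), ['wrong_column'])]
    )
    ct_wrong.fit_transform(data)
except ValueError as e:
    print(f'Error: {e}')

# Right: use a column name that exists in the DataFrame
ct_right = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), ['gender'])]
)
print(ct_right.fit_transform(data))
Output (the exact error wording may vary between scikit-learn versions)
Error: A given column is not a column of the dataframe
[[0. 1.]
 [1. 0.]
 [1. 0.]]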
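The unseen-category and not-fitted pitfalls from the list above can be reproduced in the same way. This is a small sketch with hypothetical data: the 'SF' value, the `make_ct` helper, and the `sparse_threshold=0` setting (used only to get a printable dense array) are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'city': ['NY', 'LA', 'NY']})
new_rows = pd.DataFrame({'city': ['SF']})  # category absent from train

def make_ct(encoder):
    # Hypothetical helper; sparse_threshold=0 forces a dense result
    return ColumnTransformer([('cat', encoder, ['city'])], sparse_threshold=0)

# Default encoder (handle_unknown='error'): unseen category raises ValueError
strict = make_ct(OneHotEncoder()).fit(train)
try:
    strict.transform(new_rows)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown='ignore': the unseen category becomes a row of zeros
lenient = make_ct(OneHotEncoder(handle_unknown='ignore')).fit(train)
encoded = lenient.transform(new_rows)

# Transforming before fitting raises NotFittedError
try:
    make_ct(OneHotEncoder()).transform(train)
    unfitted_failed = False
except NotFittedError:
    unfitted_failed = True

print(strict_failed, encoded.tolist(), unfitted_failed)
# True [[0.0, 0.0]] True
```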

Quick Reference

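| Parameter | Description |
| --- | --- |
| transformers | List of (name, transformer, columns) tuples defining the transformations |
| remainder | 'drop' (default) drops columns not listed in transformers; 'passthrough' keeps them unchanged |
| sparse_threshold | Density threshold (default 0.3): if any transformer returns sparse output and the overall density is below this value, the combined result is a sparse matrix |
| n_jobs | Number of parallel jobs used to run the transformers |
| transformer_weights | Dict mapping transformer names to multiplicative weights applied to their outputs |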
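To make the remainder parameter concrete, here is a brief sketch with a hypothetical two-column DataFrame that scales only age and compares the default 'drop' with 'passthrough'. The variable names are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'age': [25, 32, 47], 'income': [50000, 64000, 120000]})

# Default remainder='drop': the unlisted income column is discarded
dropped = ColumnTransformer([('num', StandardScaler(), ['age'])])
# remainder='passthrough': income is appended unchanged after the scaled age
kept = ColumnTransformer(
    [('num', StandardScaler(), ['age'])],
    remainder='passthrough',
)

out_dropped = dropped.fit_transform(df)
out_kept = kept.fit_transform(df)

print(out_dropped.shape)  # (3, 1)
print(out_kept.shape)     # (3, 2)
```

Passed-through columns are placed to the right of the transformed ones, so the column order of the output can differ from the input.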

Key Takeaways

  • Use ColumnTransformer to apply different preprocessing steps to different columns in one step.
  • Always specify column names or indices that match your data.
  • Fit the ColumnTransformer before transforming data to avoid errors.
  • For categorical data, set handle_unknown='ignore' in OneHotEncoder to handle new categories safely.
  • Use the remainder parameter to control what happens to columns not listed in transformers.