How to Use ColumnTransformer in sklearn with Python
Use ColumnTransformer from sklearn.compose to apply different transformations to specific columns of your data. Define a list of tuples, each with a name, a transformer, and one or more columns, then fit and transform your data with it.
Syntax
The ColumnTransformer constructor takes a list of tuples, each with three parts:
- Name: a string to identify the transformer.
- Transformer: an sklearn transformer like StandardScaler() or OneHotEncoder().
- Columns: the column names or indices to apply the transformer to.
Example syntax:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

column_transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ]
)
```
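Columns can also be selected by integer index, which is the only option when the input is a NumPy array rather than a DataFrame. The following is a minimal sketch of that case; the array values and the transformer names 'std' and 'minmax' are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric array with three columns (illustrative values only).
X = np.array([
    [25.0, 50000.0, 1.0],
    [32.0, 64000.0, 0.0],
    [47.0, 120000.0, 1.0],
])

by_index = ColumnTransformer(
    transformers=[
        ('std', StandardScaler(), [0]),   # standardize column 0
        ('minmax', MinMaxScaler(), [1]),  # rescale column 1 to the range [0, 1]
    ]
)

X_out = by_index.fit_transform(X)
print(X_out.shape)  # (3, 2): column 2 is not listed, so it is dropped by default
```

Output columns follow the order of the transformers list, not the order of the columns in the input.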
Example
This example shows how to preprocess numeric columns with scaling and categorical columns with one-hot encoding using ColumnTransformer. It fits and transforms a sample dataset.
```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample data
data = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [50000, 64000, 120000],
    'gender': ['male', 'female', 'female'],
    'city': ['NY', 'LA', 'NY']
})

# Define ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ]
)

# Fit and transform data
transformed_data = preprocessor.fit_transform(data)

# Show transformed data as array
print(transformed_data)
```
Output
Values are rounded to three decimals. The columns are age, income, gender_female, gender_male, city_LA, city_NY.
[[-1.053 -0.926  0.     1.     0.     1.   ]
 [-0.291 -0.463  1.     0.     1.     0.   ]
 [ 1.344  1.389  1.     0.     0.     1.   ]]
Common Pitfalls
- Column names or indices that do not match the data raise an error or silently transform the wrong columns.
- String column names work only with pandas DataFrames; for NumPy arrays, select columns by integer index.
- For categorical data, forgetting to set handle_unknown='ignore' in OneHotEncoder raises an error when transform encounters a category that was not seen during fit.
- Calling transform before fit or fit_transform raises a NotFittedError.
Example of a common mistake and fix:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Reuses the `data` DataFrame defined in the example above.

# Wrong: the column name does not exist in the DataFrame
try:
    ct_wrong = ColumnTransformer(
        transformers=[('cat', OneHotEncoder(), ['wrong_column'])]
    )
    ct_wrong.fit_transform(data)
except ValueError as e:
    print(f'Error: {e}')

# Right: use a column name that exists
ct_right = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), ['gender'])]
)
ct_right.fit_transform(data)
```
Output (the exact error wording may vary between scikit-learn versions)
Error: A given column is not a column of the dataframe
array([[0., 1.],
       [1., 0.],
       [1., 0.]])
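The other two pitfalls, transforming before fitting and meeting unseen categories, can be reproduced in the same way. This is a small sketch with made-up train and test frames; the names strict and tolerant are illustrative, and sparse_threshold=0 is set only to force a dense array that prints readably.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'SF' appears only in the test set.
train = pd.DataFrame({'city': ['NY', 'LA', 'NY']})
test = pd.DataFrame({'city': ['NY', 'SF']})

strict = ColumnTransformer([('cat', OneHotEncoder(), ['city'])])
tolerant = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])],
    sparse_threshold=0,  # always return a dense array
)

# Pitfall: calling transform before fit
unfitted_raised = False
try:
    strict.transform(test)
except NotFittedError:
    unfitted_raised = True
    print('transform before fit raised NotFittedError')

# Pitfall: default OneHotEncoder rejects categories not seen during fit
unknown_raised = False
strict.fit(train)
try:
    strict.transform(test)
except ValueError as e:
    unknown_raised = True
    print(f'Unseen category error: {e}')

# Fix: handle_unknown='ignore' encodes the unseen category as all zeros
encoded = tolerant.fit(train).transform(test)
print(encoded)
```

The row for 'SF' comes back as all zeros, since it matches neither of the learned categories (LA and NY).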
Quick Reference
| Parameter | Description |
|---|---|
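| transformers | List of (name, transformer, columns) tuples defining the transformations |
| remainder | What to do with columns not listed in transformers: 'drop' (default) discards them, 'passthrough' keeps them unchanged, or pass an estimator to transform them |
| sparse_threshold | Density threshold (default 0.3): if any transformer returns sparse output and the overall density is below this value, the result is a sparse matrix; 0 always returns a dense array |
| n_jobs | Number of parallel jobs used to fit the transformers (default None) |
| transformer_weights | Dict mapping transformer names to multiplicative weights applied to each transformer's output |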
Key Takeaways
- Use ColumnTransformer to apply different preprocessing steps to different columns in one step.
- Always specify correct column names or indices matching your data.
- Fit the ColumnTransformer before transforming data to avoid errors.
- For categorical data, set handle_unknown='ignore' in OneHotEncoder to handle new categories safely.
- Use the remainder parameter to control what happens to columns not listed in transformers.