How to Use Pipeline with ColumnTransformer in Python sklearn
Use ColumnTransformer to apply different preprocessing steps to specific columns, then include it inside a Pipeline to chain preprocessing and model training. This lets you clean and transform data in one step before fitting a model.
Syntax
The ColumnTransformer applies transformers to specified columns. The Pipeline chains multiple steps like preprocessing and modeling.
Syntax pattern:
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

preprocessor = ColumnTransformer(
    transformers=[
        ('name1', transformer1, [columns1]),
        ('name2', transformer2, [columns2])
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model_instance)
])
```
Here, transformer1 and transformer2 are preprocessing steps like scalers or encoders. model_instance is the machine learning model.
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
```
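To see what the preprocessing step produces before a model is attached, it can help to run the ColumnTransformer on its own. The following is a minimal sketch, assuming a small made-up DataFrame (`df`) that reuses the column names from the snippet above; the values are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'income': [50000, 60000, 80000, 72000],
    'gender': ['M', 'F', 'F', 'M'],
    'city': ['NY', 'LA', 'NY', 'LA'],
})

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city']),
    ])

# fit_transform learns the scaling statistics and categories, then applies them.
transformed = preprocessor.fit_transform(df)

# 2 scaled numeric columns + 2 gender indicators + 2 city indicators.
n_rows, n_cols = transformed.shape
print(n_rows, n_cols)
```

Each transformer only sees the columns assigned to it, and the results are concatenated side by side in the order the transformers are listed.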
Example
This example shows how to preprocess numeric and categorical columns differently using ColumnTransformer inside a Pipeline. It fits a logistic regression model on sample data.
```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'income': [50000, 60000, 80000, 72000, 90000],
    'gender': ['M', 'F', 'F', 'M', 'F'],
    'city': ['NY', 'LA', 'NY', 'LA', 'NY'],
    'target': [0, 1, 0, 1, 0]
})

X = data.drop('target', axis=1)
y = data['target']

# Define column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ])

# Create pipeline with preprocessing and model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train model
pipeline.fit(X_train, y_train)

# Predict and evaluate
preds = pipeline.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f"Accuracy: {acc:.2f}")
```
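Output
Accuracy: 1.00
With only five rows, the test set holds just two samples, so this score is illustrative rather than a meaningful estimate of model quality.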
Common Pitfalls
- Specifying wrong column names or indices in ColumnTransformer causes errors or applies transformations to the wrong columns.
- Leaving the ColumnTransformer outside the Pipeline means preprocessing and modeling run as separate steps, so you lose the automation and must repeat the preprocessing by hand at prediction time.
- Using a transformer that does not match the column's data type (e.g., applying a scaler to string-valued categorical data) causes errors.
- Calling predict before fitting the pipeline on training data raises an error.
```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Wrong: applying scaler to categorical column.
# Fitting this pipeline fails because StandardScaler cannot convert strings to floats.
preprocessor_wrong = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['gender'])  # gender is categorical
    ])

pipeline_wrong = Pipeline(steps=[
    ('preprocessor', preprocessor_wrong),
    ('model', LogisticRegression())
])

# Correct: use OneHotEncoder for categorical
preprocessor_right = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['gender'])
    ])

pipeline_right = Pipeline(steps=[
    ('preprocessor', preprocessor_right),
    ('model', LogisticRegression())
])
```
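The pitfalls listed above only surface when the pipeline is actually used, so it can be instructive to trigger them deliberately. This is a rough sketch with made-up data; `build_pipeline` is a hypothetical helper introduced only for this illustration, and the exact error messages may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data for demonstration.
X = pd.DataFrame({'age': [25, 32, 47, 51], 'gender': ['M', 'F', 'F', 'M']})
y = [0, 1, 0, 1]


def build_pipeline(columns):
    """Hypothetical helper: scale the given columns, then fit a classifier."""
    return Pipeline(steps=[
        ('preprocessor', ColumnTransformer(
            transformers=[('num', StandardScaler(), columns)])),
        ('model', LogisticRegression()),
    ])


errors = {}

# Pitfall: scaler applied to a string-valued categorical column.
try:
    build_pipeline(['gender']).fit(X, y)
except ValueError as exc:
    errors['scaler_on_strings'] = exc

# Pitfall: column name that does not exist ('Age' instead of 'age').
try:
    build_pipeline(['Age']).fit(X, y)
except ValueError as exc:
    errors['missing_column'] = exc

# Pitfall: predicting before the pipeline has been fitted.
try:
    build_pipeline(['age']).predict(X)
except NotFittedError as exc:
    errors['predict_before_fit'] = exc

for name, exc in errors.items():
    print(f"{name}: {type(exc).__name__}")
```

Note that constructing a misconfigured pipeline does not fail by itself; the first two errors appear only once fit is called.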
Quick Reference
Tips for using Pipeline with ColumnTransformer:
- Use ColumnTransformer to apply different preprocessing to different columns.
- Put ColumnTransformer as the first step in your Pipeline.
- Always fit the pipeline on training data, then use it to transform or predict.
- Check column names and data types carefully before assigning transformers.
- Use transformers like StandardScaler for numeric and OneHotEncoder for categorical data.
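One way to follow the tips on column names and data types is to derive the column lists from the DataFrame's dtypes instead of typing them by hand. The sketch below uses pandas `select_dtypes` for that, with hypothetical training data. It also sets `handle_unknown='ignore'` on the encoder, an option not used elsewhere in this article, so that a category unseen during training (here the made-up city 'SF') is encoded as all zeros rather than raising an error at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data.
X_train = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'income': [50000, 60000, 80000, 72000],
    'city': ['NY', 'LA', 'NY', 'LA'],
})
y_train = [0, 1, 0, 1]

# Derive the column lists from the dtypes.
numeric_cols = X_train.select_dtypes(include='number').columns.tolist()
categorical_cols = X_train.select_dtypes(exclude='number').columns.tolist()

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), numeric_cols),
        # Assumption: unseen categories should be ignored, not rejected.
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ])),
    ('classifier', LogisticRegression()),
])

# Fit on training data first, then predict on new data.
pipeline.fit(X_train, y_train)

X_new = pd.DataFrame({'age': [40], 'income': [65000], 'city': ['SF']})
prediction = pipeline.predict(X_new)
print(numeric_cols, categorical_cols, prediction.shape)
```

Selecting by dtype is a convenience, not a guarantee: an integer-coded category would be treated as numeric, so it is still worth inspecting the resulting lists.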
Key Takeaways
Use ColumnTransformer to preprocess different columns with appropriate transformers.
Include ColumnTransformer inside a Pipeline to automate preprocessing and modeling steps.
Always fit the pipeline on training data before predicting or transforming.
Check that transformers match the data type of each column to avoid errors.
Use clear column names or indices when specifying transformers in ColumnTransformer.