How to Use Pipeline with ColumnTransformer in Python sklearn

Use ColumnTransformer to apply different preprocessing steps to specific columns, then place it inside a Pipeline to chain preprocessing and model training. A single fit call then learns the preprocessing and the model together, and predict applies the same transformations to new data.

Syntax

The ColumnTransformer applies transformers to specified columns. The Pipeline chains multiple steps like preprocessing and modeling.

Syntax pattern:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

preprocessor = ColumnTransformer(
    transformers=[
        ('name1', transformer1, [columns1]),
        ('name2', transformer2, [columns2])
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model_instance)
])

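Here, transformer1 and transformer2 are placeholders for preprocessing steps such as scalers or encoders, columns1 and columns2 are the column names or indices each transformer applies to, and model_instance is the estimator to train. Columns not listed in any transformer are dropped by default (remainder='drop').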

python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['gender', 'city'])
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

Example

This example shows how to preprocess numeric and categorical columns differently using ColumnTransformer inside a Pipeline. It fits a logistic regression model on sample data.

python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'income': [50000, 60000, 80000, 72000, 90000],
    'gender': ['M', 'F', 'F', 'M', 'F'],
    'city': ['NY', 'LA', 'NY', 'LA', 'NY'],
    'target': [0, 1, 0, 1, 0]
})

X = data.drop('target', axis=1)
y = data['target']

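# Define column transformer
# handle_unknown='ignore' keeps transform from failing if the test split
# contains a category that was absent from the training split
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'city'])
    ])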

# Create pipeline with preprocessing and model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train model
pipeline.fit(X_train, y_train)

# Predict and evaluate
preds = pipeline.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f"Accuracy: {acc:.2f}")
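Output
Accuracy: 1.00

With a dataset this small, the test set contains only two rows, so the score is illustrative rather than a meaningful estimate of model quality.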
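Because the preprocessing lives inside the pipeline, new raw rows can be passed straight to predict. The sketch below illustrates this under assumed data: the six training rows, the new rows, and the unseen city 'SF' are invented for the example, and handle_unknown='ignore' is one possible way to cope with categories that were not present during fit.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical training data
train = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 38],
    'income': [50000, 60000, 80000, 72000, 90000, 65000],
    'gender': ['M', 'F', 'F', 'M', 'F', 'M'],
    'city': ['NY', 'LA', 'NY', 'LA', 'NY', 'LA'],
    'target': [0, 1, 0, 1, 0, 1],
})

pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        # Unseen categories are encoded as all zeros instead of raising an error
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'city']),
    ])),
    ('classifier', LogisticRegression()),
])
pipeline.fit(train.drop(columns='target'), train['target'])

# New raw rows, including a city ('SF') that was never seen during fit
new_rows = pd.DataFrame({
    'age': [29, 55],
    'income': [58000, 85000],
    'gender': ['F', 'M'],
    'city': ['SF', 'NY'],
})
new_preds = pipeline.predict(new_rows)
new_proba = pipeline.predict_proba(new_rows)

# Fitted steps remain accessible by the names given in the pipeline
encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
print(encoder.categories_)
```

The scaler statistics and encoder categories come only from the rows passed to fit, which is the main reason to keep preprocessing and model in one object.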

Common Pitfalls

  • Passing column names or indices that do not match the data makes ColumnTransformer fail at fit time, or silently transform the wrong columns when indices are used.
  • Leaving the ColumnTransformer outside the Pipeline means preprocessing and modeling must be fitted and applied separately, which loses the automation and makes it easy to preprocess training and test data inconsistently.
  • Using a transformer that does not match the column's data type (e.g., applying a scaler to string categories) raises an error when fitting.
  • Calling predict before fitting the pipeline on training data raises a NotFittedError.
python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Wrong: applying a scaler to a categorical column.
# Fitting this pipeline on string values such as 'M'/'F' raises a ValueError.
preprocessor_wrong = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['gender'])  # gender is categorical
    ])

pipeline_wrong = Pipeline(steps=[
    ('preprocessor', preprocessor_wrong),
    ('model', LogisticRegression())
])

# Correct: use OneHotEncoder for categorical columns
preprocessor_right = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['gender'])
    ])

pipeline_right = Pipeline(steps=[
    ('preprocessor', preprocessor_right),
    ('model', LogisticRegression())
])
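The pitfalls listed in this section can be reproduced in a few lines. The following sketch is only an illustration: the one-column DataFrame, the build helper, and the misspelled column name 'sex' are hypothetical, and the exception types are those raised by recent scikit-learn releases.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({'gender': ['M', 'F', 'F', 'M']})
y = [0, 1, 1, 0]


def build(transformer, columns):
    """Hypothetical helper: wrap one transformer and a model in a pipeline."""
    return Pipeline(steps=[
        ('preprocessor', ColumnTransformer(
            transformers=[('col', transformer, columns)])),
        ('model', LogisticRegression()),
    ])


errors = {}

# Pitfall: scaling a string column fails when 'M'/'F' cannot be cast to float
try:
    build(StandardScaler(), ['gender']).fit(X, y)
except ValueError as exc:
    errors['scaler_on_strings'] = type(exc).__name__

# Pitfall: a column name that does not exist in the DataFrame
try:
    build(OneHotEncoder(), ['sex']).fit(X, y)
except ValueError as exc:
    errors['missing_column'] = type(exc).__name__

# Pitfall: predicting before the pipeline has been fitted
try:
    build(OneHotEncoder(), ['gender']).predict(X)
except NotFittedError as exc:
    errors['predict_before_fit'] = type(exc).__name__

print(errors)
```

All three problems surface as exceptions rather than silent misbehaviour, so a quick fit and predict on a small sample is a cheap way to catch them early.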

Quick Reference

Tips for using Pipeline with ColumnTransformer:

  • Use ColumnTransformer to apply different preprocessing to different columns.
  • Put ColumnTransformer before the model in your Pipeline, typically as the first step.
  • Always fit the pipeline on training data, then use it to transform or predict.
  • Check column names and data types carefully before assigning transformers.
  • Use transformers like StandardScaler for numeric and OneHotEncoder for categorical data.
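One way to follow the advice about checking data types is to let scikit-learn pick columns by dtype instead of listing names by hand. The sketch below uses make_column_selector from sklearn.compose. The toy DataFrame is an assumption for illustration, and selecting "everything non-numeric" as categorical is a simplification that may not suit every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy data with two numeric and two string columns
df = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'income': [50000, 60000, 80000, 72000],
    'gender': ['M', 'F', 'F', 'M'],
    'city': ['NY', 'LA', 'NY', 'LA'],
})

# Selectors are callables that return matching column names for a DataFrame
num_selector = make_column_selector(dtype_include=np.number)
cat_selector = make_column_selector(dtype_exclude=np.number)

# Calling them directly is a quick way to check what each transformer will receive
num_cols = num_selector(df)
cat_cols = cat_selector(df)
print(num_cols, cat_cols)

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_selector),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_selector),
])
transformed = preprocessor.fit_transform(df)
print(transformed.shape)
```

Selecting by dtype avoids typos in column names, but it relies on the DataFrame dtypes being correct, so numeric codes that really represent categories still need to be assigned explicitly.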

Key Takeaways

  • Use ColumnTransformer to preprocess different columns with appropriate transformers.
  • Include ColumnTransformer inside a Pipeline to automate preprocessing and modeling steps.
  • Always fit the pipeline on training data before predicting or transforming.
  • Check that transformers match the data type of each column to avoid errors.
  • Use clear column names or indices when specifying transformers in ColumnTransformer.