
ColumnTransformer for mixed types in ML Python

Introduction

We use ColumnTransformer to apply different data changes to different columns in one step. This helps when data has numbers and words mixed together.

You have a table with some columns as numbers and others as words.
You want to change numbers by scaling and words by turning them into numbers.
You want to prepare data quickly before teaching a computer to learn.
You want to keep your data changes organized and easy to repeat.
Syntax
ML Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['num_column1', 'num_column2']),
        ('cat', OneHotEncoder(), ['cat_column1', 'cat_column2'])
    ]
)

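Each entry is a tuple with a name, a transformer object, and the list of columns it is applied to.

Each transformer is fitted only on its own columns, and the results are joined side by side into one output. Columns that are not listed are dropped by default; pass remainder='passthrough' to keep them. The transformers only run in parallel if you set n_jobs.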
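To make the way outputs are combined concrete, here is a minimal sketch on a small made-up table (the column names and values, including the `id` column, are illustrative assumptions and not part of the original examples). It compares the number of output columns with the default behavior and with `remainder='passthrough'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Made-up table for illustration only
df = pd.DataFrame({
    'age': [20, 30, 40],
    'city': ['Paris', 'London', 'Paris'],
    'id': [1, 2, 3],  # not listed in any transformer below
})

# Default behavior: unlisted columns ('id') are dropped
ct_default = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age']),
        ('cat', OneHotEncoder(), ['city']),
    ]
)
n_cols_default = ct_default.fit_transform(df).shape[1]

# remainder='passthrough' keeps unlisted columns unchanged
ct_keep = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age']),
        ('cat', OneHotEncoder(), ['city']),
    ],
    remainder='passthrough',
)
n_cols_keep = ct_keep.fit_transform(df).shape[1]

# 1 scaled column + 2 one-hot columns, plus 'id' when passed through
print(n_cols_default, n_cols_keep)
```

The one-hot encoder produces one column per distinct city, so the output width depends on the data, not only on the number of input columns.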

Examples
Scale 'age' and 'income' columns, and one-hot encode 'city' column.
ML Python
ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), ['age', 'income']),
        ('encode', OneHotEncoder(), ['city'])
    ]
)
Scale 'height' and encode 'color' with safe handling of new categories.
ML Python
ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['height']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['color'])
    ]
)
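The effect of handle_unknown='ignore' is easiest to see on the encoder alone. The sketch below uses made-up color values (an assumption for illustration): a category that was not present during fitting is encoded as a row of zeros, while the default setting raises an error instead.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: 'green' never appears in the training table
train = pd.DataFrame({'color': ['red', 'blue', 'red']})
new = pd.DataFrame({'color': ['green']})

# With handle_unknown='ignore', an unseen category becomes all zeros
tolerant = OneHotEncoder(handle_unknown='ignore')
tolerant.fit(train)
encoded_new = tolerant.transform(new).toarray()
print(encoded_new)

# With the default (handle_unknown='error'), transform raises ValueError
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True
print('Default encoder raised an error:', raised)
```

An all-zero row means the model sees "none of the known categories", which is usually safer at prediction time than crashing.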
Sample Model

This example shows how to use ColumnTransformer to scale numbers and encode categories before training a logistic regression model. It splits the data, trains, predicts, and shows accuracy. The table has only five rows, so the accuracy value is for illustration only.

ML Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import pandas as pd

# Sample data with mixed types
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'income': [50000, 60000, 80000, 72000, 90000],
    'city': ['New York', 'Paris', 'Paris', 'London', 'New York'],
    'target': [0, 1, 0, 1, 0]
})

X = data.drop('target', axis=1)
y = data['target']

# Define ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
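        # handle_unknown='ignore' avoids an error if a city appears only in the test split
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])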
    ]
)

# Create a pipeline with preprocessing and model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression())])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Print results
print('Predictions:', predictions)
print('Test labels:', y_test.values)
print(f'Test accuracy: {model.score(X_test, y_test):.2f}')
Important Notes

ColumnTransformer keeps your data changes clear and easy to manage.

Column names must match the DataFrame's column names exactly, including upper and lower case. A name that does not exist causes an error when you fit.

Use pipelines to combine preprocessing and model training smoothly.
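As a small illustration of the note about column names, the sketch below (with a made-up one-column table) asks for 'Age' when the table only has 'age'. Fitting fails with a ValueError rather than silently skipping the column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up table for illustration only
df = pd.DataFrame({'age': [25, 32, 47]})

# 'Age' (capital A) does not match the actual column name 'age'
ct = ColumnTransformer(transformers=[('num', StandardScaler(), ['Age'])])

try:
    ct.fit(df)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print('Error:', error_message)
```

Printing `df.columns` before building the transformer is a quick way to catch this kind of mismatch.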

Summary

ColumnTransformer lets you change different columns in different ways at once.

It is useful when your data has both numbers and words.

Use it with pipelines to prepare data and train models easily.