We use ColumnTransformer to apply different data changes to different columns in one step. This helps when data has numbers and words mixed together.
ColumnTransformer for mixed types in ML Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['num_column1', 'num_column2']),
        ('cat', OneHotEncoder(), ['cat_column1', 'cat_column2'])
    ]
)
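Each entry is a tuple with a name, a transformer object, and the columns it applies to.
Each transformer works only on its own columns, and the results are joined side by side in the order the transformers are listed. Columns that are not listed are dropped by default (remainder='drop').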
ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), ['age', 'income']),
        ('encode', OneHotEncoder(), ['city'])
    ]
)

ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['height']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['color'])
    ]
)

This example shows how to use ColumnTransformer to scale numbers and encode categories before training a logistic regression model. It splits the data, trains the model, predicts, and prints the test accuracy.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Sample data with mixed types
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'income': [50000, 60000, 80000, 72000, 90000],
    'city': ['New York', 'Paris', 'Paris', 'London', 'New York'],
    'target': [0, 1, 0, 1, 0]
})

X = data.drop('target', axis=1)
y = data['target']

# Define ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['city'])
    ]
)

# Create a pipeline with preprocessing and model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Print results
print('Predictions:', predictions)
print('Test labels:', y_test.values)
print(f'Test accuracy: {model.score(X_test, y_test):.2f}')
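With so few rows, new data can easily contain a city the encoder never saw during training. The second syntax example above sets handle_unknown='ignore' for this case. The sketch below compares it with the default encoder. The rows, the "Tokyo" value, and the make_preprocessor helper are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [50000, 60000, 80000],
    "city": ["New York", "Paris", "London"],
})
# "Tokyo" never appears in the training data.
new_rows = pd.DataFrame({"age": [40], "income": [70000], "city": ["Tokyo"]})


def make_preprocessor(encoder):
    # Hypothetical helper for this sketch.
    # sparse_threshold=0 forces a dense NumPy array, which is easier to print.
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["age", "income"]),
            ("cat", encoder, ["city"]),
        ],
        sparse_threshold=0,
    )


# Default encoder: an unseen category raises an error at transform time.
strict = make_preprocessor(OneHotEncoder()).fit(train)
try:
    strict.transform(new_rows)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print("Default encoder:", err)

# handle_unknown='ignore': the unseen category becomes all zeros.
tolerant = make_preprocessor(OneHotEncoder(handle_unknown="ignore")).fit(train)
encoded = tolerant.transform(new_rows)

# Columns 2 onward are the city indicators.
print(encoded[0, 2:])
```

An all-zero row means "none of the known cities", which lets the model still make a prediction for the new row.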
ColumnTransformer keeps your data changes clear and easy to manage.
Always match column names exactly when specifying columns.
Use pipelines to combine preprocessing and model training smoothly.
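The note about matching column names can be checked with a small sketch. The DataFrame and the misspelled name "Age" are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, 47], "income": [50000, 60000, 80000]})

# "Age" does not match the real column name "age" (names are case-sensitive).
ct = ColumnTransformer(transformers=[("num", StandardScaler(), ["Age"])])

try:
    ct.fit(df)
    name_error = None
except ValueError as err:
    name_error = str(err)
    print("Fit failed:", name_error)
```

The error appears when fitting, not when the ColumnTransformer is created, so a typo can go unnoticed until the pipeline runs.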
ColumnTransformer lets you change different columns in different ways at once.
It is useful when your data has both numbers and words.
Use it with pipelines to prepare data and train models easily.
Practice
What is the main purpose of ColumnTransformer in machine learning?
Solution
Step 1: Understand the role of ColumnTransformer
ColumnTransformer allows applying different transformations to different columns, such as scaling numbers and encoding text.
Step 2: Compare with other options
Training models, visualizing data, or splitting data are different tasks not handled by ColumnTransformer.
Final Answer:
To apply different preprocessing steps to different columns in a dataset -> Option B
Quick Check:
ColumnTransformer = Different preprocessing per column [OK]
- Confusing ColumnTransformer with model training
- Thinking it splits data instead of transforming
- Assuming it visualizes data
Which of the following is the correct way to import ColumnTransformer from scikit-learn?
Solution
Step 1: Recall the module for ColumnTransformer
ColumnTransformer is part of the compose module in scikit-learn.
Step 2: Verify other options
The preprocessing, pipeline, and feature_extraction modules do not contain ColumnTransformer.
Final Answer:
from sklearn.compose import ColumnTransformer -> Option D
Quick Check:
ColumnTransformer is in compose module [OK]
- Importing from preprocessing instead of compose
- Confusing pipeline with compose
- Trying to import from feature_extraction
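A quick way to confirm the import location is to try both imports. This is only a sketch, and the alias WrongImport is made up for illustration.

```python
# The class lives in sklearn.compose.
from sklearn.compose import ColumnTransformer

# Importing it from sklearn.preprocessing fails.
try:
    from sklearn.preprocessing import ColumnTransformer as WrongImport
    wrong_import_worked = True
except ImportError as err:
    wrong_import_worked = False
    print("Import failed:", err)

# Shows the module that defines the class.
print(ColumnTransformer.__module__)
```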
What is the output of print(transformed_data)?
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np
X = np.array([[1, 'red'], [2, 'blue'], [3, 'green']])
ct = ColumnTransformer([
    ('num', StandardScaler(), [0]),
    ('cat', OneHotEncoder(), [1])
])
transformed_data = ct.fit_transform(X)
print(transformed_data)
Solution
Step 1: Understand ColumnTransformer setup
Column 0 (numbers) is scaled; column 1 (colors) is one-hot encoded.
Step 2: Predict output structure
Output is a NumPy array combining scaled numeric values and one-hot encoded categorical values.
Final Answer:
A NumPy array with scaled numbers and one-hot encoded colors -> Option A
Quick Check:
Mixed types transformed correctly = scaled + one-hot [OK]
- Expecting original data without transformation
- Thinking StandardScaler will fail on mixed data
- Ignoring one-hot encoding effect
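The shape of the result can be checked with a small variant of the question's code. One assumption is added here: the array is built with dtype=object so the numbers stay numbers, because a plain np.array with mixed values stores every element as a string.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Without dtype=object, NumPy stores every element as a string.
as_strings = np.array([[1, "red"], [2, "blue"], [3, "green"]])
print(as_strings.dtype.kind)  # "U" means Unicode strings

# Assumption for this sketch: dtype=object keeps 1, 2, 3 as integers.
X = np.array([[1, "red"], [2, "blue"], [3, "green"]], dtype=object)

ct = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    ("cat", OneHotEncoder(), [1]),
])
transformed = ct.fit_transform(X)

# One scaled column plus one indicator column per color.
print(type(transformed).__name__, transformed.shape)
print(transformed)
```

Each row holds one scaled number followed by three indicator values, exactly one of which is 1.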
What is wrong with the following code that uses ColumnTransformer?
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np
X = np.array([[1, 'red'], [2, 'blue'], [3, 'green']])
ct = ColumnTransformer([
    ('num', StandardScaler(), [0, 1]),
    ('cat', OneHotEncoder(), [1])
])
transformed_data = ct.fit_transform(X)
Solution
Step 1: Check columns assigned to StandardScaler
StandardScaler is applied to columns 0 and 1, but column 1 contains strings.
Step 2: Understand why this causes an error
StandardScaler cannot convert strings such as 'red' to numbers, so fitting raises a ValueError ("could not convert string to float").
Final Answer:
StandardScaler is applied to a string column causing an error -> Option A
Quick Check:
Scaler on strings = error [OK]
- Applying scaler to categorical columns
- Assuming ColumnTransformer auto-detects types
- Ignoring column indices in transformer
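The failure and its fix can be reproduced with a short sketch. As an assumption for this sketch, the array uses dtype=object so that the numbers stay numbers; the scaler fails on the color column either way.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumption for this sketch: dtype=object keeps 1, 2, 3 as integers.
X = np.array([[1, "red"], [2, "blue"], [3, "green"]], dtype=object)

broken = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),  # column 1 holds colors, not numbers
    ("cat", OneHotEncoder(), [1]),
])
try:
    broken.fit_transform(X)
    scaler_error = None
except ValueError as err:
    scaler_error = str(err)
    print("Fit failed:", scaler_error)

# Fix: give the scaler only the numeric column.
fixed = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    ("cat", OneHotEncoder(), [1]),
])
fixed_output = fixed.fit_transform(X)
print(fixed_output.shape)
```

ColumnTransformer does not check column types for you. It passes whatever columns you list to each transformer.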
You have a dataset with numeric columns ['age', 'income'] and categorical columns ['city', 'gender']. You want to scale numeric columns and one-hot encode categorical columns using ColumnTransformer. Which code snippet correctly sets this up?
Solution
Step 1: Identify correct transformers for each column type
Numeric columns should be scaled with StandardScaler; categorical columns should be one-hot encoded.
Step 2: Match columns to transformers correctly
ColumnTransformer([('num', StandardScaler(), ['age', 'income']), ('cat', OneHotEncoder(), ['city', 'gender'])]) assigns numeric columns to StandardScaler and categorical columns to OneHotEncoder correctly.
Final Answer:
ColumnTransformer([('num', StandardScaler(), ['age', 'income']), ('cat', OneHotEncoder(), ['city', 'gender'])]) -> Option C
Quick Check:
Numeric scaled + categorical one-hot = ColumnTransformer([('num', StandardScaler(), ['age', 'income']), ('cat', OneHotEncoder(), ['city', 'gender'])]) [OK]
- Swapping transformers between numeric and categorical
- Mixing columns in wrong transformer
- Leaving out columns
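The correct setup can be tried on a few rows to confirm that every column is handled. Only the column names come from the question; the row values below are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows; only the column names come from the question.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [50000, 60000, 80000, 72000],
    "city": ["Paris", "London", "Paris", "New York"],
    "gender": ["F", "M", "F", "M"],
})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(), ["city", "gender"]),
])
features = ct.fit_transform(df)

# 2 scaled columns + 3 city indicators + 2 gender indicators = 7 columns.
print(features.shape)
```

If a column were left out of both lists, it would be dropped from the output, and the column count would be smaller than expected.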
