
How to Create New Features from Existing Ones in Python with sklearn

You can create new features from existing ones in Python by applying mathematical operations or combining columns using pandas. For machine learning pipelines, sklearn's FunctionTransformer or custom transformers help automate this process.

📐

Syntax

To create new features from existing data, you typically use pandas operations or sklearn transformers.

pandas: Use column operations like addition, multiplication, or apply functions.
FunctionTransformer: Wrap a custom function to transform data inside sklearn pipelines.

python

from sklearn.preprocessing import FunctionTransformer
import pandas as pd

def add_features(X):
    X_new = X.copy()
    X_new['feature_sum'] = X_new['feature1'] + X_new['feature2']
    X_new['feature_ratio'] = X_new['feature1'] / (X_new['feature2'] + 1e-5)
    return X_new

transformer = FunctionTransformer(add_features)
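
The pandas side of the syntax can be sketched in the same way. The snippet below is a minimal illustration, assuming a toy DataFrame with hypothetical `price` and `quantity` columns that are not part of the original example. It uses `assign`, which returns a new DataFrame instead of changing the input.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "quantity": [3, 2, 5]})

# assign() returns a new DataFrame and leaves df untouched.
features = df.assign(
    # Column arithmetic: element-wise multiplication of two columns.
    revenue=df["price"] * df["quantity"],
    # apply(): run a Python function on every value of one column.
    price_bucket=df["price"].apply(lambda p: "high" if p >= 3.5 else "low"),
)
print(features)
```

Column arithmetic is vectorized and usually the faster choice; `apply` is mainly useful when the logic cannot be written as column operations.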

💻

Example

This example shows how to create new features by adding and dividing existing columns using pandas, then using sklearn's FunctionTransformer to apply the transformation.

python

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Sample data
data = pd.DataFrame({
    'feature1': [10, 20, 30],
    'feature2': [1, 2, 3]
})

# Function to create new features
def create_new_features(X):
    X_new = X.copy()
    X_new['sum'] = X_new['feature1'] + X_new['feature2']
    X_new['ratio'] = X_new['feature1'] / (X_new['feature2'] + 1e-5)
    return X_new

# Apply function directly
new_data = create_new_features(data)
print(new_data)

# Using FunctionTransformer in sklearn
# (it is stateless, so fit learns nothing; fit_transform keeps the usual sklearn call pattern)
transformer = FunctionTransformer(create_new_features)
transformed = transformer.fit_transform(data)
print(transformed)

Output

   feature1  feature2  sum     ratio
0        10         1   11  9.999900
1        20         2   22  9.999950
2        30         3   33  9.999967
   feature1  feature2  sum     ratio
0        10         1   11  9.999900
1        20         2   22  9.999950
2        30         3   33  9.999967
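
To show how the same function fits into a full machine learning pipeline, here is a minimal sketch that chains the transformer with a model. The training rows, the target values, and the choice of `LinearRegression` are assumptions made for illustration; they do not come from the example above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def create_new_features(X):
    X_new = X.copy()
    X_new["sum"] = X_new["feature1"] + X_new["feature2"]
    X_new["ratio"] = X_new["feature1"] / (X_new["feature2"] + 1e-5)
    return X_new

# Made-up training data; the target is simply feature1 + feature2.
train = pd.DataFrame({
    "feature1": [10, 20, 30, 40, 50, 60],
    "feature2": [1, 2, 3, 5, 4, 8],
})
y_train = [11.0, 22.0, 33.0, 45.0, 54.0, 68.0]
test = pd.DataFrame({"feature1": [70], "feature2": [6]})

# The pipeline runs the feature step before the model on every call.
pipeline = Pipeline([
    ("features", FunctionTransformer(create_new_features)),
    ("model", LinearRegression()),
])
pipeline.fit(train, y_train)

# predict() applies create_new_features to the test rows automatically.
predictions = pipeline.predict(test)
print(predictions)
```

Because the feature step lives inside the pipeline, training and test data always pass through the same transformation.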

⚠️

Common Pitfalls

Common mistakes when creating new features include:

Dividing by zero without protection. pandas does not raise an error here; it silently produces inf (or NaN for 0/0), which can break downstream models.
Modifying the original data instead of working on a copy, which can cause unexpected bugs.
Giving new features vague or inconsistent names, making them hard to track.
Forgetting to apply the same transformation to test data, leading to inconsistent inputs.

python

import pandas as pd

data = pd.DataFrame({'f1': [1, 2], 'f2': [0, 0]})

# Wrong: pandas raises no error here; the result is silently inf
# data['ratio'] = data['f1'] / data['f2']

# Right: add a small constant so the denominator is never exactly zero
data['ratio'] = data['f1'] / (data['f2'] + 1e-5)
print(data)

Output

   f1  f2     ratio
0   1   0  100000.0
1   2   0  200000.0
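
Adding a small constant avoids inf but yields very large ratios, as the output above shows. One possible alternative, sketched here as an assumption rather than a recommendation from the original, is to mark zero denominators as missing and to build the result with `assign` so the input DataFrame stays unchanged.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"f1": [1, 2, 3], "f2": [0, 4, 0]})

# mask() replaces values where the condition is True with NaN,
# so a zero denominator gives a missing ratio instead of inf.
safe_denominator = data["f2"].mask(data["f2"] == 0)

# assign() returns a new DataFrame; data itself is not modified.
result = data.assign(ratio=data["f1"] / safe_denominator)
print(result)
```

The missing values then need an explicit decision later on, for example imputation or dropping the affected rows.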

📊

Quick Reference

Tips for creating new features from existing ones:

Use pandas for quick feature creation with column operations.
Wrap feature creation logic in functions for reuse and clarity.
Use sklearn's FunctionTransformer to integrate feature creation into pipelines.
Always handle edge cases like division by zero.
Keep feature names descriptive and consistent.
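
When a new feature depends on something learned from the training data, a custom transformer may be a better fit than FunctionTransformer. The class below is a hypothetical sketch: the name `RatioToTrainMean` and its `column` parameter are invented for illustration. It learns a column mean in `fit` and reuses it in `transform`, so test data is scaled with the training mean.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RatioToTrainMean(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: divide a column by its mean learned in fit."""

    def __init__(self, column="feature1"):
        self.column = column

    def fit(self, X, y=None):
        # Learn the mean from the training data only.
        self.mean_ = X[self.column].mean()
        return self

    def transform(self, X):
        # Work on a copy and use a descriptive name for the new column.
        X_new = X.copy()
        X_new[f"{self.column}_to_mean"] = X_new[self.column] / self.mean_
        return X_new

train = pd.DataFrame({"feature1": [10, 20, 30]})
test = pd.DataFrame({"feature1": [40]})

transformer = RatioToTrainMean(column="feature1")
train_out = transformer.fit_transform(train)
test_out = transformer.transform(test)  # reuses the mean learned from train
print(test_out)
```

This sketch does not guard against a training mean of zero; a real implementation would need to handle that case.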

✅

Key Takeaways

Create new features by combining or transforming existing columns using pandas or sklearn transformers.

Wrap feature creation in functions and use FunctionTransformer to integrate with sklearn pipelines.

Always protect against errors like division by zero when creating new features.

Work on copies of data to avoid modifying original datasets unexpectedly.

Keep feature names clear and consistent for easier tracking and debugging.