Feature Engineering in ML with Python: What It Is and How to Use It
Feature engineering is the process of transforming raw data into features that help models learn better. It involves creating, modifying, or selecting variables using tools like scikit-learn to improve model accuracy and performance.

How It Works
Feature engineering is like preparing ingredients before cooking a meal. Raw data is often messy or not directly useful for a machine learning model. By transforming this data into better forms, we help the model understand patterns more easily.
For example, if you have a date of birth, you might create a new feature called "age" because age is more useful for prediction than the raw date. This process can include scaling numbers, encoding categories into numbers, or combining features to create new ones.
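As a sketch of the date-of-birth example above (the `dob` column name and the dates are illustrative, not from the original):

```python
import pandas as pd

# Hypothetical data: the 'dob' column name and values are for illustration
df = pd.DataFrame({'dob': ['1990-05-01', '1985-12-24', '2001-07-15']})
df['dob'] = pd.to_datetime(df['dob'])

# Derive an 'age' feature in whole years relative to a fixed reference date,
# so the result is reproducible regardless of when the code runs
reference = pd.Timestamp('2024-01-01')
df['age'] = ((reference - df['dob']).dt.days // 365).astype(int)

print(df[['dob', 'age']])
```

A model can now learn from `age` directly, whereas the raw timestamp would carry the same information in a far less usable form.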
In Python, libraries like scikit-learn provide tools to automate many feature engineering steps, making it easier to prepare data for models.
Example
This example shows how to create new features and transform data using scikit-learn. We convert a categorical feature into numbers and scale a numeric feature.
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
import pandas as pd

# Sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green'],
    'size': [10, 20, 15, 10]
})

# Define which columns to transform
categorical_features = ['color']
numeric_features = ['size']

# Create transformers
categorical_transformer = OneHotEncoder()
numeric_transformer = StandardScaler()

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, numeric_features)
    ])

# Apply transformations
transformed_data = preprocessor.fit_transform(data)

# Show transformed data as array
print(transformed_data)
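The same preprocessor can also be chained with an estimator in a scikit-learn Pipeline, so the transformations and the model are fitted together in one step. This is a minimal sketch: the binary target values and the choice of LogisticRegression are illustrative assumptions, not part of the original example.

```python
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Same sample data as the example above
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green'],
    'size': [10, 20, 15, 10]
})
# Hypothetical binary target, for illustration only
target = [0, 1, 1, 0]

# Preprocess categorical and numeric columns separately
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), ['color']),
    ('num', StandardScaler(), ['size'])
])

# Chain preprocessing and the model into one estimator
model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('classify', LogisticRegression())
])

model.fit(data, target)
print(model.predict(data))
```

A key benefit of this design is that calling `fit` learns the encodings, the scaling statistics, and the model weights together, so the identical preprocessing is reapplied automatically whenever you call `predict` on new data.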
When to Use
Use feature engineering whenever your raw data is not ready for a machine learning model. It is especially helpful when:
- You have categorical data that needs to be converted to numbers.
- Your numeric data has different scales and needs normalization.
- You want to create new features that better represent the problem, like extracting date parts or combining variables.
Real-world examples include predicting house prices (creating features like age of house), customer churn (encoding customer categories), or fraud detection (combining transaction features).
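The date-part extraction and variable combination mentioned above can be sketched as follows; the column names (`timestamp`, `amount`, `n_items`) are hypothetical, chosen to resemble transaction data:

```python
import pandas as pd

# Hypothetical transaction data; column names are illustrative
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-03-15 09:30', '2024-03-16 23:10']),
    'amount': [120.0, 80.0],
    'n_items': [4, 2]
})

# Extract date parts as separate features
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0 ... Sunday=6

# Combine existing variables into a new feature
df['price_per_item'] = df['amount'] / df['n_items']

print(df[['hour', 'day_of_week', 'price_per_item']])
```

Features like `hour` or `price_per_item` often capture patterns (late-night activity, unusually expensive items) that the raw columns only express indirectly.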
Key Points
- Feature engineering improves model learning by creating meaningful inputs.
- It involves transforming, scaling, encoding, or creating new features.
- scikit-learn offers tools like OneHotEncoder and StandardScaler to help automate this.
- Good feature engineering can significantly boost model accuracy.