Feature Engineering in ML with Python: What It Is and How to Use It
Feature engineering is the process of transforming raw data into features that help models learn better. It involves creating, modifying, or selecting variables using tools like scikit-learn to improve model accuracy and performance.

How It Works
Feature engineering is like preparing ingredients before cooking a meal. Raw data is often messy or not directly useful for a machine learning model. By transforming this data into better forms, we help the model understand patterns more easily.
For example, if you have a date of birth, you might create a new feature called "age" because age is more useful for prediction than the raw date. This process can include scaling numbers, encoding categories into numbers, or combining features to create new ones.
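As a sketch of the date-of-birth example above (the `dob` column name and the dates are illustrative, not from the original):

```python
import pandas as pd

# Hypothetical data: the 'dob' column name and values are for illustration
df = pd.DataFrame({'dob': ['1990-05-01', '1985-12-24', '2001-07-15']})
df['dob'] = pd.to_datetime(df['dob'])

# Derive an 'age' feature in whole years relative to a fixed reference date,
# so the result is reproducible regardless of when the code runs
reference = pd.Timestamp('2024-01-01')
df['age'] = ((reference - df['dob']).dt.days // 365).astype(int)

print(df[['dob', 'age']])
```

A model can now learn from `age` directly, whereas the raw timestamp would carry the same information in a far less usable form.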
In Python, libraries like scikit-learn provide tools to automate many feature engineering steps, making it easier to prepare data for models.
Example
This example shows how to create new features and transform data using scikit-learn. We convert a categorical feature into numbers and scale a numeric feature.
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
import pandas as pd

# Sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green'],
    'size': [10, 20, 15, 10]
})

# Define which columns to transform
categorical_features = ['color']
numeric_features = ['size']

# Create transformers
categorical_transformer = OneHotEncoder()
numeric_transformer = StandardScaler()

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, numeric_features)
    ])

# Apply transformations
transformed_data = preprocessor.fit_transform(data)

# Show transformed data as array
print(transformed_data)
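The same preprocessor can also be chained with an estimator in a scikit-learn Pipeline, so the transformations and the model are fitted together in one step. This is a minimal sketch: the binary target values and the choice of LogisticRegression are illustrative assumptions, not part of the original example.

```python
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Same sample data as the example above
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green'],
    'size': [10, 20, 15, 10]
})
# Hypothetical binary target, for illustration only
target = [0, 1, 1, 0]

# Preprocess categorical and numeric columns separately
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), ['color']),
    ('num', StandardScaler(), ['size'])
])

# Chain preprocessing and the model into one estimator
model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('classify', LogisticRegression())
])

model.fit(data, target)
print(model.predict(data))
```

A key benefit of this design is that calling `fit` learns the encodings, the scaling statistics, and the model weights together, so the identical preprocessing is reapplied automatically whenever you call `predict` on new data.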
When to Use
Use feature engineering whenever your raw data is not ready for a machine learning model. It is especially helpful when:
- You have categorical data that needs to be converted to numbers.
- Your numeric data has different scales and needs normalization.
- You want to create new features that better represent the problem, like extracting date parts or combining variables.
Real-world examples include predicting house prices (creating features like age of house), customer churn (encoding customer categories), or fraud detection (combining transaction features).
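The date-part extraction and variable combination mentioned above can be sketched as follows; the column names (`timestamp`, `amount`, `n_items`) are hypothetical, chosen to resemble transaction data:

```python
import pandas as pd

# Hypothetical transaction data; column names are illustrative
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-03-15 09:30', '2024-03-16 23:10']),
    'amount': [120.0, 80.0],
    'n_items': [4, 2]
})

# Extract date parts as separate features
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0 ... Sunday=6

# Combine existing variables into a new feature
df['price_per_item'] = df['amount'] / df['n_items']

print(df[['hour', 'day_of_week', 'price_per_item']])
```

Features like `hour` or `price_per_item` often capture patterns (late-night activity, unusually expensive items) that the raw columns only express indirectly.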
Key Points
- Feature engineering improves model learning by creating meaningful inputs.
- It involves transforming, scaling, encoding, or creating new features.
- scikit-learn offers tools like OneHotEncoder and StandardScaler to help automate this.
- Good feature engineering can significantly boost model accuracy.