
How to Use RFE in sklearn for Feature Selection in Python

Use RFE from sklearn.feature_selection to select important features by recursively removing less important ones. Initialize RFE with an estimator and number of features to select, then fit it on your data to get the selected features.
📐

Syntax

The basic syntax of RFE involves creating an instance with an estimator (like a model) and the number of features you want to keep. Then you fit it on your data to perform feature selection.

  • estimator: The model used to evaluate feature importance (e.g., LogisticRegression).
  • n_features_to_select: Number of features to keep after elimination.
  • step: Number of features to remove at each iteration (default is 1).
  • fit(X, y): Fits the RFE model to your data.
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Create the model
model = LogisticRegression(max_iter=200)

# Create the RFE object and specify number of features to select
rfe = RFE(estimator=model, n_features_to_select=3, step=1)

# Fit RFE on your feature matrix X and target y
rfe.fit(X, y)

# Get boolean mask of selected features
selected_features = rfe.support_
```
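Beyond reading `support_`, a fitted RFE selector can reduce the data directly. This sketch (using a synthetic dataset for illustration, not part of the example above) combines fitting and reduction with `fit_transform`:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 100 samples, 5 features
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=200), n_features_to_select=3)
X_reduced = rfe.fit_transform(X, y)  # fit RFE and drop eliminated columns in one call

print(X_reduced.shape)  # (100, 3)
```

`transform` can likewise be applied later to new data with the same columns, so the selection learned on training data carries over consistently.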
💻

Example

This example shows how to use RFE with a logistic regression model on a simple dataset to select the top 3 features.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load sample data
iris = load_iris()
X = iris.data
y = iris.target

# Initialize logistic regression model
model = LogisticRegression(max_iter=200)

# Initialize RFE to select top 3 features
rfe = RFE(estimator=model, n_features_to_select=3)

# Fit RFE
rfe.fit(X, y)

# Print selected features mask
print('Selected features mask:', rfe.support_)

# Print feature ranking (1 means selected)
print('Feature ranking:', rfe.ranking_)
```
Output

```
Selected features mask: [ True  True  True False]
Feature ranking: [1 1 1 2]
```
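The boolean mask is more readable when mapped back to the dataset's feature names. A short follow-up sketch (not part of the output above) using NumPy indexing:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
rfe = RFE(estimator=LogisticRegression(max_iter=200), n_features_to_select=3)
rfe.fit(iris.data, iris.target)

# Index the feature-name list with the boolean mask to get readable names
selected = np.array(iris.feature_names)[rfe.support_]
print(selected)
```

This prints the names of the three features RFE kept, which is usually what you want to report rather than the raw mask.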
⚠️

Common Pitfalls

  • Leaving n_features_to_select unset keeps half the features by default, which may be too few or too many for your problem.
  • Using an estimator that does not have a coef_ or feature_importances_ attribute will cause errors.
  • For classification, ensure your target y is correctly formatted (e.g., no missing values).
  • Remember to fit the RFE object before accessing selected features.
```python
from sklearn.feature_selection import RFE
from sklearn.neighbors import KNeighborsClassifier

# Wrong: KNeighborsClassifier exposes neither coef_ nor
# feature_importances_, so RFE has no way to rank features
model = KNeighborsClassifier()
rfe = RFE(estimator=model, n_features_to_select=2)
# rfe.fit(X, y) would raise an error here

# Correct approach: use an estimator that exposes feature importance,
# such as LogisticRegression or a tree-based model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=200)
rfe = RFE(estimator=model, n_features_to_select=2)
rfe.fit(X, y)  # works: LogisticRegression has coef_
```
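A further pitfall, not shown above, is fitting RFE on the full dataset before cross-validation, which leaks information from the held-out folds into the selection. A minimal sketch of the safer pattern, wrapping RFE in a scikit-learn Pipeline so selection is refit on each training fold:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# RFE inside the pipeline is refit on each CV training fold,
# so held-out data never influences which features are kept
pipe = Pipeline([
    ('rfe', RFE(estimator=LogisticRegression(max_iter=200), n_features_to_select=3)),
    ('clf', LogisticRegression(max_iter=200)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The step names `'rfe'` and `'clf'` are arbitrary labels chosen for this sketch.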
📊

Quick Reference

RFE Quick Tips:

  • Use RFE(estimator, n_features_to_select) to create the selector.
  • Call fit(X, y) to perform feature elimination.
  • Access selected features with support_ (boolean mask).
  • Check feature ranking with ranking_ (1 means selected).
  • Choose an estimator with coef_ or feature_importances_ attribute.
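If picking n_features_to_select by hand feels arbitrary, scikit-learn also offers RFECV, which chooses the feature count by cross-validated score. A brief sketch on the same iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# RFECV runs recursive elimination and keeps the feature count
# that scores best under cross-validation
rfecv = RFECV(estimator=LogisticRegression(max_iter=200), cv=5)
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features RFECV decided to keep
```

RFECV exposes the same `support_` and `ranking_` attributes as RFE, so the rest of this guide applies unchanged.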

Key Takeaways

  • Use RFE with an estimator that exposes feature importance, such as LogisticRegression.
  • Set n_features_to_select to control how many features remain after elimination.
  • Fit the RFE object on your data before accessing selected features.
  • Check the support_ attribute for the selected-feature mask and ranking_ for feature ranks.
  • Avoid estimators without coef_ or feature_importances_ attributes.