How to Use RFE in sklearn for Feature Selection in Python
Use RFE from sklearn.feature_selection to select important features by recursively removing the least important ones. Initialize RFE with an estimator and the number of features to select, then fit it on your data to get the selected features.

Syntax
The basic syntax of RFE involves creating an instance with an estimator (like a model) and the number of features you want to keep. Then you fit it on your data to perform feature selection.
- estimator: The model used to evaluate feature importance (e.g., LogisticRegression).
- n_features_to_select: Number of features to keep after elimination.
- step: Number of features to remove at each iteration (default is 1).
- fit(X, y): Fits the RFE model to your data.
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Create the model
model = LogisticRegression(max_iter=200)

# Create the RFE object and specify number of features to select
rfe = RFE(estimator=model, n_features_to_select=3, step=1)

# Fit RFE on your feature matrix X and target y
rfe.fit(X, y)

# Get mask of selected features
selected_features = rfe.support_
```
Example
This example shows how to use RFE with a logistic regression model on a simple dataset to select the top 3 features.
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load sample data
iris = load_iris()
X = iris.data
y = iris.target

# Initialize logistic regression model
model = LogisticRegression(max_iter=200)

# Initialize RFE to select top 3 features
rfe = RFE(estimator=model, n_features_to_select=3)

# Fit RFE
rfe.fit(X, y)

# Print selected features mask
print('Selected features mask:', rfe.support_)

# Print feature ranking (1 means selected)
print('Feature ranking:', rfe.ranking_)
```
Output
```
Selected features mask: [ True  True  True False]
Feature ranking: [1 1 1 2]
```
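Once fitted, the selector can also reduce the data directly rather than just reporting a mask. This short sketch reuses the iris setup above, assuming the same estimator settings, to show transform and how to map the boolean mask back to feature names:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

rfe = RFE(estimator=LogisticRegression(max_iter=200), n_features_to_select=3)
rfe.fit(X, y)

# Reduce X to only the selected columns
X_selected = rfe.transform(X)
print(X_selected.shape)  # (150, 3)

# Map the boolean mask back to human-readable feature names
selected_names = [name for name, keep in zip(iris.feature_names, rfe.support_) if keep]
print(selected_names)
```

transform applies the same support_ mask, so X_selected has exactly n_features_to_select columns.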
Common Pitfalls
- Not setting n_features_to_select properly can lead to selecting too few or too many features.
- Using an estimator that does not expose a coef_ or feature_importances_ attribute will cause an error.
- For classification, ensure your target y is correctly formatted (e.g., no missing values).
- Remember to fit the RFE object before accessing selected features.
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Wrong: KNeighborsClassifier exposes neither coef_ nor feature_importances_,
# so fitting RFE with it raises an error
model = KNeighborsClassifier()
rfe = RFE(estimator=model, n_features_to_select=2)
# rfe.fit(X, y)  # would fail

# Correct approach: use an estimator that exposes coef_, such as LogisticRegression,
# or a tree-based model that exposes feature_importances_
model = LogisticRegression(max_iter=200)
rfe = RFE(estimator=model, n_features_to_select=2)
rfe.fit(X, y)  # Correct usage
```
Quick Reference
RFE Quick Tips:
- Use RFE(estimator, n_features_to_select) to create the selector.
- Call fit(X, y) to perform feature elimination.
- Access selected features with support_ (boolean mask).
- Check feature ranking with ranking_ (1 means selected).
- Choose an estimator with a coef_ or feature_importances_ attribute.
Key Takeaways
Use RFE with an estimator that exposes feature importances, such as LogisticRegression.
Set n_features_to_select to control how many features remain after elimination.
Fit the RFE object on your data before accessing selected features.
Check the support_ attribute for selected features and ranking_ for feature order.
Avoid using estimators without coef_ or feature_importances_ attributes with RFE.
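As a final illustration of the takeaways, a tree-based estimator such as RandomForestClassifier also works with RFE because it exposes feature_importances_; this is a minimal sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_iris(return_X_y=True)

# RandomForestClassifier exposes feature_importances_, so RFE can rank with it
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=2)
rfe.fit(X, y)

print('Selected mask:', rfe.support_)
```

Tree-based estimators can be a good choice here when the relationship between features and target is nonlinear.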