Recursive Feature Elimination (RFE) helps pick the best features for a model. The key thing to watch is the model's performance metric, such as accuracy, F1 score, or mean squared error, after each feature removal step. This shows whether removing features helps or hurts the model. We want to keep the features whose removal would lower performance and drop those whose removal leaves performance stable or improves it.
Recursive feature elimination in ML Python - Model Metrics & Evaluation
Imagine a classification model using RFE. After selecting features, we check the confusion matrix:
| | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP): 50 | False Negative (FN): 10 |
| Actual Negative | False Positive (FP): 5 | True Negative (TN): 35 |
Total samples = 50 + 10 + 5 + 35 = 100
Precision = 50 / (50 + 5) ≈ 0.91
Recall = 50 / (50 + 10) ≈ 0.83
F1 Score = 2 * (0.91 * 0.83) / (0.91 + 0.83) ≈ 0.87
These metrics tell us how well the model performs with the chosen features.
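The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch that uses only the four counts from the confusion matrix; it also computes accuracy, which the worked example does not list.

```python
# Counts taken from the confusion matrix above.
tp, fn, fp, tn = 50, 10, 5, 35

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.85 precision=0.91 recall=0.83 f1=0.87
```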
When RFE removes features, it can affect precision and recall differently.
- High Precision Needed: For spam detection, we want to avoid marking good emails as spam. So, RFE should keep features that help precision.
- High Recall Needed: For disease detection, missing a sick patient is bad. RFE should keep features that help recall.
RFE helps find the smallest feature set that balances these metrics well.
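One way to steer feature selection toward a particular metric is scikit-learn's `RFECV`, which picks the number of features by cross-validated score. The sketch below is illustrative only: the synthetic dataset, its class weights, and the choice of `LogisticRegression` are assumptions, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data (assumed for illustration only).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

selectors = {}
for metric in ("precision", "recall"):
    # scoring decides which metric RFECV maximises when choosing
    # how many features to keep.
    selector = RFECV(
        LogisticRegression(max_iter=1000),
        step=1, cv=5, scoring=metric, min_features_to_select=1,
    )
    selector.fit(X, y)
    selectors[metric] = selector
    print(metric, "->", selector.n_features_, "features kept")
```

The two runs may or may not keep the same number of features; the point is that the selection criterion follows the metric that matters for the task.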
Good: After RFE, model accuracy or F1 score stays high or improves. For example, accuracy above 90% with fewer features means success.
Bad: Metrics drop a lot after removing features, like accuracy falling from 90% to 70%. This means important features were removed.
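To see whether a metric stays stable or drops as features are removed, one option is to score the model at several feature counts. The sketch below assumes scikit-learn's bundled breast cancer dataset, a scaled `LogisticRegression`, and an arbitrary list of feature counts; none of these come from the original text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 30 features

scores = {}
for k in (30, 20, 10, 5, 2):
    # RFE sits inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(
        StandardScaler(),
        RFE(LogisticRegression(max_iter=1000), n_features_to_select=k),
        LogisticRegression(max_iter=1000),
    )
    scores[k] = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
    print(f"{k:>2} features: mean F1 = {scores[k]:.3f}")
```

A flat or rising curve as `k` shrinks corresponds to the "good" case; a sharp drop at some `k` suggests that important features were removed at that point.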
- Data leakage: If RFE is run on the whole dataset before splitting, information from the test rows leaks into feature selection and inflates the metrics.
- Ignoring Validation: Only checking training accuracy can mislead. Always check metrics on unseen data.
- Accuracy Paradox: High accuracy can hide poor recall or precision if classes are imbalanced.
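A minimal sketch of avoiding these pitfalls: split first, fit RFE on the training rows only, and report precision and recall alongside accuracy on the held-out rows. The synthetic imbalanced dataset and the parameter values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (assumed for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# Split first so the test rows never influence which features are kept.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_train, y_train)

# rfe.predict reduces X_test to the selected features, then uses the
# estimator that was refit on those features.
y_pred = rfe.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred, zero_division=0))
```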
Your model after RFE has 98% accuracy but only 12% recall on fraud cases. Is it good?
Answer: No. Even with high accuracy, the model misses most fraud cases (low recall). For fraud detection, recall is critical to catch fraud. So, this model is not good for production.
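The numbers in this scenario are easy to reproduce with a hypothetical confusion matrix. The counts below are invented purely to match 98% accuracy and 12% recall; they do not come from a real dataset.

```python
# Hypothetical counts chosen to give 98% accuracy and 12% recall.
total = 10_000
fraud = 200                 # actual fraud cases (2% of the data)
tp = 24                     # fraud cases the model catches
fn = fraud - tp             # 176 fraud cases missed
fp = 24                     # legitimate transactions flagged as fraud
tn = total - fraud - fp     # 9776 legitimate transactions passed

accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# A model that never predicts fraud reaches the same accuracy.
baseline_accuracy = (total - fraud) / total

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"baseline_accuracy={baseline_accuracy:.2f}")
# accuracy=0.98 recall=0.12 baseline_accuracy=0.98
```

With only 2% fraud in this made-up data, a model that flags nothing scores just as well on accuracy, which is why accuracy alone says little here.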
Practice
What is the main purpose of Recursive Feature Elimination (RFE) in machine learning?
Solution
Step 1: Understand the purpose of RFE
RFE works by removing the least important features step by step (one at a time by default) to keep only the best ones.
Step 2: Compare options to the purpose
Only "To select the most important features by removing less important ones step by step" describes this step-by-step removal of less important features.
Final Answer:
To select the most important features by removing less important ones step by step -> Option A
Quick Check:
RFE = Stepwise feature removal [OK]
- Thinking RFE adds or creates features
- Confusing RFE with random feature shuffling
- Believing RFE increases feature count
Solution
Step 1: Recall the correct import statement
The class is named `RFE` and is in `sklearn.feature_selection`.
Step 2: Match options with correct syntax
`from sklearn.feature_selection import RFE` correctly imports `RFE` from `sklearn.feature_selection`.
Final Answer:
from sklearn.feature_selection import RFE -> Option B
Quick Check:
Correct import is 'from sklearn.feature_selection import RFE' [OK]
- Using wrong module name like sklearn.selection
- Trying to import full name RecursiveFeatureElimination
- Using incorrect import syntax
What is the output of print(selected_features)?

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import RFE

    iris = load_iris()
    X, y = iris.data, iris.target

    # Keep the 2 features the logistic regression ranks highest.
    model = LogisticRegression(max_iter=200)
    rfe = RFE(model, n_features_to_select=2)
    rfe.fit(X, y)

    selected_features = rfe.support_  # boolean mask, one entry per feature
    print(selected_features)

Solution
Step 1: Understand RFE output
The `support_` attribute is a boolean array showing which features are selected.
Step 2: Run RFE with LogisticRegression on iris dataset
RFE selects the two most important features, which for iris are the last two features (petal length and petal width), so the output is [False False True True].
Final Answer:
[False False True True] -> Option D
Quick Check:
RFE selects last two iris features = [False False True True] [OK]
- Assuming first two features are selected
- Confusing support_ with ranking_
- Not setting max_iter causing convergence warnings
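Because `support_` and `ranking_` are easy to mix up, it can help to print them side by side. This sketch reuses the iris setup from the question. `support_` is a boolean mask of kept features. `ranking_` gives every kept feature rank 1, and higher ranks mean the feature was eliminated earlier.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
rfe = RFE(LogisticRegression(max_iter=200), n_features_to_select=2)
rfe.fit(iris.data, iris.target)

# One row per feature: name, whether it was kept, and its rank.
for name, kept, rank in zip(iris.feature_names, rfe.support_, rfe.ranking_):
    print(f"{name:<20} kept={kept!s:<5} rank={rank}")
```

With four features and two kept, the ranks are 1, 1, 2 and 3: the two selected features share rank 1, and rank 3 marks the feature that was dropped first.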

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    rfe = RFE(model, n_features_to_select=0)
    rfe.fit(X, y)

Solution
Step 1: Check the `n_features_to_select` parameter
This parameter must be at least 1 or None; zero is invalid.
Step 2: Identify correct fix
Setting `n_features_to_select` to a positive integer fixes the error.
Final Answer:
n_features_to_select cannot be zero; set it to a positive integer -> Option A
Quick Check:
n_features_to_select > 0 required [OK]
- Setting n_features_to_select to zero
- Wrong import paths for LogisticRegression
- Thinking random_state is mandatory for RFE
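For reference, here is a short sketch of the accepted forms of `n_features_to_select`. It assumes scikit-learn 0.24 or newer, where a float between 0 and 1 is read as a fraction of the features and `None` keeps half of them; the iris data is used only because it is small.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 4 features

kept = {}
for value in (3, 0.5, None):
    rfe = RFE(LogisticRegression(max_iter=200), n_features_to_select=value)
    rfe.fit(X, y)
    kept[value] = rfe.n_features_  # number of features actually kept
    print(f"n_features_to_select={value!r:<5} -> {rfe.n_features_} features")
```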
df and target in y?
Solution
Step 1: Check correct fit method usage
Features (df) must be the first argument and target (y) the second in `fit`.
Step 2: Select features using `support_`
Use the boolean mask `rfe.support_` to get selected features, then map to column names.
Final Answer:
Code snippet A correctly fits and selects features using support_ mask -> Option C
Quick Check:
fit(df, y) + support_ mask = correct feature selection [OK]
- Swapping X and y in fit method
- Using ranking_ == 5 instead of support_
- Not converting boolean mask to column names
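The pattern described in this solution can be sketched as follows. The quiz's own snippets are not reproduced here, so the DataFrame below is an assumed stand-in built from the iris data, and `selected_columns` and `df_selected` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Stand-in data: features in a DataFrame df, target in y.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

rfe = RFE(LogisticRegression(max_iter=200), n_features_to_select=2)
rfe.fit(df, y)  # features first, target second

# Map the boolean mask back to column names.
selected_columns = df.columns[rfe.support_].tolist()
df_selected = df[selected_columns]

print(selected_columns)
```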
