Feature selection keeps the most informative features (columns) in your data and drops the rest. This makes models simpler and faster.
Feature selection methods in ML Python
Introduction
Syntax
ML Python
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=3)
X_new = selector.fit_transform(X, y)
SelectKBest picks the top k features based on a scoring function.
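score_func sets the statistical test used to score each feature, for example chi2 (classification with non-negative features) or f_classif (ANOVA F-value for classification).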
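The syntax above assumes X and y already exist. Here is a minimal runnable sketch, assuming the iris dataset as stand-in data. It also shows one thing to watch for: chi2 raises an error on negative values. Rescaling with MinMaxScaler is one possible workaround and is an assumption of this sketch; chi2 is really meant for non-negative features such as counts or frequencies.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Iris measurements are all positive, so chi2 works directly
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=3)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (150, 3)

# Centering the data introduces negative values, which chi2 rejects
X_centered = X - X.mean(axis=0)
try:
    SelectKBest(score_func=chi2, k=3).fit(X_centered, y)
    raised = False
except ValueError:
    raised = True
print('chi2 rejected negative input:', raised)

# Possible workaround (assumption): rescale each feature into [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X_centered)
X_scaled_new = SelectKBest(score_func=chi2, k=3).fit_transform(X_scaled, y)
print(X_scaled_new.shape)  # (150, 3)
```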
Examples
ML Python
# Keep the 2 features with the highest ANOVA F-values
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
ML Python
# Recursive feature elimination: refit the model and drop the weakest feature until 3 remain
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=3)
X_new = rfe.fit_transform(X, y)
ML Python
# Keep only features whose variance is above 0.1 (no labels needed)
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)
X_new = selector.fit_transform(X)
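The examples above assume X and y are already defined. As a minimal runnable sketch of the RFE example, assuming the iris dataset as stand-in data, the code below also prints which features were kept and in what order the others were eliminated.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Stand-in data (assumption): iris has 150 samples and 4 features
X, y = load_iris(return_X_y=True)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
X_new = rfe.fit_transform(X, y)

print(X_new.shape)   # (150, 3)
print(rfe.support_)  # boolean mask: True for the 3 kept features
print(rfe.ranking_)  # 1 = kept; larger numbers were eliminated earlier
```

With 4 features and 3 to keep, exactly one feature is eliminated, so ranking_ contains three 1s and a single 2.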
Sample Model
This code loads the iris dataset, selects the top 2 features using ANOVA F-value, and prints the results.
ML Python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load example data
X, y = load_iris(return_X_y=True)

# Select top 2 features using ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print('Original shape:', X.shape)
print('New shape after feature selection:', X_new.shape)
# scores_ holds the F-value of every original feature, not only the selected ones
print('Feature scores (all features):', selector.scores_)
print('Selected features mask:', selector.get_support())
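The boolean mask is easier to read when mapped back to feature names. A small sketch, assuming the names come from the feature_names field of the loaded iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
mask = selector.get_support()

# Pair each feature name with its mask entry and keep the selected ones
selected_names = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(selected_names)  # ['petal length (cm)', 'petal width (cm)']
```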
Important Notes
Feature selection can improve model speed and reduce overfitting.
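Always check whether feature selection actually improves your model, for example by comparing validation scores with and without it.
Some methods need target labels (supervised, such as SelectKBest and RFE); others don't (unsupervised, such as VarianceThreshold).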
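One way to test the effect of feature selection is to compare cross-validated accuracy with and without it. The sketch below is an assumption about how such a check might look: it uses the iris dataset, logistic regression, and 5-fold cross-validation as stand-ins. Putting the selector inside a pipeline means it is refit on each training fold, so the held-out fold does not influence which features are chosen.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Model trained on all 4 features
baseline = LogisticRegression(max_iter=1000)

# Same model, but preceded by selection of the top 2 features
with_selection = make_pipeline(
    SelectKBest(score_func=f_classif, k=2),
    LogisticRegression(max_iter=1000),
)

baseline_scores = cross_val_score(baseline, X, y, cv=5)
selected_scores = cross_val_score(with_selection, X, y, cv=5)

print('Mean accuracy, all features:', baseline_scores.mean())
print('Mean accuracy, top 2 features:', selected_scores.mean())
```

If the two means are close, the smaller model may be preferable because it is simpler; if accuracy drops noticeably, try keeping more features.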
Summary
Feature selection picks the most useful data features for your model.
Common methods include SelectKBest, RFE, and VarianceThreshold.
Using feature selection can make models simpler, faster, and sometimes more accurate.
Practice
1. Which of the following best describes the purpose of feature selection in machine learning?
easy
Solution
Step 1: Understand feature selection goal
Feature selection aims to pick the most useful features that help the model learn better.
Step 2: Evaluate options
Only "To choose the most important features to improve model performance" correctly describes the purpose of feature selection.
Final Answer:
To choose the most important features to improve model performance -> Option A
Quick Check:
Feature selection = pick important features [OK]
Hint: Feature selection picks useful features, not random or all [OK]
Common Mistakes:
- Thinking feature selection adds features
- Confusing feature selection with feature engineering
- Believing feature selection changes labels
2. Which Python library provides the SelectKBest feature selection method?
easy
Solution
Step 1: Recall common ML libraries
Scikit-learn is the main library for machine learning tools including feature selection.
Step 2: Match method to library
SelectKBest is part of scikit-learn's feature_selection module, not pandas, numpy, or matplotlib.
Final Answer:
scikit-learn -> Option B
Quick Check:
SelectKBest = scikit-learn [OK]
Hint: SelectKBest is from scikit-learn, not data or plotting libs [OK]
Common Mistakes:
- Choosing pandas because it handles data
- Confusing numpy with ML feature tools
- Selecting matplotlib which is for plotting
3. What will be the output shape of features after applying VarianceThreshold(threshold=0.1) on a dataset with shape (100, 5) where only 3 features have variance above 0.1?
medium
Solution
Step 1: Understand VarianceThreshold effect
VarianceThreshold removes features with variance below the threshold, keeping only those above it.
Step 2: Apply to given data
Since 3 features have variance above 0.1, only those 3 remain. The number of samples (100) stays the same.
Final Answer:
(100, 3) -> Option D
Quick Check:
VarianceThreshold keeps features with variance > threshold [OK]
Hint: Output shape keeps rows, columns = features passing threshold [OK]
Common Mistakes:
- Confusing rows and columns in shape
- Assuming all features remain
- Thinking variance threshold changes sample count
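The scenario in this question can be reproduced with synthetic data. The sketch below is an assumption about what such a dataset could look like: three columns drawn with variance near 1 and two columns with variance near 0.01.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# 3 features with variance around 1.0 (well above the 0.1 threshold)
high = rng.normal(scale=1.0, size=(100, 3))
# 2 features with variance around 0.01 (well below the 0.1 threshold)
low = rng.normal(scale=0.1, size=(100, 2))
X = np.hstack([high, low])  # shape (100, 5)

selector = VarianceThreshold(threshold=0.1)
X_new = selector.fit_transform(X)

print(selector.variances_)     # per-feature variance measured during fit
print(selector.get_support())  # [ True  True  True False False]
print(X_new.shape)             # (100, 3)
```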
4. Consider this code snippet:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=2)
rfe.fit(X, y)
selected = rfe.transform(X)
print(selected.shape)

If X has shape (50, 4), but the output shape is (50, 4), what is the likely error?
medium
Solution
Step 1: Understand RFE usage
RFE must be fitted before transform can reduce the features.
Step 2: Check given code and output
As written, the snippet calls fit before transform, so it would print (50, 2). An unchanged shape of (50, 4) means the printed data never went through a fitted RFE.
Step 3: Identify cause
The most likely cause is that the fit step was skipped or did not complete, so the reduction was never applied. Note that recent scikit-learn versions raise NotFittedError when transform is called on an unfitted RFE instead of returning the data unchanged.
Final Answer:
RFE was not fitted before transform -> Option C
Quick Check:
Fit RFE before transform to reduce features [OK]
Hint: Ensure RFE is fitted before transform [OK]
Common Mistakes:
- Assuming transform always reduces features without fitting
- Ignoring the need to fit RFE
- Thinking model type causes shape issue
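To see how RFE behaves with and without fitting, here is a minimal sketch on synthetic data. The dataset from make_classification is an assumption, since the question does not define X and y. In recent scikit-learn versions, calling transform on an unfitted RFE raises NotFittedError.

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 50 samples, 4 features
X, y = make_classification(n_samples=50, n_features=4, random_state=0)

# Fitted RFE: transform keeps the 2 selected features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
selected = rfe.transform(X)
print(selected.shape)  # (50, 2)

# Unfitted RFE: transform fails instead of returning X unchanged
unfitted = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
try:
    unfitted.transform(X)
    error_raised = False
except NotFittedError:
    error_raised = True
print('NotFittedError raised:', error_raised)
```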
5. You have a dataset with 10 features, but 4 are highly correlated and 2 have very low variance. Which feature selection approach best improves model simplicity and speed?
hard
Solution
Step 1: Identify problem features
Low variance features add little information; correlated features add redundancy.
Step 2: Choose method to remove both
VarianceThreshold removes low variance features; a correlation filter removes redundant correlated features.
Step 3: Evaluate options
"Apply VarianceThreshold to remove low variance, then use correlation filter to drop correlated features" combines both methods, so it addresses both problems and improves simplicity and speed.
Final Answer:
Apply VarianceThreshold to remove low variance, then use correlation filter to drop correlated features -> Option A
Quick Check:
Remove low variance + correlated features = simpler model [OK]
Hint: Combine variance and correlation filters for best feature reduction [OK]
Common Mistakes:
- Using only one method ignoring other feature issues
- Randomly dropping features without reason
- Keeping all features with RFE without reduction
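The two-step approach from this question can be sketched as follows. scikit-learn has no built-in correlation filter, so drop_correlated is a hypothetical helper written for this example, and the synthetic data and both thresholds are assumptions chosen to mirror the question: 4 highly correlated features, 2 low variance features, and 4 others.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold


def drop_correlated(X, threshold=0.9):
    """Hypothetical helper: keep a column only if its absolute Pearson
    correlation with every already-kept column is at most `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in keep):
            keep.append(j)
    return X[:, keep], keep


rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 5))                           # 5 independent features
copies = base[:, [0]] + 0.01 * rng.normal(size=(n, 3))   # 3 near-copies of feature 0
low_var = 1.0 + 0.001 * rng.normal(size=(n, 2))          # 2 nearly constant features
X = np.hstack([base, copies, low_var])                   # shape (200, 10)

# Step 1: remove the low variance features
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)
print(X_var.shape)  # (200, 8)

# Step 2: keep one representative of each highly correlated group
X_final, kept = drop_correlated(X_var, threshold=0.9)
print(X_final.shape)  # (200, 5)
print(kept)           # [0, 1, 2, 3, 4]
```

The variance filter runs first on purpose: a constant column has no defined correlation, so removing such columns before computing the correlation matrix avoids NaN values.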
