How to Select Features in Python Using sklearn
To select features in Python, use sklearn.feature_selection tools such as SelectKBest or RFE. These methods pick the most informative features either by scoring each feature individually or by recursively removing the least useful ones.
Syntax
Feature selection in sklearn typically uses classes like SelectKBest or RFE. You first create a selector object, fit it to your data, then transform your features.
- SelectKBest(score_func, k): selects the top k features based on a scoring function.
- RFE(estimator, n_features_to_select): recursively removes features using an estimator until the desired number remains.
```python
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# SelectKBest syntax
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)
X_new = selector.transform(X)

# RFE syntax
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=3)
rfe.fit(X, y)
X_rfe = rfe.transform(X)
```
Example
This example shows how to select the top 3 features from the iris dataset using SelectKBest with the ANOVA F-value scoring function.
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Select top 3 features
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)

print('Original shape:', X.shape)
print('New shape after feature selection:', X_new.shape)
print('Selected feature indices:', selector.get_support(indices=True))
```
Output
```
Original shape: (150, 4)
New shape after feature selection: (150, 3)
Selected feature indices: [0 2 3]
```
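RFE works the same way on this dataset but ranks features using a model rather than a univariate score. The sketch below runs RFE on the same iris data with a LogisticRegression estimator; keeping 2 features is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Recursively eliminate features until 2 remain,
# using the model's coefficients to rank them
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print('New shape after RFE:', X_rfe.shape)
print('Selected feature mask:', rfe.support_)
print('Feature ranking (1 = selected):', rfe.ranking_)
```

The support_ mask marks which columns were kept, and ranking_ orders the discarded features by when they were eliminated.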
Common Pitfalls
Common mistakes when selecting features include:
- Fitting the selector on the full dataset instead of the training data only, causing data leakage.
- Choosing k too large or too small without validation.
- Using feature selection methods incompatible with the model or data type.
Always fit selectors on the training data and validate the number of features chosen.
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Wrong: fitting the selector on the full data before splitting
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)  # data leakage here

# Right: split first, then fit the selector only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
selector = SelectKBest(score_func=f_classif, k=2)
X_train_new = selector.fit_transform(X_train, y_train)
X_test_new = selector.transform(X_test)
```
Quick Reference
| Method | Description | When to Use |
|---|---|---|
| SelectKBest | Selects top k features by scoring function | When you want simple univariate feature selection |
| RFE | Recursively removes least important features using a model | When you want model-based feature selection |
| SelectFromModel | Selects features based on importance weights from a model | When model provides feature importance |
| VarianceThreshold | Removes features with low variance | To remove features with little information |
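As a rough illustration of the last two rows of the table, the sketch below runs VarianceThreshold and SelectFromModel on the iris data. The 0.5 variance cutoff and the RandomForestClassifier estimator are arbitrary choices for demonstration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

iris = load_iris()
X, y = iris.data, iris.target

# VarianceThreshold: drop features whose variance is below 0.5
# (unsupervised - it never looks at y)
vt = VarianceThreshold(threshold=0.5)
X_vt = vt.fit_transform(X)
print('After VarianceThreshold:', X_vt.shape)

# SelectFromModel: keep features whose importance exceeds the mean
# importance reported by the fitted model
sfm = SelectFromModel(RandomForestClassifier(random_state=42), threshold='mean')
X_sfm = sfm.fit_transform(X, y)
print('After SelectFromModel:', X_sfm.shape)
```

Note that VarianceThreshold is sensitive to feature scale, so apply it before any standardization that would equalize variances.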
Key Takeaways
- Use sklearn.feature_selection tools like SelectKBest or RFE to pick important features.
- Always fit feature selectors on training data to avoid data leakage.
- Choose the number of features (k) carefully and validate your choice.
- Model-based selectors like RFE work well when you have a strong estimator.
- Simple methods like VarianceThreshold help remove uninformative features quickly.