MLOps · How-To · Beginner · 4 min read

How to Select Features in Python Using sklearn

To select features in Python, use sklearn.feature_selection tools like SelectKBest or RFE. These methods help pick the most important features by scoring or recursively removing less useful ones.
📐 Syntax

Feature selection in sklearn typically uses classes like SelectKBest or RFE. You first create a selector object, fit it to your data, then transform your features.

  • SelectKBest(score_func, k): Selects top k features based on a scoring function.
  • RFE(estimator, n_features_to_select): Recursively removes features using an estimator until the desired number remains.
```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# X: feature matrix, y: target labels (defined elsewhere)

# SelectKBest syntax
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)
X_new = selector.transform(X)

# RFE syntax
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=3)
rfe.fit(X, y)
X_rfe = rfe.transform(X)
```
💻 Example

This example shows how to select the top 3 features from the iris dataset using SelectKBest with the ANOVA F-value scoring function.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Select top 3 features
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)

print('Original shape:', X.shape)
print('New shape after feature selection:', X_new.shape)
print('Selected feature indices:', selector.get_support(indices=True))
```

Output:

```
Original shape: (150, 4)
New shape after feature selection: (150, 3)
Selected feature indices: [0 2 3]
```
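Column indices like `[0 2 3]` are hard to read on their own. As a small follow-up sketch, the boolean mask from `get_support()` can be zipped with `iris.feature_names` to recover the names of the kept columns:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)

# Boolean mask of selected columns, mapped back to the original feature names
mask = selector.get_support()
selected_names = [name for name, keep in zip(iris.feature_names, mask) if keep]
print('Selected features:', selected_names)
# → ['sepal length (cm)', 'petal length (cm)', 'petal width (cm)']
```

In recent sklearn versions, `selector.get_feature_names_out(iris.feature_names)` gives the same result in one call.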
⚠️ Common Pitfalls

Common mistakes when selecting features include:

  • Not fitting the selector on training data only, causing data leakage.
  • Choosing k too large or too small without validation.
  • Using feature selection methods incompatible with the model or data type.

Always fit selectors on training data and validate the number of features chosen.

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Wrong: fitting selector on full data before splitting
iris = load_iris()
X, y = iris.data, iris.target
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)  # Data leakage here

# Right: split first, then fit selector only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
selector = SelectKBest(score_func=f_classif, k=2)
X_train_new = selector.fit_transform(X_train, y_train)
X_test_new = selector.transform(X_test)
```
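One way to validate k rather than guessing it is to put the selector inside a Pipeline and cross-validate over candidate values; because the selector is re-fit on each training fold only, the scores stay leakage-free. The sketch below assumes the iris data again; the parameter grid and `cv=5` are illustrative choices, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# The selector lives inside the Pipeline, so each CV fold fits it
# on that fold's training portion only -- no leakage.
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Try every possible k for the 4 iris features
grid = GridSearchCV(pipe, param_grid={'select__k': [1, 2, 3, 4]}, cv=5)
grid.fit(X, y)

print('Best k:', grid.best_params_['select__k'])
print('CV accuracy: %.3f' % grid.best_score_)
```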
📊 Quick Reference

| Method | Description | When to Use |
| --- | --- | --- |
| SelectKBest | Selects the top k features by a scoring function | Simple univariate feature selection |
| RFE | Recursively removes the least important features using a model | Model-based feature selection |
| SelectFromModel | Selects features based on importance weights from a model | When the model exposes feature importances |
| VarianceThreshold | Removes features with low variance | Quickly dropping near-constant features |
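The last two rows of the table have no example above. As a minimal sketch on the iris data (the 0.2 variance cutoff and the RandomForestClassifier are illustrative assumptions, not recommendations), they can be used like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

X, y = load_iris(return_X_y=True)

# VarianceThreshold: drop features whose variance falls below the cutoff.
# The 0.2 cutoff here is arbitrary, chosen only for illustration.
vt = VarianceThreshold(threshold=0.2)
X_vt = vt.fit_transform(X)
print('After VarianceThreshold:', X_vt.shape)

# SelectFromModel: keep features whose importance from the fitted model
# exceeds a threshold (by default, the mean importance).
sfm = SelectFromModel(RandomForestClassifier(random_state=0))
X_sfm = sfm.fit_transform(X, y)
print('After SelectFromModel:', X_sfm.shape)
```

Note that VarianceThreshold is unsupervised (it never sees y), so it only removes near-constant columns; it says nothing about how useful the surviving features are for prediction.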

Key Takeaways

  • Use sklearn.feature_selection tools like SelectKBest or RFE to pick important features.
  • Always fit feature selectors on training data only to avoid data leakage.
  • Choose the number of features (k) carefully and validate your choice.
  • Model-based selectors like RFE work well when you have a strong estimator.
  • Simple methods like VarianceThreshold help remove uninformative features quickly.