How to Select Features in Python Using sklearn
To select features in Python, use sklearn.feature_selection tools such as SelectKBest or RFE. These methods pick the most informative features either by scoring each feature individually or by recursively removing the least useful ones.
Syntax
Feature selection in sklearn typically uses classes like SelectKBest or RFE. You first create a selector object, fit it to your data, then transform your features.
- SelectKBest(score_func, k): selects the top k features based on a scoring function.
- RFE(estimator, n_features_to_select): recursively removes features using an estimator until the desired number remains.
```python
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# SelectKBest syntax
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)
X_new = selector.transform(X)

# RFE syntax
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=3)
rfe.fit(X, y)
X_rfe = rfe.transform(X)
```
Example
This example shows how to select the top 3 features from the iris dataset using SelectKBest with the ANOVA F-value scoring function.
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Select top 3 features
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)

print('Original shape:', X.shape)
print('New shape after feature selection:', X_new.shape)
print('Selected feature indices:', selector.get_support(indices=True))
```
Output
```
Original shape: (150, 4)
New shape after feature selection: (150, 3)
Selected feature indices: [0 2 3]
```
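RFE works the same way on this dataset but ranks features using a model rather than a univariate score. The sketch below runs RFE on the same iris data with a LogisticRegression estimator; keeping 2 features is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Recursively eliminate features until 2 remain,
# using the model's coefficients to rank them
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print('New shape after RFE:', X_rfe.shape)
print('Selected feature mask:', rfe.support_)
print('Feature ranking (1 = selected):', rfe.ranking_)
```

The support_ mask marks which columns were kept, and ranking_ orders the discarded features by when they were eliminated.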
Common Pitfalls
Common mistakes when selecting features include:
- Fitting the selector on the full dataset instead of the training data only, causing data leakage.
- Choosing k too large or too small without validation.
- Using feature selection methods incompatible with the model or data type.
Always fit selectors on the training data and validate the number of features chosen.
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Wrong: fitting the selector on the full data before splitting
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)  # data leakage here

# Right: split first, then fit the selector only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
selector = SelectKBest(score_func=f_classif, k=2)
X_train_new = selector.fit_transform(X_train, y_train)
X_test_new = selector.transform(X_test)
```
Quick Reference
| Method | Description | When to Use |
|---|---|---|
| SelectKBest | Selects top k features by scoring function | When you want simple univariate feature selection |
| RFE | Recursively removes least important features using a model | When you want model-based feature selection |
| SelectFromModel | Selects features based on importance weights from a model | When model provides feature importance |
| VarianceThreshold | Removes features with low variance | To remove features with little information |
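As a rough illustration of the last two rows of the table, the sketch below runs VarianceThreshold and SelectFromModel on the iris data. The 0.5 variance cutoff and the RandomForestClassifier estimator are arbitrary choices for demonstration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

iris = load_iris()
X, y = iris.data, iris.target

# VarianceThreshold: drop features whose variance is below 0.5
# (unsupervised - it never looks at y)
vt = VarianceThreshold(threshold=0.5)
X_vt = vt.fit_transform(X)
print('After VarianceThreshold:', X_vt.shape)

# SelectFromModel: keep features whose importance exceeds the mean
# importance reported by the fitted model
sfm = SelectFromModel(RandomForestClassifier(random_state=42), threshold='mean')
X_sfm = sfm.fit_transform(X, y)
print('After SelectFromModel:', X_sfm.shape)
```

Note that VarianceThreshold is sensitive to feature scale, so apply it before any standardization that would equalize variances.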
Key Takeaways
- Use sklearn.feature_selection tools like SelectKBest or RFE to pick important features.
- Always fit feature selectors on training data to avoid data leakage.
- Choose the number of features (k) carefully and validate your choice.
- Model-based selectors like RFE work well when you have a strong estimator.
- Simple methods like VarianceThreshold help remove uninformative features quickly.