
How to Use SelectKBest in sklearn for Feature Selection in Python

Use SelectKBest from sklearn.feature_selection to select the top k features based on a scoring function. Initialize it with a score function like f_classif and the number of features k, then fit it to your data and transform your features.

📐 Syntax

The basic syntax for using SelectKBest is:

  • SelectKBest(score_func, k): Creates a selector that picks the top k features based on the score_func.
  • score_func: A function that scores each feature, e.g., f_classif for classification tasks.
  • k: Number of top features to select. Use k='all' to keep all features.
  • Use fit(X, y) to compute scores and transform(X) to reduce features.
```python
from sklearn.feature_selection import SelectKBest, f_classif

# X: feature matrix, y: target vector (assumed already defined)
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)
```

💻 Example

This example shows how to select the top 2 features from the Iris dataset using SelectKBest with the f_classif scoring function.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Select top 2 features
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print('Original shape:', X.shape)
print('Reduced shape:', X_new.shape)
print('Selected feature indices:', selector.get_support(indices=True))
```
Output

```
Original shape: (150, 4)
Reduced shape: (150, 2)
Selected feature indices: [2 3]
```
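After fitting, the selector also exposes the per-feature scores and p-values it computed, which is often more informative than the final mask alone. A minimal sketch, continuing from the Iris example:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(iris.data, iris.target)

# scores_ and pvalues_ hold one ANOVA F-score and p-value per original feature
for name, score, p in zip(iris.feature_names, selector.scores_, selector.pvalues_):
    print(f'{name}: F={score:.1f}, p={p:.2e}')
```

Here the petal measurements score far higher than the sepal ones, which is why indices 2 and 3 survive.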

⚠️ Common Pitfalls

Common mistakes when using SelectKBest include:

  • Not fitting the selector with both features X and target y, which is required for scoring functions like f_classif.
  • Choosing k larger than the number of features, which causes an error.
  • Forgetting to transform the data after fitting, so the feature selection is not applied.
  • Using an incompatible scoring function for the task (e.g., regression vs classification).
```python
from sklearn.feature_selection import SelectKBest, f_classif

# Wrong: forgetting y in fit
# selector = SelectKBest(score_func=f_classif, k=2)
# selector.fit(X)  # f_classif needs the target, so this raises an error

# Right way:
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X, y)
X_new = selector.transform(X)
```
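One way to avoid the "k larger than the number of features" error is to clamp k to the number of available columns before constructing the selector. The `safe_select` helper below is an illustration of this idea, not part of sklearn:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def safe_select(X, y, k):
    # Illustrative helper: clamp k so it never exceeds the feature count
    k = min(k, X.shape[1])
    return SelectKBest(score_func=f_classif, k=k).fit_transform(X, y)

X = np.random.RandomState(0).rand(20, 3)
y = np.array([0, 1] * 10)
X_new = safe_select(X, y, k=10)  # k is clamped from 10 down to 3
```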

📊 Quick Reference

| Parameter | Description |
| --- | --- |
| score_func | Function to score features (e.g., f_classif, chi2) |
| k | Number of top features to select (int or 'all') |
| fit(X, y) | Compute scores using features X and target y |
| transform(X) | Reduce X to the selected features |
| get_support() | Boolean mask of selected features |
| get_support(indices=True) | Indices of selected features |
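The get_support() mask pairs naturally with the dataset's feature names, making it easy to see which columns survived selection. A short sketch using the Iris names:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
selector = SelectKBest(score_func=f_classif, k=2).fit(iris.data, iris.target)

# Boolean mask, one entry per original feature
mask = selector.get_support()
names = [n for n, keep in zip(iris.feature_names, mask) if keep]
print(names)
```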

Key Takeaways

  • SelectKBest selects the top k features according to a scoring function.
  • Always fit SelectKBest with both features and target so the scores are computed correctly.
  • Use transform() after fitting to actually reduce your feature set.
  • Choose a scoring function that matches your task type (classification or regression).
  • Check selected feature indices with get_support() to understand which features remain.