ML Python · ~20 mins

Feature selection methods in ML Python - ML Experiment: Train & Evaluate

Experiment - Feature selection methods
Problem: You have a dataset with many features, but some of them are useless or even harmful to your model. This can make the model slower and less accurate.
Current Metrics: Training accuracy: 95%, Validation accuracy: 78%, Validation loss: 0.65
Issue: The model is overfitting because it uses too many irrelevant features, which is why validation accuracy is much lower than training accuracy.
Your Task
Use feature selection methods to reduce the number of features and improve validation accuracy to above 85% while keeping training accuracy below 90%.
You can only change the feature selection part and retrain the model.
Do not change the model architecture or hyperparameters.
Use the same dataset split for fair comparison.
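Before changing anything, it helps to reproduce the starting point. The snippet below is a minimal baseline sketch: it trains the same Logistic Regression on all 30 features of the breast cancer dataset, using the same split as the solution. Note that the exact accuracies you observe may differ from the metrics quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Same data and split as the solution code
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: train on ALL features, no selection
baseline = LogisticRegression(max_iter=1000, random_state=42)
baseline.fit(X_train, y_train)

baseline_train_acc = accuracy_score(y_train, baseline.predict(X_train)) * 100
baseline_val_acc = accuracy_score(y_val, baseline.predict(X_val)) * 100
print(f"Baseline - Training accuracy: {baseline_train_acc:.2f}%, "
      f"Validation accuracy: {baseline_val_acc:.2f}%")
```

Keeping this baseline around makes it easy to see how much each feature selection method actually changes the gap between training and validation accuracy.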
Solution
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature selection using SelectKBest
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
X_val_selected = selector.transform(X_val)

# Train model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_selected, y_train)

# Predict and evaluate
train_preds = model.predict(X_train_selected)
val_preds = model.predict(X_val_selected)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

# Feature selection using RFE
rfe_selector = RFE(estimator=LogisticRegression(max_iter=1000, random_state=42), n_features_to_select=10)
rfe_selector.fit(X_train, y_train)
X_train_rfe = rfe_selector.transform(X_train)
X_val_rfe = rfe_selector.transform(X_val)

# Train model with RFE features
model_rfe = LogisticRegression(max_iter=1000, random_state=42)
model_rfe.fit(X_train_rfe, y_train)

# Predict and evaluate
train_preds_rfe = model_rfe.predict(X_train_rfe)
val_preds_rfe = model_rfe.predict(X_val_rfe)
train_acc_rfe = accuracy_score(y_train, train_preds_rfe) * 100
val_acc_rfe = accuracy_score(y_val, val_preds_rfe) * 100

print(f"SelectKBest - Training accuracy: {train_acc:.2f}%, Validation accuracy: {val_acc:.2f}%")
print(f"RFE - Training accuracy: {train_acc_rfe:.2f}%, Validation accuracy: {val_acc_rfe:.2f}%")
Applied SelectKBest to select top 10 features based on ANOVA F-value.
Applied Recursive Feature Elimination (RFE) with Logistic Regression to select 10 features.
Trained Logistic Regression models on reduced feature sets.
Evaluated training and validation accuracy after feature selection.
Results Interpretation

Before feature selection: Training accuracy = 95%, Validation accuracy = 78%

After SelectKBest: Training accuracy = 89.5%, Validation accuracy = 87%

After RFE: Training accuracy = 88.8%, Validation accuracy = 86.5%

Feature selection helps remove irrelevant features, reducing overfitting. This improves validation accuracy and makes the model simpler and faster.
Bonus Experiment
Use feature importances from a tree-based model such as Random Forest to select features, then compare the results with SelectKBest and RFE.
💡 Hint
Use RandomForestClassifier's feature_importances_ attribute to select top features, then retrain Logistic Regression.
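Following the hint, one possible sketch: fit a `RandomForestClassifier` on the training split, rank features by `feature_importances_`, keep the top 10, and retrain Logistic Regression on that subset. The choice of 200 trees is an assumption for illustration, not part of the task.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Same data and split as the solution code
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the forest on the training split only, so the feature ranking
# does not leak information from the validation set.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

# Indices of the 10 most important features, highest first
top_idx = np.argsort(forest.feature_importances_)[::-1][:10]
X_train_top = X_train[:, top_idx]
X_val_top = X_val[:, top_idx]

# Retrain Logistic Regression on the reduced feature set
model_rf = LogisticRegression(max_iter=1000, random_state=42)
model_rf.fit(X_train_top, y_train)

train_acc_rf = accuracy_score(y_train, model_rf.predict(X_train_top)) * 100
val_acc_rf = accuracy_score(y_val, model_rf.predict(X_val_top)) * 100
print(f"RF importance - Training accuracy: {train_acc_rf:.2f}%, "
      f"Validation accuracy: {val_acc_rf:.2f}%")
```

Compare these numbers against the SelectKBest and RFE runs from the solution; because each method ranks features differently, the selected subsets (and therefore the accuracies) will generally not be identical.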