
scikit-learn Pipeline in ML Python - ML Experiment: Train & Evaluate

Experiment - scikit-learn Pipeline
Problem: You have a dataset with numeric features that need scaling before training a model. Currently, you scale the data manually and then train a logistic regression model as a separate step.
Current Metrics: Training accuracy: 95%, Validation accuracy: 80%
Issue: Because scaling and model training are separate steps, it is easy to apply different preprocessing at prediction time than at training time (for example, forgetting to scale new data, or refitting the scaler on it). In addition, validation accuracy is much lower than training accuracy, which suggests overfitting.
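For reference, the manual workflow described above might look like the following sketch. It assumes the breast cancer dataset and the same split used in the solution; the variable names are illustrative. Note that the fitted scaler has to be carried around and applied by hand every time the model is used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 1: fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Step 2: train the model on the scaled training data
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

# Step 3: remember to reuse the SAME fitted scaler at prediction time
X_val_scaled = scaler.transform(X_val)
manual_val_acc = model.score(X_val_scaled, y_val)

# Easy mistake: passing unscaled data to a model trained on scaled data.
# This runs without any error, but the predictions are unreliable.
unscaled_val_acc = model.score(X_val, y_val)

print(f'Validation accuracy (scaled correctly): {manual_val_acc:.2%}')
print(f'Validation accuracy (scaling forgotten): {unscaled_val_acc:.2%}')
```

Nothing in this version stops step 3 from being skipped or done differently, which is the gap a Pipeline closes.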
Your Task
Use scikit-learn Pipeline to combine scaling and model training into one workflow. Improve validation accuracy to at least 85% while keeping training accuracy below 92% to reduce overfitting.
Must use scikit-learn Pipeline
Keep the same logistic regression model
Do not change the dataset
Solution
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000, random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict and evaluate
train_preds = pipeline.predict(X_train)
val_preds = pipeline.predict(X_val)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Combined scaling and logistic regression into a single Pipeline
Used StandardScaler to scale features automatically during training and prediction
Evaluated model on validation set using the Pipeline to ensure consistent preprocessing
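The solution uses the default regularization strength of LogisticRegression. If a gap between training and validation accuracy remains, one possible way to narrow it while keeping the same model type is to tune the regularization parameter C through the Pipeline. The sketch below assumes that adjusting C is allowed by the "same logistic regression model" constraint; the grid values and the 5-fold setting are illustrative choices, not part of the original exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000, random_state=42)),
])

# Parameters of a Pipeline step are addressed as '<step name>__<parameter>'.
# Smaller C means stronger regularization (illustrative grid).
param_grid = {'logreg__C': [0.01, 0.1, 1.0, 10.0]}

# The scaler is refit inside every cross-validation fold automatically
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

best_C = search.best_params_['logreg__C']
train_acc = search.score(X_train, y_train)
val_acc = search.score(X_val, y_val)

print(f'Best C: {best_C}')
print(f'Training accuracy: {train_acc:.2%}')
print(f'Validation accuracy: {val_acc:.2%}')
```

Because the search wraps the whole Pipeline, the selected C is chosen without the validation folds ever influencing the scaler.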
Results Interpretation

Before Pipeline: Training accuracy: 95%, Validation accuracy: 80%
After Pipeline: Training accuracy: 90.5%, Validation accuracy: 87%

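Using a Pipeline keeps preprocessing and model training together: the scaler is fit only on the training data, and the same fitted transformation is applied automatically at prediction time. This prevents data leakage and inconsistent transformations. The Pipeline does not regularize the model by itself, and the accuracies printed by the solution code depend on the dataset and split, so they may differ from the figures above.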
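The leakage point is easiest to see with cross-validation. In the sketch below, the whole Pipeline is passed to cross_val_score, so the scaler is refit on the training portion of each fold and never sees that fold's held-out data. The 5-fold setting is an assumption chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000, random_state=42)),
])

# Each fold clones the pipeline, fits scaler + model on the training part,
# then scores on the held-out part using that fold's scaler statistics.
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

print('Fold accuracies:', [f'{s:.2%}' for s in scores])
print(f'Mean accuracy: {scores.mean():.2%}')
```

Scaling the full dataset once before cross-validating would instead let statistics from the held-out folds leak into training.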
Bonus Experiment
Add a polynomial feature transformer to the Pipeline to see if it improves validation accuracy further.
💡 Hint
Use sklearn.preprocessing.PolynomialFeatures before scaling in the Pipeline.
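A minimal sketch of this bonus experiment follows. It assumes degree-2 polynomial features, include_bias=False, and a larger max_iter to help convergence on the expanded feature set; all three are illustrative choices rather than part of the original exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

poly_pipeline = Pipeline([
    # Expand features first, then scale the expanded set
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=5000, random_state=42)),
])

poly_pipeline.fit(X_train, y_train)

n_features = poly_pipeline.named_steps['poly'].n_output_features_
poly_train_acc = poly_pipeline.score(X_train, y_train)
poly_val_acc = poly_pipeline.score(X_val, y_val)

print(f'Features after expansion: {n_features}')
print(f'Training accuracy: {poly_train_acc:.2%}')
print(f'Validation accuracy: {poly_val_acc:.2%}')
```

With 30 input features, the degree-2 expansion produces 495 features, so the model has far more capacity to overfit. Compare the training and validation accuracies against the plain Pipeline before concluding that the extra features help.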