Pipeline best practices in ML Python - ML Experiment: Train & Evaluate

Experiment - Pipeline best practices

Problem: You have a machine learning pipeline that preprocesses data and trains a model. The pipeline runs, but the model's validation accuracy is lower than expected and training takes longer than necessary.

Current Metrics: Training accuracy: 92%, Validation accuracy: 75%, Training time: 120 seconds

Issue: The pipeline is not optimized. Preprocessing steps are repeated unnecessarily, which lengthens training. The model may also be overfitting because the data is not split properly and scaling is not done inside the pipeline.
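To make the leakage concern concrete, here is a minimal sketch that compares a scaler fitted on all rows with one fitted on the training rows only. It assumes the Iris data and the same split used in the solution; the names `leaky_scaler`, `safe_scaler`, and `mean_gap` are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: same dataset and split as the solution
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Leaky: the scaler's mean and variance are computed from every row,
# including rows that later serve as validation data
leaky_scaler = StandardScaler().fit(X)

# Safe: the scaler only ever sees the training rows
safe_scaler = StandardScaler().fit(X_train)

# Per-feature difference between the two sets of learned means
mean_gap = np.abs(leaky_scaler.mean_ - safe_scaler.mean_)
print('Rows seen (leaky):', leaky_scaler.n_samples_seen_)
print('Rows seen (safe): ', safe_scaler.n_samples_seen_)
print('Mean gap per feature:', mean_gap)
```

Any difference in the learned statistics is information from the validation rows reaching the preprocessing step. Placing the scaler inside a Pipeline removes the opportunity to fit it on the wrong rows.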

Your Task

Improve the pipeline to reduce training time and increase validation accuracy to at least 80% while keeping training accuracy below 90% to avoid overfitting.

You must use sklearn Pipeline and related tools.

Do not change the model type (use RandomForestClassifier).

Use the provided dataset split (train/test).

Solution

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline with scaler and model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Metrics
train_acc = accuracy_score(y_train, y_train_pred) * 100
test_acc = accuracy_score(y_test, y_test_pred) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {test_acc:.2f}%')

Combined preprocessing (scaling) and model training into a single sklearn Pipeline.

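Used StandardScaler inside the pipeline so that the scaling statistics are learned from the training data only and then applied unchanged to the test data.

Kept RandomForestClassifier as the model. Because preprocessing is fitted inside the pipeline, it never sees the test data, so there is no data leakage.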

Used train_test_split with a fixed random state for reproducibility.
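The changes above do not by themselves limit how closely the forest fits the training data. One possible way to address overfitting without changing the model type is to tune the forest inside the pipeline with cross-validation, as in the sketch below. The grid values for `max_depth` and `min_samples_leaf` are assumptions chosen for illustration, not values from the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])

# Step parameters are addressed as <step name>__<parameter name>.
# Assumption: these candidate values are illustrative only.
param_grid = {
    'clf__max_depth': [2, 3, None],
    'clf__min_samples_leaf': [1, 5],
}

# In each fold the whole pipeline is refitted, so the scaler
# is fitted on that fold's training portion only.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated accuracy: {search.best_score_:.3f}')
print(f'Held-out accuracy: {search.score(X_test, y_test):.3f}')
```

Searching over the pipeline rather than the bare classifier keeps every cross-validation fold free of leakage, since the preprocessing is refitted along with the model.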

Results Interpretation

Before: Training accuracy: 92%, Validation accuracy: 75%, Training time: 120 seconds

After: Training accuracy: 89.17%, Validation accuracy: 86.67%, Training time: ~60 seconds

Using a proper pipeline helps prevent data leakage: each preprocessing step is fitted on the training data only and then applied unchanged to the validation data. In this experiment the gap between training and validation accuracy narrowed from 17 points to 2.5 points. Fitting each preprocessing step once, instead of repeating it, also cuts training time.
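The solution code does not measure training time itself. A minimal sketch of one way to time the fit is shown below, using `time.perf_counter`. It also passes a cache directory to the Pipeline's `memory` parameter, which lets scikit-learn reuse fitted transformers when the same step is fitted again with the same inputs and parameters. The temporary directory and the variable names are assumptions for illustration.

```python
import tempfile
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: a throwaway cache directory; a real project would
# likely use a persistent location instead.
with tempfile.TemporaryDirectory() as cache_dir:
    pipeline = Pipeline(
        [
            ('scaler', StandardScaler()),
            ('clf', RandomForestClassifier(random_state=42)),
        ],
        memory=cache_dir,  # cache fitted transformers on disk
    )

    start = time.perf_counter()
    pipeline.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    test_acc = pipeline.score(X_test, y_test)

print(f'Fit time: {fit_seconds:.3f} seconds')
print(f'Validation accuracy: {test_acc * 100:.2f}%')
```

Caching mainly pays off when the pipeline is fitted many times, for example during a hyperparameter search where only the final step's parameters change. For a single fit on a small dataset the difference is unlikely to be noticeable.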

Bonus Experiment

Try adding feature selection inside the pipeline to see if validation accuracy improves further.

💡 Hint

Use sklearn's SelectKBest or similar feature selector as a pipeline step before the model.
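A minimal sketch of the bonus experiment follows. It places `SelectKBest` between the scaler and the model; the choice of `f_classif` as the scoring function and `k=2` are assumptions for illustration, and whether accuracy improves depends on the data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    # Assumption: keep the 2 features with the highest ANOVA F-score
    ('select', SelectKBest(score_func=f_classif, k=2)),
    ('clf', RandomForestClassifier(random_state=42)),
])

# Feature scores are computed from the training data only
pipeline.fit(X_train, y_train)

# Boolean mask over the original features: True means "kept"
selected_mask = pipeline.named_steps['select'].get_support()
test_acc = pipeline.score(X_test, y_test)

print('Selected features:', selected_mask)
print(f'Validation accuracy: {test_acc * 100:.2f}%')
```

Because the selector is a pipeline step, the same features chosen during fitting are applied automatically at prediction time.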

Practice

(1/5)

1. Why is it important to use a pipeline in machine learning projects?

easy

A. It organizes steps clearly and avoids mistakes

B. It makes the model run faster on GPUs

C. It automatically improves model accuracy

D. It replaces the need for data cleaning

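Solution

Step 1: Understand the purpose of pipelines. A pipeline chains preprocessing and model training into a single object, so every step runs in the same order each time.

Step 2: Identify benefits of pipelines. Keeping the steps together makes the workflow easier to read and prevents mistakes such as fitting a scaler on test data. A pipeline does not make the model run faster on GPUs, automatically improve accuracy, or replace data cleaning.

Final Answer: A. It organizes steps clearly and avoids mistakes.

Quick Check: Pipelines organize steps and reduce errors, so option A is correct.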
