Bird
Raised Fist0
ML Pythonml~20 mins

scikit-learn Pipeline in ML Python - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Experiment - scikit-learn Pipeline
Problem:You have a dataset with numeric features that need scaling before training a model. Currently, you manually scale the data and then train a logistic regression model separately.
Current Metrics:Training accuracy: 95%, Validation accuracy: 80%
Issue:The manual scaling and model training are done separately, which can cause errors and inconsistent preprocessing during prediction. Also, the validation accuracy is much lower than training, indicating possible overfitting.
Your Task
Use scikit-learn Pipeline to combine scaling and model training into one workflow. Improve validation accuracy to at least 85% while keeping training accuracy below 92% to reduce overfitting.
Must use scikit-learn Pipeline
Keep the same logistic regression model
Do not change the dataset
Hint 1
Hint 2
Hint 3
Solution
ML Python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000, random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict and evaluate
train_preds = pipeline.predict(X_train)
val_preds = pipeline.predict(X_val)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Combined scaling and logistic regression into a single Pipeline
Used StandardScaler to scale features automatically during training and prediction
Evaluated model on validation set using the Pipeline to ensure consistent preprocessing
Results Interpretation

Before Pipeline: Training accuracy: 95%, Validation accuracy: 80%
After Pipeline: Training accuracy: 90.5%, Validation accuracy: 87%

Using a Pipeline helps keep preprocessing and model training together, reducing errors and improving validation performance by preventing data leakage and ensuring consistent transformations.
Bonus Experiment
Add a polynomial feature transformer to the Pipeline to see if it improves validation accuracy further.
💡 Hint
Use sklearn.preprocessing.PolynomialFeatures before scaling in the Pipeline.

Practice

(1/5)
1. What is the main purpose of using a Pipeline in scikit-learn?
easy
A. To manually split data into training and testing sets
B. To chain preprocessing steps and model training into one object
C. To visualize the data distribution
D. To increase the size of the dataset

Solution

  1. Step 1: Understand what a Pipeline does

    A Pipeline in scikit-learn combines multiple steps like data preprocessing and model training into a single object.
  2. Step 2: Identify the main purpose

    This chaining helps keep code clean and allows fitting and predicting in one call.
  3. Final Answer:

    To chain preprocessing steps and model training into one object -> Option B
  4. Quick Check:

    Pipeline = chaining steps [OK]
Hint: Pipeline chains steps for clean, safe model building [OK]
Common Mistakes:
  • Thinking Pipeline is for data visualization
  • Confusing Pipeline with data splitting
  • Assuming Pipeline increases data size
2. Which of the following is the correct way to create a scikit-learn Pipeline with a scaler and a logistic regression model?
easy
A. Pipeline(('scaler', StandardScaler()), ('model', LogisticRegression()))
B. Pipeline({'scaler': StandardScaler(), 'model': LogisticRegression()})
C. Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
D. Pipeline(['scaler': StandardScaler(), 'model': LogisticRegression()])

Solution

  1. Step 1: Recall Pipeline syntax

    A Pipeline requires a list of tuples, each tuple with a name and a transformer or estimator.
  2. Step 2: Check each option

    Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())]) uses a list of tuples correctly. Options B and D use dictionary syntax which is invalid. Pipeline(('scaler', StandardScaler()), ('model', LogisticRegression())) uses tuples but not inside a list.
  3. Final Answer:

    Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())]) -> Option C
  4. Quick Check:

    Pipeline needs list of (name, step) tuples [OK]
Hint: Use list of (name, step) tuples to build Pipeline [OK]
Common Mistakes:
  • Using dictionary instead of list of tuples
  • Passing tuples without list
  • Using incorrect brackets or colons
3. Given the code below, what will print(y_pred) output?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

X_train = np.array([[1, 2], [2, 3], [3, 4]])
y_train = np.array([0, 1, 0])
X_test = np.array([[1, 2], [4, 5]])

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(y_pred)
medium
A. [1 0]
B. [1 1]
C. [0 0]
D. [0 1]

Solution

  1. Step 1: Understand the pipeline steps

    The pipeline first scales the data, then fits LogisticRegression on training data.
  2. Step 2: Predict on test data

    After scaling, the model predicts labels for X_test. Given training labels, the model likely predicts 0 for [1,2] and 1 for [4,5].
  3. Final Answer:

    [0 1] -> Option D
  4. Quick Check:

    Scaled data + logistic regression predicts [0 1] [OK]
Hint: Pipeline applies all steps in order before predict [OK]
Common Mistakes:
  • Ignoring scaling effect on prediction
  • Assuming model predicts all zeros
  • Confusing training and test labels
4. What is wrong with the following Pipeline code?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
medium
A. StandardScaler is not instantiated with parentheses
B. LogisticRegression should be imported from sklearn.svm
C. Pipeline requires a dictionary, not a list
D. fit method is missing required parameters

Solution

  1. Step 1: Check each pipeline step

    StandardScaler is passed without parentheses, so it is the class, not an instance.
  2. Step 2: Understand Pipeline requirements

    Pipeline steps must be instances, so StandardScaler() is needed. LogisticRegression() is correct.
  3. Final Answer:

    StandardScaler is not instantiated with parentheses -> Option A
  4. Quick Check:

    Instantiate transformers with () [OK]
Hint: Always instantiate transformers with parentheses in Pipeline [OK]
Common Mistakes:
  • Passing classes instead of instances
  • Wrong import for LogisticRegression
  • Using dict instead of list for Pipeline steps
5. You want to build a Pipeline that first fills missing values with the mean, then scales features, and finally trains a RandomForestClassifier. Which of the following Pipeline definitions is correct?
hard
A. Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('model', RandomForestClassifier())])
B. Pipeline([('scaler', StandardScaler()), ('imputer', SimpleImputer(strategy='mean')), ('model', RandomForestClassifier())])
C. Pipeline([('model', RandomForestClassifier()), ('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])
D. Pipeline([('imputer', SimpleImputer(strategy='mean')), ('model', RandomForestClassifier()), ('scaler', StandardScaler())])

Solution

  1. Step 1: Determine correct order of steps

    Missing values must be filled first, then scaling, then model training.
  2. Step 2: Check each option's order

    Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('model', RandomForestClassifier())]) follows the correct order: imputer, scaler, model. Others have wrong order.
  3. Final Answer:

    Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('model', RandomForestClassifier())]) -> Option A
  4. Quick Check:

    Impute -> scale -> model [OK]
Hint: Impute missing -> scale features -> train model [OK]
Common Mistakes:
  • Scaling before imputing missing values
  • Placing model before preprocessing steps
  • Incorrect step order causing errors