Pipeline best practices in ML Python - ML Experiment: Train & Evaluate

Experiment - Pipeline best practices
Problem: You have a machine learning pipeline that preprocesses data and trains a model. The pipeline runs, but the model's validation accuracy is lower than expected and training takes longer than necessary.
Current Metrics: Training accuracy: 92%, Validation accuracy: 75%, Training time: 120 seconds
Issue: The pipeline is not optimized. Preprocessing steps are repeated unnecessarily, which lengthens training. The model may also be overfitting because data splitting and scaling are not handled properly inside the pipeline.
Your Task
Improve the pipeline to reduce training time and increase validation accuracy to at least 80% while keeping training accuracy below 90% to avoid overfitting.
You must use sklearn Pipeline and related tools.
Do not change the model type (use RandomForestClassifier).
Use the provided dataset split (train/test).
Solution
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline with scaler and model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Metrics
train_acc = accuracy_score(y_train, y_train_pred) * 100
test_acc = accuracy_score(y_test, y_test_pred) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {test_acc:.2f}%')
Combined preprocessing (scaling) and model training into a single sklearn Pipeline.
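Placed StandardScaler inside the pipeline so that its scaling statistics are learned from the training data only and then reused on the test data.
Kept RandomForestClassifier as the model and avoided data leakage by fitting all preprocessing through the pipeline.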
Used train_test_split with a fixed random state for reproducibility.
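One way to check the "training data only" claim is to inspect the fitted scaler. The sketch below repeats the solution's setup and compares the scaler's learned mean with the mean of the training split; the variable names `scaler_mean`, `matches_train`, and `matches_full` are illustrative and not part of the original solution.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)

# The scaler was fit inside pipeline.fit, so it only ever saw X_train.
scaler_mean = pipeline.named_steps["scaler"].mean_
matches_train = np.allclose(scaler_mean, X_train.mean(axis=0))
matches_full = np.allclose(scaler_mean, X.mean(axis=0))

print("matches training-split mean:", matches_train)
print("matches full-dataset mean:", matches_full)
```

If the scaler had instead been fit on the full dataset before splitting, statistics from the test rows would have influenced training, which is the leakage the pipeline is meant to avoid.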
Results Interpretation

Before: Training accuracy: 92%, Validation accuracy: 75%, Training time: 120 seconds

After: Training accuracy: 89.17%, Validation accuracy: 86.67%, Training time: ~60 seconds

Using a proper pipeline helps prevent data leakage: preprocessing is fit on the training data only and then applied unchanged to the validation data. This makes the validation accuracy a more reliable estimate of how well the model generalizes and helps expose overfitting. It can also shorten training time by avoiding repeated preprocessing.
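Repeated preprocessing usually shows up when the same pipeline is refit many times, for example during a hyperparameter search. As a hedged sketch, the `memory` argument of `Pipeline` can cache fitted transformers so they are reused when only the final estimator's parameters change. The parameter grid, the cache directory, and the name `cached_pipeline` are assumptions made for illustration; on a dataset as small as iris the saving is negligible.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

cache_dir = mkdtemp()  # temporary cache location (illustrative)
cached_pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=42)),
    ],
    memory=cache_dir,  # cache fitted transformers on disk
)

# Only classifier parameters vary, so within each cross-validation fold
# the fitted scaler can be loaded from the cache instead of being refit.
param_grid = {"clf__n_estimators": [10, 50], "clf__max_depth": [2, 4]}
search = GridSearchCV(cached_pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
test_score = search.score(X_test, y_test)
print(best_params, f"{test_score:.2f}")

rmtree(cache_dir)  # remove the temporary cache
```

Caching pays off mainly when the transformers are expensive to fit; limiting `max_depth` in the grid is also one way to work toward the task's training-accuracy ceiling.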
Bonus Experiment
Try adding feature selection inside the pipeline to see if validation accuracy improves further.
💡 Hint
Use sklearn's SelectKBest or similar feature selector as a pipeline step before the model.
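A minimal sketch of the bonus experiment, assuming `SelectKBest` with the `f_classif` score function and `k=2`; both are illustrative choices, not values given in the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=2)),  # k=2 is an assumption
    ("clf", RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)

# Boolean mask over the original four features: True means "kept".
selected_mask = pipeline.named_steps["select"].get_support()
test_acc = pipeline.score(X_test, y_test)

print("selected features:", selected_mask)
print(f"Validation accuracy: {test_acc * 100:.2f}%")
```

Because the selector sits inside the pipeline, its feature scores are computed from the training split only, just like the scaler's statistics.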

Practice

1. Why is it important to use a pipeline in machine learning projects?
easy
A. It organizes steps clearly and avoids mistakes
B. It makes the model run faster on GPUs
C. It automatically improves model accuracy
D. It replaces the need for data cleaning

Solution

  1. Step 1: Understand the purpose of pipelines

    Pipelines help organize the sequence of data processing and modeling steps clearly.
  2. Step 2: Identify benefits of pipelines

    They reduce human errors and make the process repeatable and easy to follow.
  3. Final Answer:

    It organizes steps clearly and avoids mistakes -> Option A
  4. Quick Check:

    Pipeline purpose = Organize steps [OK]
Hint: Pipelines keep steps tidy and error-free [OK]
Common Mistakes:
  • Thinking pipelines speed up model training
  • Believing pipelines improve accuracy automatically
  • Assuming pipelines replace data cleaning
2. Which of the following is the correct way to create a simple pipeline in scikit-learn?
easy
A. Pipeline('scale', StandardScaler(), 'model', LogisticRegression())
B. Pipeline({'scale': StandardScaler(), 'model': LogisticRegression()})
C. Pipeline([('scale', StandardScaler()), ('model', LogisticRegression())])
D. Pipeline(scale=StandardScaler(), model=LogisticRegression())

Solution

  1. Step 1: Recall scikit-learn pipeline syntax

    It requires a list of tuples with step name and transformer/model.
  2. Step 2: Match syntax to options

    Only Pipeline([('scale', StandardScaler()), ('model', LogisticRegression())]) uses a list of tuples correctly.
  3. Final Answer:

    Pipeline([('scale', StandardScaler()), ('model', LogisticRegression())]) -> Option C
  4. Quick Check:

    Pipeline syntax = list of tuples [OK]
Hint: Use list of (name, step) tuples for pipelines [OK]
Common Mistakes:
  • Using dictionary instead of list of tuples
  • Passing keyword arguments instead of list
  • Passing separate arguments without list
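For comparison, scikit-learn also provides `make_pipeline`, which takes the steps as positional arguments and generates the step names itself. The sketch below builds the same two-step pipeline both ways; the variable names are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit form: a list of (name, step) tuples.
explicit = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Shorthand: names are derived from the lowercased class names.
auto_named = make_pipeline(StandardScaler(), LogisticRegression())

explicit_names = [name for name, _ in explicit.steps]
auto_names = [name for name, _ in auto_named.steps]

print(explicit_names)  # ['scale', 'model']
print(auto_names)      # ['standardscaler', 'logisticregression']
```

The explicit form is usually easier to work with when step names are referenced later, for example in `named_steps` or in a parameter grid.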
3. Given the code below, what will be the output of print(pipe.named_steps['model'].coef_) after fitting?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
  ('scale', StandardScaler()),
  ('model', LogisticRegression())
])

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 1, 1]
pipe.fit(X, y)
print(pipe.named_steps['model'].coef_)
medium
A. A 2D array with coefficients for each feature
B. An error because 'coef_' is not available
C. A list of predicted labels
D. A scalar value representing accuracy

Solution

  1. Step 1: Understand pipeline fitting

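    The pipeline fits the scaler, transforms the data, and then fits logistic regression on the scaled data.
  2. Step 2: Access model coefficients

    After fitting, LogisticRegression has the attribute 'coef_', a 2D array of feature weights. For this binary problem with two features, its shape is (1, 2).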
  3. Final Answer:

    A 2D array with coefficients for each feature -> Option A
  4. Quick Check:

    Model coef_ = 2D array [OK]
Hint: Model coef_ holds feature weights after fit [OK]
Common Mistakes:
  • Expecting coef_ before fitting
  • Confusing coef_ with predictions
  • Trying to access coef_ on pipeline instead of model
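Running the question's code confirms the shape of the coefficient array. This sketch uses the same data and step names as the question; only the variable `coef` is added.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 1, 1]
pipe.fit(X, y)

# coef_ lives on the fitted LogisticRegression step, not on the pipeline.
coef = pipe.named_steps["model"].coef_
print(coef.shape)  # (1, 2): one row (binary problem), one column per feature
```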
4. What is wrong with this pipeline code snippet?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
  ('scale', StandardScaler()),
  ('model', LogisticRegression())
])

pipe.fit(X, y)
pipe.predict(X_test)

Assuming X, y, and X_test are defined correctly.
medium
A. The pipeline is missing a call to transform before predict
B. The pipeline steps are not in a list
C. The pipeline is missing a final estimator
D. Nothing is wrong; code runs fine

Solution

  1. Step 1: Check pipeline construction

    Pipeline steps are correctly given as a list of tuples with scaler and model.
  2. Step 2: Verify usage of fit and predict

    Calling fit and then predict on the pipeline is correct. During predict, the pipeline applies the fitted scaler's transform to X_test and then passes the result to the model.
  3. Final Answer:

    Nothing is wrong; code runs fine -> Option D
  4. Quick Check:

    Pipeline fit/predict usage = correct [OK]
Hint: Pipeline handles transform internally during predict [OK]
Common Mistakes:
  • Thinking transform must be called separately
  • Passing steps as dict instead of list
  • Missing final estimator in pipeline
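To illustrate what the pipeline does internally during predict, the sketch below compares `pipe.predict` with the equivalent manual steps. The arrays `X`, `y`, and `X_test` are made-up values, since the question leaves them undefined.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the question's X, y, and X_test.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])
X_test = np.array([[1.5, 2.5], [3.5, 4.5]])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
pipeline_pred = pipe.predict(X_test)

# Manual equivalent: transform with the fitted scaler, then predict.
scaler = pipe.named_steps["scale"]
model = pipe.named_steps["model"]
manual_pred = model.predict(scaler.transform(X_test))

same = np.array_equal(pipeline_pred, manual_pred)
print(same)  # True
```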
5. You want to build a pipeline that scales data, selects the top 3 features, and then fits a logistic regression model. Which pipeline setup is best practice?
hard
A. Pipeline([('model', LogisticRegression()), ('scale', StandardScaler()), ('select', SelectKBest(k=3))])
B. Pipeline([('scale', StandardScaler()), ('select', SelectKBest(k=3)), ('model', LogisticRegression())])
C. Pipeline([('select', SelectKBest(k=3)), ('scale', StandardScaler()), ('model', LogisticRegression())])
D. Pipeline([('scale', StandardScaler()), ('model', LogisticRegression()), ('select', SelectKBest(k=3))])

Solution

  1. Step 1: Determine correct order of steps

    Scaling first is the conventional order: it puts all features on a common scale, so a scale-sensitive scoring function in the selector is not dominated by feature units.
  2. Step 2: Place model last in pipeline

    The model must be the final step to fit on selected features.
  3. Final Answer:

    Pipeline([('scale', StandardScaler()), ('select', SelectKBest(k=3)), ('model', LogisticRegression())]) -> Option B
  4. Quick Check:

    Order: scale -> select -> model [OK]
Hint: Scale first, then select features, then model [OK]
Common Mistakes:
  • Selecting features before scaling
  • Putting model before preprocessing steps
  • Mixing order of pipeline steps
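The recommended order can be tried on a small dataset. This sketch assumes the iris data (four features, so `k=3` drops one) and sets `max_iter=1000` on the model as a precaution; neither choice comes from the question. It also shows that placing the model before a later step fails at fit time.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Recommended order: scale -> select -> model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(k=3)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

n_selected = int(pipe.named_steps["select"].get_support().sum())
n_model_inputs = pipe.named_steps["model"].coef_.shape[1]
print(n_selected, n_model_inputs)  # 3 3

# Intermediate steps must be transformers (fit and transform), so a
# model placed in the middle of the pipeline is rejected.
try:
    bad = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
        ("select", SelectKBest(k=3)),
    ])
    bad.fit(X_train, y_train)
    misordered_fails = False
except TypeError:
    misordered_fails = True
print("misordered pipeline rejected:", misordered_fails)
```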