For reproducibility, the key metric is consistency of results across runs. This means the model's predictions, training loss, and accuracy should be nearly the same every time you run the pipeline. Pipelines help by fixing the order of steps and using the same data processing and model settings, so metrics do not change unexpectedly.
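The idea above can be sketched in a few lines of plain Python. This is a toy stand-in for a real pipeline (the function name `run_pipeline` and the "loss" computation are illustrative, not a real training loop): because every source of randomness flows from one fixed seed, two runs produce byte-identical metrics.

```python
import random

def run_pipeline(seed: int) -> float:
    """Toy training run: with a fixed seed, every run yields the same 'loss'."""
    rng = random.Random(seed)                      # all randomness comes from this seed
    data = [rng.gauss(0.0, 1.0) for _ in range(100)]
    loss = sum(x * x for x in data) / len(data)    # stand-in for a training loss
    return loss

# Two runs with the same seed produce identical metrics.
assert run_pipeline(42) == run_pipeline(42)
```

In a real pipeline the same principle applies: fix the seed for data shuffling, train/test splitting, and weight initialization, and the downstream metrics stop drifting between runs.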
Why Pipelines Ensure Reproducibility in ML (Python): Why Metrics Matter
Run 1 Confusion Matrix:
TP=85 FP=15
FN=10 TN=90
Run 2 Confusion Matrix:
TP=85 FP=15
FN=10 TN=90
Consistent confusion matrices show reproducibility.
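The precision and recall behind those matrices can be computed directly from the four cells. The helper name `precision_recall` is just for illustration; the formulas are the standard ones.

```python
def precision_recall(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of everything flagged positive, how much was right
    recall = tp / (tp + fn)      # of all true positives, how many were found
    return precision, recall

# Both runs above have identical counts, so the metrics match exactly.
run1 = precision_recall(85, 15, 10, 90)
run2 = precision_recall(85, 15, 10, 90)
assert run1 == run2   # precision 0.85, recall ≈ 0.895
```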
Pipelines ensure the same data processing and model training steps run in the same order, so precision and recall stay stable. For example, if a spam filter pipeline always cleans data the same way and trains the same model, precision (the fraction of flagged messages that are actually spam) and recall (the fraction of all spam that gets flagged) won't jump around between runs. Without a pipeline, small changes in preprocessing or training order can cause large swings in these metrics.
Good: Metrics like accuracy, precision, recall, and loss are nearly identical across multiple runs (e.g., accuracy 90% ± 0.5%). This means the pipeline is reproducible.
Bad: Metrics vary widely between runs (e.g., accuracy 90% in one run, 75% in another). This shows the process is not reproducible, possibly due to random steps or inconsistent data handling.
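A simple check for the good/bad distinction above: collect the metric from several runs and verify all values fall within the expected tolerance. The function name `is_reproducible` and the ±0.005 default are illustrative choices matching the "90% ± 0.5%" example, not a standard API.

```python
def is_reproducible(accuracies: list[float], tolerance: float = 0.005) -> bool:
    """True if every run's accuracy lies within ±tolerance of the mean."""
    mean = sum(accuracies) / len(accuracies)
    return all(abs(a - mean) <= tolerance for a in accuracies)

assert is_reproducible([0.901, 0.899, 0.900])   # good: accuracy 90% ± 0.5%
assert not is_reproducible([0.90, 0.75])        # bad: 15-point swing between runs
```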
- Ignoring randomness: Unfixed random seeds make metrics vary between runs, masking whether the pipeline itself is reproducible.
- Data leakage: If pipelines do not separate training and test data properly, metrics look better than they should and do not reflect real-world performance.
- Overfitting: Pipelines that do not include validation steps can produce misleadingly high metrics that don't generalize.
- Accuracy paradox: High accuracy may hide poor performance on important classes if data is imbalanced.
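The accuracy paradox in the last bullet is easy to demonstrate with made-up numbers (the counts below are hypothetical): on an imbalanced dataset, a model that always predicts the majority class scores high accuracy while catching zero minority-class cases.

```python
# Hypothetical imbalanced dataset: 1000 cases, only 50 positive.
total, positives = 1000, 50

# A "model" that predicts negative for everything:
tn = total - positives   # all negatives correct
tp = 0                   # every positive missed

accuracy = (tp + tn) / total   # 0.95 — looks strong
recall = tp / positives        # 0.0  — useless on the class that matters
```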
No, it is not good for fraud detection. The high accuracy likely comes from the many non-fraud cases being classified correctly. But the very low recall means the model misses most fraud cases, which is dangerous in this domain. A reproducible pipeline helps you surface such issues consistently across runs so they can be diagnosed and fixed.
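To make the answer concrete, here is a hypothetical fraud-detection confusion matrix (all counts invented for illustration) where accuracy is above 99% yet recall is only 5%:

```python
# Hypothetical: 10,000 transactions, 100 fraudulent.
# The model catches only 5 frauds, misses 95, and never flags a legit one.
tp, fn = 5, 95
tn, fp = 9900, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.9905 — looks excellent
recall = tp / (tp + fn)                       # 0.05   — misses 95% of fraud
```

The near-perfect accuracy is driven entirely by the 9,900 correctly-classified legitimate transactions; recall is the metric that exposes the failure.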