ML Python · ~20 mins

Pipeline best practices in ML Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why use a pipeline in machine learning?

Which of the following is the main reason to use a pipeline when building a machine learning model?

A. To reduce the number of features in the dataset by default
B. To increase the size of the training dataset automatically
C. To combine data preprocessing and model training steps into one workflow
D. To make the model run faster by skipping data cleaning
💡 Hint

Think about how pipelines help organize multiple steps in a machine learning task.
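As an illustrative sketch (separate from the challenge itself), here is what "combining steps into one workflow" looks like in practice: a single Pipeline object runs scaling and model fitting in sequence, so one fit call and one predict call cover the whole workflow. The toy data below is made up for demonstration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy data for illustration; any numeric features would do.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

# One object chains preprocessing and model training.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])
pipe.fit(X, y)          # runs scaler.fit_transform, then clf.fit
print(pipe.predict(X))  # runs scaler.transform, then clf.predict
```

Because the pipeline owns both steps, there is no way to accidentally train the model on unscaled data or predict with a differently fitted scaler.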

Predict Output
intermediate
Output of pipeline with scaling and logistic regression

What will be the output of the following code snippet?

ML Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([0, 1, 0])

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=42))
])

pipe.fit(X, y)
pred = pipe.predict(np.array([[2, 3]]))
print(pred[0])
A. 1
B. 0
C. IndexError
D. ValueError
💡 Hint

Consider how the logistic regression model predicts based on the scaled input.
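To reason about the hint, it can help to look at the scaling step in isolation (a side sketch, not the challenge code): StandardScaler centers each column to mean 0 and unit variance, and the same fitted transform is then applied to the new point before the classifier sees it.

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column of the training data now has mean 0 and unit variance.
print(X_scaled)

# The query point is scaled with the statistics learned from X.
print(scaler.transform(np.array([[2, 3]])))
```

Inside the pipeline, it is this scaled representation of [[2, 3]] that logistic regression classifies.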

Model Choice
advanced
Choosing the right pipeline step for text data

You want to build a pipeline to classify text messages as spam or not spam. Which step should you add before the classifier to convert text into numbers?

A. CountVectorizer()
B. PCA()
C. StandardScaler()
D. KMeans()
💡 Hint

Think about how to convert text data into a format a model can understand.
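For context, a minimal text-classification pipeline might look like the sketch below. The messages and the choice of MultinomialNB as the classifier are illustrative assumptions, not part of the challenge; the point is that a vectorizer step turns raw strings into a numeric matrix before the classifier.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages; 1 = spam, 0 = not spam.
texts = [
    "win a free prize now",
    "free cash win now",
    "see you at lunch",
    "lunch at noon today",
]
labels = [1, 1, 0, 0]

pipe = Pipeline([
    ('vect', CountVectorizer()),  # text -> word-count matrix
    ('clf', MultinomialNB()),     # classifier trained on the counts
])
pipe.fit(texts, labels)
print(pipe.predict(["free prize now"]))
```

The pipeline accepts raw strings directly because the vectorizer handles the text-to-numbers conversion as its first step.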

Hyperparameter
advanced
Setting hyperparameters in a pipeline

Given a pipeline with a scaler and a random forest classifier named 'clf', how do you set the number of trees (n_estimators) to 100 in the classifier using the pipeline object?

A. pipe.set_params(clf__n_estimators=100)
B. pipe.set_params(n_estimators=100)
C. pipe.clf.n_estimators = 100
D. pipe.set_params(scaler__n_estimators=100)
💡 Hint

Remember how to access parameters of steps inside a pipeline.
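As a reference sketch, scikit-learn's double-underscore convention routes a parameter to a named step: `<step name>__<parameter name>`. The pipeline below mirrors the one described in the question.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier()),
])

# 'clf__n_estimators' targets the n_estimators parameter of the 'clf' step.
pipe.set_params(clf__n_estimators=100)
print(pipe.named_steps['clf'].n_estimators)  # 100
```

The same convention is what makes pipelines work with GridSearchCV, where parameter grids use keys like 'clf__n_estimators'.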

🔧 Debug
expert
Why does this pipeline cause a data leakage problem?

Consider this pipeline code:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

pipe.fit(X_train, y_train)

# Later
X_train_scaled = pipe.named_steps['scaler'].transform(X_train)
X_test_scaled = pipe.named_steps['scaler'].transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

predictions = model.predict(X_test_scaled)

What is the main issue with this approach?

A. The scaler is fit twice, causing data leakage from test data
B. The logistic regression model is trained twice on the same data
C. The scaler is fit only on training data, so no leakage occurs
D. The pipeline is not used for prediction, causing inconsistent preprocessing
💡 Hint

Think about how the pipeline should be used for both training and prediction.
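As a contrast to the snippet above, here is a sketch of the intended usage: the pipeline handles both training and prediction, so the fitted scaler is reused automatically and no second model is trained outside it. The synthetic dataset stands in for the X_train/X_test variables that the question leaves undefined.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic stand-in for the question's X_train/X_test split.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, y_train)          # scaler fit on training data only

# One call applies the already-fitted scaler, then the trained classifier.
predictions = pipe.predict(X_test)
print(predictions.shape)
```

Because fit and predict go through the same pipeline object, preprocessing at prediction time is guaranteed to match preprocessing at training time.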