Which of the following is the main reason to use a pipeline when building a machine learning model?
Think about how pipelines help organize multiple steps in a machine learning task.
Pipelines help combine preprocessing and model training into a single, repeatable workflow. This ensures consistent data handling and easier experimentation.
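As a minimal sketch of this idea (the data and step names here are illustrative, not from the question), a pipeline chains scaling and a classifier so that fitting and predicting both reuse the same preprocessing:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy data: two features, binary labels (illustrative values)
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

# One object holds preprocessing and the model, so fit and predict
# always apply the same scaling learned from the training data.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit(X, y)
print(pipe.predict(np.array([[2.5, 2.5]])))
```

Because the scaler and classifier live in one object, there is no way to accidentally fit the scaler on test data or forget to apply it before prediction.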
What will be the output of the following code snippet?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([0, 1, 0])

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=42))
])
pipe.fit(X, y)
pred = pipe.predict(np.array([[2, 3]]))
print(pred[0])
Consider how the logistic regression model predicts based on the scaled input.
The pipeline scales the input [[2, 3]] using the mean and standard deviation learned from the training data, then predicts with the trained logistic regression model. Because the three training points (labels 0, 1, 0) are not linearly separable and class 0 is the majority, the fitted model predicts class 0 for this input, so the output is 0.
You want to build a pipeline to classify text messages as spam or not spam. Which step should you add before the classifier to convert text into numbers?
Think about how to convert text data into a format a model can understand.
CountVectorizer converts text into a matrix of token counts, which is suitable for feeding into classifiers. StandardScaler and PCA are for numeric data, and KMeans is a clustering algorithm.
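A sketch of such a spam pipeline, with MultinomialNB chosen purely for illustration as the classifier and an invented four-message corpus:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus for illustration only
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# CountVectorizer turns raw strings into token-count vectors
# before they reach the classifier.
pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('clf', MultinomialNB())
])
pipe.fit(texts, labels)
print(pipe.predict(["free prize offer"]))
```

The key point is that the vectorizer, like any other pipeline step, learns its vocabulary from the training texts and applies the same vocabulary at prediction time.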
Given a pipeline with a scaler and a random forest classifier named 'clf', how do you set the number of trees (n_estimators) to 100 in the classifier using the pipeline object?
Remember how to access parameters of steps inside a pipeline.
To set parameters of a step inside a pipeline, use the step name followed by two underscores and the parameter name. Here, 'clf__n_estimators' sets the number of trees in the classifier.
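For instance, assuming the steps are named 'scaler' and 'clf' as in the question, the double-underscore syntax works with set_params (and the same key works in a grid-search parameter dictionary):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])

# '<step name>__<parameter>' routes the value to the named step.
pipe.set_params(clf__n_estimators=100)
print(pipe.named_steps['clf'].n_estimators)  # 100
```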
Consider this pipeline code:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)
# Later
X_train_scaled = pipe.named_steps['scaler'].transform(X_train)
X_test_scaled = pipe.named_steps['scaler'].transform(X_test)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
What is the main issue with this approach?
Think about how the pipeline should be used for both training and prediction.
The pipeline is fit on the training data, but then the scaler is pulled out to transform the data manually and a brand-new logistic regression model is trained outside the pipeline. This duplicates work, discards the classifier already fitted inside the pipeline, and breaks the pipeline's guarantee of consistent preprocessing; if the manual steps ever drift from the pipeline's, it can cause errors or data leakage. The pipeline should be used end-to-end for both training and prediction.
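A minimal corrected version, with hypothetical arrays standing in for the question's X_train, y_train, and X_test, keeps everything inside the pipeline:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical stand-ins for the question's X_train/y_train/X_test
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2.5, 2.5]])

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# Fit once; predict through the same object so the scaler's
# training statistics are reused automatically.
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
print(predictions)
```

Here the scaler's statistics and the fitted classifier stay bound together, so there is no opportunity for the manual-transform mistakes shown in the question.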