
Why pipelines ensure reproducibility in ML Python - Experiment to Prove It

Experiment - Why pipelines ensure reproducibility
Problem: You have a machine learning model that works well on your computer, but when you run the same steps on another computer or at a later time, the results are different.
Current Metrics: Model accuracy is 90% on training data and 85% on validation data, but the results vary each time you run the code.
Issue: The process is not reproducible because data preprocessing and model training are done separately and manually, which causes inconsistencies.
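One way such an inconsistency can creep in is sketched below. This is an illustrative assumption about what a manual workflow might look like, not the exact code behind the metrics above: the variable names (`X_test_consistent`, `X_test_refit`, `inputs_match`) are made up for the example. It compares test features scaled with a scaler fitted on the training data against test features scaled with a fresh scaler fitted on the test data itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Intended manual workflow: learn the scaling from the training data only,
# then reuse the same fitted scaler on the test data.
scaler = StandardScaler().fit(X_train)
X_test_consistent = scaler.transform(X_test)

# Easy manual slip (hypothetical): a fresh scaler is fitted on the test data,
# so the test features are scaled with different means and standard deviations.
X_test_refit = StandardScaler().fit_transform(X_test)

# Compare what the model would actually receive in each case.
inputs_match = bool(np.allclose(X_test_consistent, X_test_refit))
print(f"Same model inputs from both workflows: {inputs_match}")
```

Because the two workflows hand different numbers to the model, they can produce different predictions even though the code looks almost identical.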
Your Task
Create a machine learning pipeline that combines data preprocessing and model training steps to ensure the same results every time you run the code.
Use scikit-learn's Pipeline class.
Do not change the dataset or model type.
Keep the random seed fixed for reproducibility.
Solution
ML Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

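# Create pipeline with scaler and logistic regression
# (random_state only affects the 'sag', 'saga' and 'liblinear' solvers; it is harmless with the default)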
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict and evaluate
train_preds = pipeline.predict(X_train)
test_preds = pipeline.predict(X_test)

train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

print(f'Training accuracy: {train_acc:.2f}')
print(f'Test accuracy: {test_acc:.2f}')
Combined data scaling and model training into a single Pipeline object.
Set random_state=42 in train_test_split and LogisticRegression for reproducibility.
Removed manual preprocessing steps outside the pipeline.
Results Interpretation

Before using the pipeline: Training accuracy 90%, test accuracy 85%; results vary on each run because of manual preprocessing.

After using the pipeline: Training accuracy 97%, test accuracy 97%; results are consistent on every run.

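A pipeline bundles all steps into one object, so the scaler is fitted only on the training data and the same transformations are applied, in the same order, at training and prediction time. Combined with fixed random seeds, this prevents accidental changes between runs and makes your work reproducible and reliable.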
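The consistency claim can be checked directly. The sketch below is one possible check, not part of the original solution: `run_experiment` is a hypothetical helper that rebuilds the split and the pipeline from scratch with a fixed seed, and the script compares two independent runs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def run_experiment(seed=42):
    """Build, fit and evaluate the pipeline from scratch (hypothetical helper)."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("logreg", LogisticRegression(random_state=seed)),
    ])
    pipeline.fit(X_train, y_train)
    # Return the test predictions and the test accuracy for comparison.
    return pipeline.predict(X_test), pipeline.score(X_test, y_test)


preds_a, acc_a = run_experiment()
preds_b, acc_b = run_experiment()

# Two independent runs with the same seed should agree exactly.
runs_match = bool(np.array_equal(preds_a, preds_b) and acc_a == acc_b)
print(f"Runs identical: {runs_match}")
```

Wrapping the whole experiment in one function is a design choice: nothing from a previous run (a fitted scaler, a leftover split) can leak into the next one.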
Bonus Experiment
Try adding a polynomial feature transformer inside the pipeline to see if it improves accuracy while keeping reproducibility.
💡 Hint
Use sklearn.preprocessing.PolynomialFeatures as a step before scaling.
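A minimal sketch of the bonus setup follows, assuming degree-2 features with `include_bias=False` and a raised `max_iter` so the solver has room to converge on the larger feature set. These parameter values and the name `poly_pipeline` are illustrative choices, not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Polynomial expansion comes first, then scaling, then the classifier.
poly_pipeline = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # assumed degree
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(random_state=42, max_iter=1000)),  # assumed max_iter
])
poly_pipeline.fit(X_train, y_train)

n_features = poly_pipeline.named_steps["poly"].n_output_features_
test_acc = poly_pipeline.score(X_test, y_test)
print(f"Features after expansion: {n_features}")
print(f"Test accuracy: {test_acc:.2f}")
```

Compare the printed accuracy with the original pipeline's result to judge whether the extra features help on this dataset.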