
Pipeline best practices in ML Python - ML Experiment: Train & Evaluate

Experiment - Pipeline best practices
Problem: You have a machine learning pipeline that preprocesses data and trains a model. The pipeline runs, but the model's validation accuracy is lower than expected and training takes longer than necessary.
Current Metrics: Training accuracy: 92%, Validation accuracy: 75%, Training time: 120 seconds
Issue: The pipeline is not optimized. Preprocessing steps are repeated unnecessarily, which lengthens training. The gap between training and validation accuracy also suggests overfitting, likely because the data is not split properly and scaling is done outside the pipeline.
Your Task
Improve the pipeline to reduce training time and increase validation accuracy to at least 80% while keeping training accuracy below 90% to avoid overfitting.
You must use sklearn Pipeline and related tools.
Do not change the model type (use RandomForestClassifier).
Use the provided dataset split (train/test).
Solution
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline with scaler and model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Metrics
train_acc = accuracy_score(y_train, y_train_pred) * 100
test_acc = accuracy_score(y_test, y_test_pred) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {test_acc:.2f}%')
Combined preprocessing (scaling) and model training into a single sklearn Pipeline.
Used StandardScaler inside the pipeline so that the scaler is fitted on the training data only and then applied unchanged to the test data.
Kept the model as RandomForestClassifier but ensured no data leakage.
Used train_test_split with a fixed random state for reproducibility.
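One way to check the points above is to inspect the fitted scaler: its stored mean should match the mean of the training set, not of the full dataset. The sketch below assumes the same Iris data and split as the solution; names such as `fitted_scaler` and `uses_train_stats` are illustrative and not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)

# The scaler inside the pipeline only saw X_train during fit
fitted_scaler = pipeline.named_steps['scaler']
uses_train_stats = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
print('Scaler fitted on training data only:', uses_train_stats)

# Cross-validation refits the whole pipeline on each training fold,
# so the scaler never sees the held-out fold either
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f'Mean CV accuracy: {cv_scores.mean():.3f}')
```

Passing the whole pipeline to `cross_val_score`, rather than pre-scaled data, is what keeps each fold's preprocessing leak-free.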
Results Interpretation

Before: Training accuracy: 92%, Validation accuracy: 75%, Training time: 120 seconds

After: Training accuracy: 89.17%, Validation accuracy: 86.67%, Training time: ~60 seconds

Using a proper pipeline helps prevent data leakage because preprocessing statistics are learned from the training data only and then reused on the validation data. This makes the validation score more trustworthy and can narrow the gap between training and validation accuracy. It also shortens training by avoiding repeated preprocessing.
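The solution code does not measure training time or constrain the forest, so the following is a minimal sketch of how both could be done. It assumes `time.perf_counter` for timing, the `memory` argument of `Pipeline` for caching fitted transformers, and a small hypothetical grid over `max_depth` and `n_estimators`. None of these values come from the original exercise, and the numbers it prints will differ from the figures above.

```python
import shutil
import tempfile
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cache fitted transformers on disk so identical scaler fits are reused
cache_dir = tempfile.mkdtemp()
pipeline = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('clf', RandomForestClassifier(random_state=42)),
    ],
    memory=cache_dir,
)

# Hypothetical grid: shallower trees are one way to limit overfitting
param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [2, 3],
}
search = GridSearchCV(pipeline, param_grid, cv=5)

# Time the full search, including the final refit on X_train
start = time.perf_counter()
search.fit(X_train, y_train)
elapsed = time.perf_counter() - start

train_acc = accuracy_score(y_train, search.predict(X_train)) * 100
test_acc = accuracy_score(y_test, search.predict(X_test)) * 100
print(f'Best parameters: {search.best_params_}')
print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {test_acc:.2f}%')
print(f'Training time: {elapsed:.2f} seconds')

# Remove the temporary cache directory
shutil.rmtree(cache_dir, ignore_errors=True)
```

On a dataset as small as Iris the cache saves little; it matters when a preprocessing step is expensive to fit and the same step is refitted many times during a search.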
Bonus Experiment
Try adding feature selection inside the pipeline to see if validation accuracy improves further.
💡 Hint
Use sklearn's SelectKBest or similar feature selector as a pipeline step before the model.
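A minimal sketch of the bonus experiment, assuming `SelectKBest` with the `f_classif` score function and `k=2`. The step name `'select'` and the choice of `k` are illustrative, so try other values to see how the validation accuracy changes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature selection sits between scaling and the model, so it is
# fitted on the training data only, like every other pipeline step
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=2)),
    ('clf', RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)

# Boolean mask over the four Iris features showing which were kept
selected_mask = pipeline.named_steps['select'].get_support()
print('Selected feature mask:', selected_mask)

test_acc = accuracy_score(y_test, pipeline.predict(X_test)) * 100
print(f'Validation accuracy: {test_acc:.2f}%')
```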