
scikit-learn Pipeline in ML Python

Introduction
A pipeline chains multiple data processing and model training steps into a single workflow. It keeps your code cleaner and less error-prone.
When you want to apply the same data cleaning and model training steps every time.
When you want to avoid mistakes by automating the order of steps.
When you want to test different models or settings easily.
When you want to prepare your data and train your model in one go.
When you want to save your whole process and reuse it later.
Syntax
ML Python
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('step_name1', transformer_or_estimator1),
    ('step_name2', transformer_or_estimator2),
    # ...
])
Each step is a tuple with a name and a transformer or estimator object.
The last step is usually the model (an estimator); the earlier steps are transformers that change the data.
Examples
This pipeline first scales the data, then trains a logistic regression model.
ML Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression())
])
This pipeline fills missing values with the mean, then trains a decision tree.
ML Python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('model', DecisionTreeClassifier())
])
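As a quick sketch of how this pipeline is used, here it is fit on a small made-up array that contains missing values (the data and `random_state` below are just for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data with missing values (NaN)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, np.nan]])
y = np.array([0, 0, 1, 1])

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('model', DecisionTreeClassifier(random_state=0))
])

# One fit() call: the imputer fills the NaNs, then the tree is trained
pipeline.fit(X, y)

# predict() runs the same imputation on new data before the tree sees it
print(pipeline.predict([[np.nan, 2.5]]))
```

Note that you never call the imputer yourself; the pipeline applies it automatically in both `fit` and `predict`.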
Sample Model
This program loads the iris flower data, splits it, scales features, trains a logistic regression model, and prints the accuracy on test data.
ML Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
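Because the whole pipeline behaves like a single estimator, it also plugs straight into tools such as GridSearchCV. Parameters of a step are addressed as 'step_name__parameter'. Here is a minimal sketch reusing the pipeline above (the grid of C values is just an illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# 'logreg__C' means: the C parameter of the step named 'logreg'
param_grid = {'logreg__C': [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)  # the scaler is refit inside each CV fold

print(search.best_params_)
```

A nice side effect: because scaling happens inside the pipeline, each cross-validation fold fits the scaler only on that fold's training portion.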

Important Notes
You can use pipelines to avoid data leakage by fitting transformers only on training data.
Pipelines make it easy to try different models by swapping the last step.
You can save and load pipelines with joblib to reuse your whole process.
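Saving and reloading the whole process could look like the sketch below (joblib is installed alongside scikit-learn; the filename here is arbitrary):

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=200))
])
pipeline.fit(X, y)

# Persist the fitted pipeline (preprocessing + model) to disk...
joblib.dump(pipeline, 'iris_pipeline.joblib')

# ...and load it back later to predict without retraining
restored = joblib.load('iris_pipeline.joblib')
print(restored.predict(X[:3]))
```

Because the scaler's learned statistics are saved along with the model, the restored pipeline makes the same predictions as the original.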
Summary
A scikit-learn Pipeline chains data steps and model training into one object.
It helps keep your code clean and safe from mistakes.
You can fit, predict, and evaluate your model with just the pipeline.