When saving machine learning pipelines using joblib or pickle, the key metric is model integrity. This means the saved pipeline should load back exactly as it was, preserving all steps and parameters so predictions remain the same. We check this by comparing predictions before saving and after loading. Accuracy or other performance metrics should not change. This ensures the pipeline is saved correctly and can be reused without errors.
Saving pipelines (joblib, pickle) in ML Python - Model Metrics & Evaluation
Since saving pipelines is about preserving model behavior, we verify by comparing predictions before and after saving. For example, if the model predicts labels for 10 samples, the confusion matrix before saving and after loading should be identical.
Before saving predictions: [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
After loading predictions: [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
Confusion matrix (same for both):
+-----+-----+
| TP | FP |
+-----+-----+
| FN | TN |
+-----+-----+
TP = 5, FP = 0, FN = 0, TN = 5 (assuming every prediction matches its true label)
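The before/after comparison described above can be sketched in a few lines. This is a minimal illustration, not a definitive test procedure: the synthetic dataset, the variable names, and the temporary file name pipe.joblib are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# Predictions from the in-memory pipeline, before saving.
preds_before = pipe.predict(X_test)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "pipe.joblib"  # hypothetical file name
    joblib.dump(pipe, path)
    loaded_pipe = joblib.load(path)

# Predictions from the reloaded pipeline.
preds_after = loaded_pipe.predict(X_test)

same_predictions = np.array_equal(preds_before, preds_after)
same_confusion = np.array_equal(
    confusion_matrix(y_test, preds_before),
    confusion_matrix(y_test, preds_after),
)
print("Predictions identical:", same_predictions)
print("Confusion matrices identical:", same_confusion)
```

Note that scikit-learn's confusion_matrix puts true labels on the rows and predicted labels on the columns, so for binary labels 0 and 1 its layout is [[TN, FP], [FN, TP]], which differs from the table drawn above.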
Saving pipelines does not directly affect precision or recall. However, if the pipeline is corrupted during saving or loading, predictions may change, causing precision and recall to drop. For example, if a spam filter pipeline is saved incorrectly, it might mark good emails as spam (lower precision) or miss spam emails (lower recall). Thus, ensuring pipeline integrity preserves the original precision and recall.
Good: Predictions before saving and after loading are exactly the same. Accuracy, precision, recall, and F1 score remain unchanged. This means the pipeline was saved and loaded correctly.
Bad: Predictions differ after loading. Metrics drop significantly. This indicates the pipeline was corrupted or not saved properly, making it unreliable for future use.
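- Corrupted save/load: Loading a pipeline in an environment whose scikit-learn, joblib, or Python version differs from the one used to save it can raise errors or silently change behavior.
- Data leakage: If a data-dependent step (such as a scaler) was fit on the full dataset, including the test set, the saved pipeline keeps that leakage, so test metrics look better than performance on genuinely new data.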
- Overfitting: Saving a pipeline that overfits training data will preserve that behavior; metrics may look good on training but fail on new data.
- Accuracy paradox: High accuracy after loading does not guarantee pipeline integrity if the test set is unbalanced or small.
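One way to guard against the version pitfall listed above is to store the library version next to the pipeline and compare it at load time. The sketch below is only one possible convention: the dictionary keys, the helper name load_checked, and the file name bundle.joblib are hypothetical and not part of joblib or scikit-learn.

```python
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit([[0, 0], [1, 1]], [0, 1])

# Hypothetical bundle layout: the key names are this sketch's own convention.
bundle = {"pipeline": pipe, "sklearn_version": sklearn.__version__}


def load_checked(path):
    """Load a bundle and refuse to return it if the scikit-learn version differs."""
    loaded = joblib.load(path)
    saved_version = loaded["sklearn_version"]
    if saved_version != sklearn.__version__:
        raise RuntimeError(
            f"Pipeline was saved with scikit-learn {saved_version}, "
            f"but {sklearn.__version__} is installed."
        )
    return loaded["pipeline"]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "bundle.joblib"  # hypothetical file name
    joblib.dump(bundle, path)
    restored = load_checked(path)

print(restored.predict([[2, 2]]))
```

The check runs only after the file has been unpickled, so it catches mismatches that load without error but does not prevent a failure during loading itself.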
Your model pipeline was saved with joblib. After loading, the accuracy on the test set is 98%, but recall on the positive class dropped from 90% to 12%. Is the saved pipeline good for production? Why or why not?
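Answer: No, the saved pipeline is not good. Accuracy can stay at 98% when the positive class is rare, but the drop in recall from 90% to 12% means the model misses most positive cases after loading. A correctly saved pipeline reproduces its original predictions, so this change suggests the pipeline was corrupted or not saved properly. You must fix the saving/loading process to preserve model performance.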
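The arithmetic behind this answer can be illustrated with made-up labels. The numbers below are assumptions chosen for the example (1000 samples, 20 of them positive) and are not taken from a real pipeline; they show how accuracy can stay near 98% while recall collapses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up ground truth: 20 positives followed by 980 negatives.
y_true = np.array([1] * 20 + [0] * 980)

# Hypothetical predictions from a broken loaded pipeline:
# it finds only 2 of the 20 positives and raises no false alarms.
y_pred_loaded = np.array([1] * 2 + [0] * 18 + [0] * 980)

accuracy = accuracy_score(y_true, y_pred_loaded)  # (2 + 980) / 1000
recall = recall_score(y_true, y_pred_loaded)      # 2 / 20

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

Because the 980 negatives dominate the total, the 18 missed positives barely move accuracy, which is why recall (or a direct prediction comparison) is the more telling check here.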
Practice
joblib or pickle?
Solution
Step 1: Understand what saving a pipeline means
Saving a pipeline stores the trained model and preprocessing steps so you don't have to train again.
Step 2: Identify the main benefit
This allows you to reuse the pipeline later for predictions without retraining, saving time and effort.
Final Answer:
To reuse the trained model and preprocessing steps without retraining -> Option C
Quick Check:
Saving pipeline = reuse trained model [OK]
- Thinking saving speeds up training
- Confusing saving with visualization
- Assuming saving tunes hyperparameters
pipe to a file called model.pkl using joblib?
Solution
Step 1: Recall the correct joblib function for saving
The function to save an object with joblib is dump(), not save, write, or store.
Step 2: Match the syntax
The correct syntax is joblib.dump(pipe, 'model.pkl') to save the pipeline to a file.
Final Answer:
joblib.dump(pipe, 'model.pkl') -> Option A
Quick Check:
Save with joblib.dump() [OK]
- Using joblib.save() which does not exist
- Confusing dump() with write() or store()
- Incorrect argument order
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit([[0, 0], [1, 1]], [0, 1])
joblib.dump(pipe, 'pipe.pkl')
loaded_pipe = joblib.load('pipe.pkl')
pred = loaded_pipe.predict([[2, 2]])
print(pred)
Solution
Step 1: Understand the pipeline training
The pipeline is trained on two points: [0,0] labeled 0 and [1,1] labeled 1, so it learns to classify higher values as 1.
Step 2: Predict using loaded pipeline
After saving and loading, the pipeline predicts on [2,2], which is closer to class 1, so prediction is [1].
Final Answer:
[1] -> Option B
Quick Check:
Loaded pipeline predicts class 1 for [2,2] [OK]
- Expecting error due to file handling
- Confusing prediction output format
- Assuming prediction is [0]
loaded_pipe = joblib.load('pipeline.pkl') but got a FileNotFoundError. What is the most likely cause?
Solution
Step 1: Understand FileNotFoundError meaning
This error means the file specified does not exist at the given path.
Step 2: Identify the most common cause
Usually, the file is missing or the path is wrong, so the file pipeline.pkl is not found in the current directory.
Final Answer:
The file pipeline.pkl does not exist in the current directory -> Option A
Quick Check:
FileNotFoundError = missing file [OK]
- Assuming pipeline not trained causes this error
- Thinking joblib.load syntax is wrong
- Assuming file corruption without checking file presence
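A small guard can make this failure easier to diagnose by reporting the resolved path and the working directory before calling joblib.load. This is a sketch under stated assumptions: the helper name load_pipeline is hypothetical, and the temporary directory exists only to guarantee that the file is missing in the demonstration.

```python
import tempfile
from pathlib import Path

import joblib


def load_pipeline(path):
    """Load a saved pipeline, with a clearer error if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"No saved pipeline at {path.resolve()} "
            f"(current working directory: {Path.cwd()})"
        )
    return joblib.load(path)


error_message = None
with tempfile.TemporaryDirectory() as tmp_dir:
    # Nothing has been saved here, so the file is guaranteed to be missing.
    missing_path = Path(tmp_dir) / "pipeline.pkl"
    try:
        load_pipeline(missing_path)
    except FileNotFoundError as exc:
        error_message = str(exc)
        print(error_message)
```

Printing the working directory helps because a relative name such as 'pipeline.pkl' is resolved against wherever the script was launched, which is often not the folder containing the script.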
[[5, 5]]?
Solution
Step 1: Check saving syntax correctness
The following option uses joblib.dump() correctly to save the pipeline, and joblib.load() to load it:
    import joblib
    joblib.dump(pipeline, 'model.joblib')
    loaded = joblib.load('model.joblib')
    pred = loaded.predict([[5, 5]])
    print(pred)
Step 2: Verify prediction step
After loading, it calls predict on new data correctly and prints the result.
Step 3: Identify errors in other options
This option wrongly uses pickle.load to save:
    import pickle
    pickle.load(pipeline, 'model.pkl')
    loaded = pickle.load('model.pkl')
    pred = loaded.predict([[5, 5]])
    print(pred)
This option uses the non-existent joblib.save:
    import joblib
    joblib.save(pipeline, 'model.pkl')
    loaded = joblib.load('model.pkl')
    pred = loaded.predict([[5, 5]])
    print(pred)
This option incorrectly uses pickle.dump and pickle.load (both require file objects from open() with 'wb'/'rb' modes):
    import pickle
    pickle.dump(pipeline, 'model.pkl')
    loaded = pickle.load('model.pkl')
    pred = loaded.predict([[5, 5]])
    print(pred)
Final Answer:
    import joblib
    joblib.dump(pipeline, 'model.joblib')
    loaded = joblib.load('model.joblib')
    pred = loaded.predict([[5, 5]])
    print(pred)
-> Option D
Quick Check:
Use joblib.dump/load with correct syntax [OK]
- Using joblib.save() which does not exist
- Confusing pickle.load() for saving
- Not opening file when using pickle.load()
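For comparison, a working pickle version of the same round trip might look like the sketch below. The tiny training set mirrors the earlier two-point example, and the temporary directory and variable names are assumptions made so the snippet runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipeline.fit([[0, 0], [1, 1]], [0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.pkl"

    # pickle.dump needs a file object opened in binary write mode.
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)

    # pickle.load needs a file object opened in binary read mode.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

pred = loaded.predict([[5, 5]])
print(pred)
```

Unlike joblib.dump and joblib.load, which accept a file name directly, the pickle functions operate on an already opened file, which is the detail the incorrect options miss.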
