Model optimization (quantization, pruning) in PyTorch - Model Metrics & Evaluation

When optimizing models by quantization or pruning, the key metrics to watch are model accuracy and inference latency. Accuracy tells us whether the model still makes good predictions after optimization. Latency shows how fast the model runs, which is often the goal of the optimization. We also check model size to see how much memory is saved. Balancing these metrics keeps the model useful while making it smaller and faster.
Original model confusion matrix:
TP=90 FP=10
FN=5 TN=95
After pruning:
TP=85 FP=15
FN=10 TN=90
Total samples = 200
Precision before pruning = 90 / (90 + 10) = 0.9
Recall before pruning = 90 / (90 + 5) = 0.947
Precision after pruning = 85 / (85 + 15) = 0.85
Recall after pruning = 85 / (85 + 10) = 0.895
This shows a slight drop in precision and recall after pruning, which is common but should be minimal.
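The arithmetic above can be checked with a few lines of Python:

```python
# Precision and recall from the confusion-matrix counts above.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Before pruning: TP=90, FP=10, FN=5
p_before = precision(90, 10)   # 0.9
r_before = recall(90, 5)       # ~0.947

# After pruning: TP=85, FP=15, FN=10
p_after = precision(85, 15)    # 0.85
r_after = recall(85, 10)       # ~0.895

print(f"precision: {p_before:.3f} -> {p_after:.3f}")
print(f"recall:    {r_before:.3f} -> {r_after:.3f}")
```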
Quantization and pruning reduce model size and speed up inference but can lower accuracy. For example:
- Quantization: Converts weights from 32-bit floats to 8-bit integers. This shrinks model size and speeds up calculations but may cause a small accuracy loss.
- Pruning: Removes less important connections. This reduces size and computation but can remove useful information, lowering accuracy.
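A minimal sketch of both techniques in PyTorch, applying magnitude pruning (torch.nn.utils.prune) and dynamic quantization to a toy model. The layer sizes and the 30% pruning amount are placeholder assumptions, not values from the example above:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for the real one (hypothetical sizes).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Pruning: zero out the 30% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Dynamic quantization: store Linear weights as 8-bit integers (qint8).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

After this, evaluate `quantized` on a held-out test set to measure the accuracy drop.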
We must decide how much accuracy loss is acceptable for the gain in speed and size. For mobile apps, smaller and faster models are often worth a small accuracy drop.
Good:
- Accuracy drop < 1-2% after optimization
- Model size reduced by 50% or more
- Inference latency reduced by 30% or more
Bad:
- Accuracy drops more than 5%
- Minimal size or speed improvement
- Model becomes unstable or unpredictable
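One way to verify the size-reduction target above is to compare the serialized checkpoint size before and after optimization. This is a sketch; model_size_mb is a hypothetical helper, not a PyTorch API:

```python
import os
import tempfile
import torch

def model_size_mb(model):
    """Serialized size of a model's state_dict, in megabytes."""
    with tempfile.NamedTemporaryFile() as f:
        torch.save(model.state_dict(), f)
        f.flush()
        return os.path.getsize(f.name) / 1e6
```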
Common pitfalls:
- Ignoring accuracy drop: Focusing only on size/speed can break the model.
- Data leakage: Testing on data seen during training can hide accuracy loss.
- Overfitting to optimization: Tweaking too much on test data can give false confidence.
- Not measuring latency on target device: Speed gains on desktop may not appear on mobile.
Your model after pruning has 98% accuracy but recall on the positive class dropped to 12%. Is it good for production? Why or why not?
Answer: No, it is not good. Even though overall accuracy is high, the very low recall means the model misses most positive cases. For example, if detecting fraud, missing 88% of fraud cases is dangerous. High recall is critical in such tasks.
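Hypothetical class counts (assumed for illustration, not from the question) that reproduce roughly these numbers on an imbalanced dataset show how a high accuracy can coexist with very low recall:

```python
# Imbalanced dataset: 225 positives, 9775 negatives (hypothetical counts).
tp, fn = 27, 198     # recall = 27 / 225 = 12%
tn, fp = 9775, 0     # all negatives correctly rejected
total = tp + fn + tn + fp

accuracy = (tp + tn) / total   # 0.9802
recall = tp / (tp + fn)        # 0.12
print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
```

Because negatives dominate the dataset, the model can ignore almost every positive case and still score near 98% accuracy.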