DistributedDataParallel in PyTorch - Model Metrics & Evaluation

When using DistributedDataParallel (DDP), the main goal is to train a model faster by splitting data across multiple devices. The key metrics to watch are training loss and validation accuracy, which show whether the model learns well and generalizes. Speedup and scaling efficiency also matter: they tell you whether adding devices actually shortens training time without hurting model quality.
DistributedDataParallel itself does not change the confusion matrix of the model predictions. However, here is an example confusion matrix from a classification model trained with DDP:
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP) = 90 | False Negative (FN) = 10 |
| Actual Negative | False Positive (FP) = 5 | True Negative (TN) = 95 |
Totals: TP + FP + TN + FN = 90 + 5 + 95 + 10 = 200 samples.
Precision = 90 / (90 + 5) = 0.947
Recall = 90 / (90 + 10) = 0.9
F1 Score = 2 * (0.947 * 0.9) / (0.947 + 0.9) ≈ 0.923
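The arithmetic above can be checked with a short helper; the function name here is just illustrative:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = confusion_metrics(tp=90, fp=5, fn=10, tn=95)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.947 0.9 0.923
```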
In DDP training, the model's precision and recall are determined by the data and the model, not by DDP itself. DDP's contribution is faster training on more data, which can indirectly improve both.
Example 1: Spam filter - High precision is important to avoid marking good emails as spam. DDP can help train a better model faster.
Example 2: Medical diagnosis - High recall is critical to catch all disease cases. DDP allows training on large datasets to improve recall.
In both cases, DDP's benefit is indirect: faster training on larger datasets leaves more room to tune the precision-recall trade-off.
Good metrics after training with DDP:
- Training loss steadily decreases and matches single-device training loss.
- Validation accuracy is similar to or better than single-device training.
- Speedup close to the number of devices used (e.g., 4 GPUs -> ~4x faster).
- Consistent precision and recall values without degradation.
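The speedup criterion in the checklist above can be quantified as parallel efficiency. A minimal sketch (the function name and timings are hypothetical, not from any PyTorch API):

```python
def scaling_efficiency(t_single, t_multi, n_devices):
    """Speedup and parallel efficiency for a fixed workload.

    t_single: wall-clock time on one device
    t_multi:  wall-clock time on n_devices with DDP
    """
    speedup = t_single / t_multi
    efficiency = speedup / n_devices  # 1.0 means perfect linear scaling
    return speedup, efficiency

# Hypothetical timings: 400 s on 1 GPU vs 110 s on 4 GPUs.
speedup, eff = scaling_efficiency(400.0, 110.0, 4)
print(f"{speedup:.2f}x speedup, {eff:.0%} efficiency")  # 3.64x speedup, 91% efficiency
```

Efficiency well below ~80% usually points at communication overhead or an input-pipeline bottleneck rather than a modeling problem.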
Bad metrics:
- Training loss does not decrease or is unstable compared to single-device.
- Validation accuracy drops significantly, indicating poor model quality.
- Speedup is very low, showing overhead or bottlenecks.
- Precision or recall drops, possibly due to synchronization issues.
Common pitfalls:
- Data leakage: If data is not properly split across devices, the model may see the same samples multiple times, inflating accuracy.
- Overfitting: Faster training can cause overfitting if early stopping or validation checks are ignored.
- Synchronization issues: Incorrect gradient averaging can cause training instability and poor metrics.
- Accuracy paradox: High accuracy may hide poor recall or precision on important classes.
- Unequal batch sizes: If the dataset does not divide evenly across devices, samples may be duplicated or dropped to even things out, which can skew evaluation metrics.
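To illustrate the data-leakage point: PyTorch's `torch.utils.data.DistributedSampler` avoids duplicate samples by giving each rank a disjoint shard of the dataset. Below is a pure-Python sketch of that round-robin sharding (the function name is ours, and this omits the shuffling and padding the real sampler does):

```python
def shard_indices(num_samples, rank, world_size):
    """Give each rank a disjoint shard: every world_size-th index,
    starting at the rank's own offset."""
    return list(range(rank, num_samples, world_size))

world_size = 4
shards = [shard_indices(10, r, world_size) for r in range(world_size)]
print(shards[0])  # [0, 4, 8]

# Shards are disjoint and together cover the whole dataset exactly once.
all_indices = sorted(i for s in shards for i in s)
print(all_indices == list(range(10)))  # True
```

In real DDP code you would pass a `DistributedSampler` to the `DataLoader` and call `sampler.set_epoch(epoch)` each epoch so the shuffle differs across epochs.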
Question: Your model trained with DistributedDataParallel has 98% accuracy but only 12% recall on the fraud class. Is it good for production? Why or why not?
Answer: No, it is not good. Although accuracy is high, the very low recall means the model misses most fraud cases. For fraud detection, recall is critical because missing fraud is costly. The model needs improvement to catch more fraud cases before production use.
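To make the accuracy paradox concrete, here is one hypothetical set of counts (not from the question itself) that produces exactly these numbers: 10,000 transactions of which only 200 are fraud.

```python
# Hypothetical counts: 10,000 transactions, 200 of them fraud (positive class).
tp, fn = 24, 176   # 24 frauds caught, 176 missed
fp, tn = 24, 9776  # 24 false alarms on the 9,800 legitimate transactions

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.98 0.12
```

Because the fraud class is only 2% of the data, a model can miss 88% of fraud cases and still report 98% accuracy, which is why recall on the minority class must be checked separately.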