Good documentation helps everyone understand and trust a machine learning model. Documentation quality is not a numeric metric like accuracy, but it can still be judged by how well it explains the model's purpose, data, training process, and evaluation metrics. Clear documentation ensures that metrics such as accuracy, precision, and recall are interpreted correctly and that the model is used appropriately.
# Documentation Best Practices in ML (Python): Model Metrics & Evaluation
Documentation should include clear visualizations like confusion matrices to show model performance. For example, a confusion matrix looks like this:
|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
This helps users understand the model's strengths and its error patterns.
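The four cells of the table can be tallied directly from labels and predictions. A minimal sketch, using made-up labels (1 = positive, 0 = negative) rather than any real model output:

```python
# Hypothetical true labels and model predictions for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Count each cell of the 2x2 confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(f"TP={tp} FN={fn}")  # actual-positive row
print(f"FP={fp} TN={tn}")  # actual-negative row
```

The four counts always sum to the number of examples, which is a useful sanity check to mention in documentation.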
Documentation should explain tradeoffs like precision vs recall with examples. For instance:
- Spam filter: High precision is important to avoid marking good emails as spam.
- Cancer detection: High recall is critical to catch as many cancer cases as possible.
Explaining these tradeoffs helps users choose the right metric for their needs.
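A short numeric sketch makes the tradeoff concrete. The counts below are invented to mimic a conservative model (few false alarms, many misses), not taken from any real system:

```python
# Hypothetical confusion-matrix counts for a conservative classifier.
tp, fp, fn = 80, 5, 40

precision = tp / (tp + fp)  # share of flagged items that are truly positive
recall = tp / (tp + fn)     # share of actual positives the model catches

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here precision is high (≈0.94) while recall is much lower (≈0.67): exactly the profile a spam filter might want and a cancer screen would not.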
Good documentation clearly states what metric values mean. For example:
- Good: Precision = 0.9 means 90% of predicted positives are actually positive.
- Bad: Saying "precision measures how many relevant items were found" (that describes recall).
- Good: Explaining that an AUC of 0.5 means random guessing and 1.0 means a perfect model.
- Bad: Confusing accuracy with recall or precision.
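The AUC interpretation above can be demonstrated with its pairwise-ranking definition: the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count half). The scores and labels below are made up for illustration:

```python
# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.45, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# For every positive/negative pair, score 1 if the positive ranks
# higher, 0.5 on a tie, 0 otherwise; AUC is the average.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(f"AUC = {auc:.3f}")
```

Shuffled scores would drive this toward 0.5, and perfectly separated scores give 1.0, matching the interpretation documentation should spell out.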
Documentation should also warn about common pitfalls that distort metrics:
- Accuracy paradox: High accuracy can be misleading if the data is imbalanced.
- Data leakage: Using future data in training can falsely inflate metrics.
- Overfitting indicators: Very high training accuracy but low test accuracy means the model memorizes training data instead of generalizing.
- Misinterpretation: Confusing precision and recall, or applying the wrong formulas.
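The accuracy paradox is easy to demonstrate. A minimal sketch on a hypothetical imbalanced dataset (990 negatives, 10 positives), where a "model" that always predicts the majority class looks excellent by accuracy alone:

```python
# Hypothetical imbalanced dataset: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # degenerate model: always predict "negative"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = tp / sum(y_true)

print(f"accuracy={accuracy:.2%}, recall={recall:.0%}")
```

The model reaches 99% accuracy while catching zero positives, which is why documentation for imbalanced problems should always report recall (or precision/recall) alongside accuracy.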
Question: Your model has 98% accuracy but 12% recall on fraud cases. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most fraud cases, which is dangerous. High accuracy is misleading because fraud cases are rare. The model needs better recall to catch fraud.
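The answer can be backed with numbers. The counts below are hypothetical, chosen only so that they reproduce the stated metrics (5000 transactions, 100 of them fraud):

```python
# Made-up confusion-matrix counts consistent with 98% accuracy
# and 12% recall on the fraud (positive) class.
tp, fn = 12, 88    # recall = 12/100: most fraud slips through
fp, tn = 12, 4888  # the 4900 legitimate transactions

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.0%}, recall={recall:.0%}")
```

Despite the 98% accuracy, 88 of 100 fraud cases are missed, which is the concrete failure the headline accuracy number hides.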