REST API inference in PyTorch - Model Metrics & Evaluation

When using a REST API to serve predictions from a machine learning model, the key metrics to watch are latency and accuracy. Latency tells us how fast the API responds, which matters for user experience. Accuracy tells us how often the model's predictions are correct. For classification tasks, precision, recall, and F1-score describe prediction quality; for regression, mean squared error or mean absolute error are the usual choices. We want a balance: fast responses and good prediction quality.
Imagine a REST API that classifies emails as spam or not spam. Here is a confusion matrix from 100 requests:
|                    | Predicted Spam            | Predicted Not Spam        |
|--------------------|---------------------------|---------------------------|
| Actually Spam      | True Positives (TP) = 40  | False Negatives (FN) = 5  |
| Actually Not Spam  | False Positives (FP) = 10 | True Negatives (TN) = 45  |
Total requests = 40 + 10 + 5 + 45 = 100
From this, we calculate:
- Precision = TP / (TP + FP) = 40 / (40 + 10) = 0.8
- Recall = TP / (TP + FN) = 40 / (40 + 5) = 0.8889
- F1-score = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.842
- Accuracy = (TP + TN) / Total = (40 + 45) / 100 = 0.85
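The calculations above can be checked with a few lines of plain Python; the counts come straight from the confusion matrix:

```python
# Recompute the metrics from the confusion-matrix counts above.
tp, fp, fn, tn = 40, 10, 5, 45

total = tp + fp + fn + tn            # 100 requests
precision = tp / (tp + fp)           # 40 / 50 = 0.8
recall = tp / (tp + fn)              # 40 / 45 ≈ 0.8889
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.8421
accuracy = (tp + tn) / total         # 85 / 100 = 0.85

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```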
In REST API inference, sometimes we want to avoid false alarms (minimize false positives), and sometimes we want to catch every true case (minimize false negatives).
- High Precision Example: A spam filter API where marking a legitimate email as spam is costly. We want high precision to avoid false positives.
- High Recall Example: A medical diagnosis API where missing a disease is dangerous. We want high recall to catch all true cases.
Choosing the right metric depends on the API's purpose and user impact.
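In practice this trade-off is often controlled by the decision threshold applied to the model's score. Below is an illustrative sketch with made-up scores and labels (not from the confusion matrix above): a strict threshold favors precision, a lenient one favors recall.

```python
# Sketch: how the decision threshold trades precision against recall.
# `scores` and `labels` are made-up example data (1 = spam).
def precision_recall(scores, labels, threshold):
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,    0,   1,   1,   0,   0,   0]

# Strict threshold: no false alarms, but half the spam slips through.
print(precision_recall(scores, labels, 0.8))   # (1.0, 0.5)
# Lenient threshold: all spam caught, at the cost of a false alarm.
print(precision_recall(scores, labels, 0.35))  # (0.8, 1.0)
```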
For a REST API serving predictions:
- Good: Accuracy above 90%, precision and recall balanced above 85%, latency under 200ms.
- Bad: Accuracy below 70%, precision or recall below 50%, latency over 1 second, which makes the API feel sluggish.
Good metrics mean users get fast and reliable predictions. Bad metrics mean slow or wrong results, hurting trust.
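Latency is easy to measure on the serving side. A minimal sketch, where the hypothetical `predict` function stands in for the real model call behind the endpoint; in production you would time the actual request handler the same way:

```python
import time

def predict(email_text):
    """Stand-in for the real model call behind the endpoint (hypothetical)."""
    time.sleep(0.01)          # simulate ~10 ms of model inference
    return "not spam"

start = time.perf_counter()
label = predict("Limited offer, click now!")
latency_ms = (time.perf_counter() - start) * 1000

print(f"prediction={label!r} latency={latency_ms:.1f} ms")
```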
- Accuracy paradox: High accuracy can be misleading when classes are imbalanced. If 95% of the data belongs to one class, a model that always predicts that class scores 95% accuracy while never detecting the other class.
- Data leakage: If test data leaks into training, API predictions look too good but fail in real use.
- Overfitting: Model performs well on training but poorly on new API requests.
- Ignoring latency: A very accurate model that is too slow makes the API unusable.
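The accuracy paradox from the list above can be demonstrated in a few lines: a model that always predicts the majority class looks accurate but has zero recall.

```python
# Accuracy-paradox demo: always predict the majority class ("not spam")
# on data that is 95% majority class.
labels = [0] * 95 + [1] * 5          # 1 = spam, only 5% of traffic
preds = [0] * 100                    # model ignores the minority class

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p and y for p, y in zip(preds, labels))
fn = sum((not p) and y for p, y in zip(preds, labels))
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # 0.95 and 0.00
```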
Your REST API model has 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most fraud cases, which is dangerous. Even with high accuracy, the model fails to catch fraud, so it should be improved before production.
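To see how 98% accuracy and 12% recall can coexist, here is one hypothetical set of counts consistent with those numbers (10,000 transactions, 2% of them fraudulent); the exact figures are illustrative, not from the question:

```python
# Hypothetical counts consistent with 98% accuracy and 12% recall.
tp, fn = 24, 176      # 200 fraud cases: only 24 caught
fp, tn = 24, 9776     # 9,800 legitimate transactions

total = tp + fp + fn + tn
accuracy = (tp + tn) / total       # (24 + 9776) / 10000 = 0.98
recall = tp / (tp + fn)            # 24 / 200 = 0.12

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
print(f"fraud cases missed: {fn} of {tp + fn}")
```

The legitimate transactions dominate the total, so the model can miss 176 of 200 fraud cases and still report 98% accuracy.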