When serving a model with a Flask API, the key metrics to watch are latency and accuracy. Latency measures how quickly the API responds to a request, which directly affects user experience. Accuracy tells us whether the model's predictions are correct. Both matter, because a slow prediction and a wrong prediction each hurt users.
Flask API for Model Serving in ML (Python): Model Metrics & Evaluation
Which metrics matter for Flask API model serving, and why
Confusion matrix example for classification model served by Flask API
|                 | Predicted Positive     | Predicted Negative     |
|-----------------|------------------------|------------------------|
| Actual Positive | True Positive (TP): 50 | False Negative (FN): 10 |
| Actual Negative | False Positive (FP): 5 | True Negative (TN): 35 |
Total samples = 50 + 10 + 5 + 35 = 100
From this, we calculate:
- Precision = TP / (TP + FP) = 50 / (50 + 5) = 0.91
- Recall = TP / (TP + FN) = 50 / (50 + 10) = 0.83
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall) ≈ 0.87
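The calculations above can be reproduced directly from the confusion matrix counts; this is a minimal, dependency-free sketch:

```python
# Metric calculations from the confusion matrix above (TP=50, FN=10, FP=5, TN=35).
tp, fn, fp, tn = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + fn + fp + tn)          # 85 / 100 = 0.85
precision = tp / (tp + fp)                          # 50 / 55 ≈ 0.91
recall = tp / (tp + fn)                             # 50 / 60 ≈ 0.83
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.87

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In practice you would collect these counts from logged API predictions and ground-truth labels rather than hard-coding them.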
Precision vs Recall tradeoff with Flask API example
Imagine your Flask API serves a spam detection model:
- High Precision: The API rarely marks good emails as spam. Users trust it because false alarms are low.
- High Recall: The API catches almost all spam emails, but may mark some good emails as spam.
Depending on user needs, you might prefer one over the other. For example, if missing spam is worse, prioritize recall. If annoying users with false spam is worse, prioritize precision.
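The tradeoff usually comes down to the decision threshold applied to the model's score. The sketch below uses a handful of hypothetical spam scores (the data is invented for illustration) to show that raising the threshold favors precision while lowering it favors recall:

```python
# Hypothetical (model_score, is_spam) pairs for illustration only.
samples = [
    (0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1),
    (0.60, 1), (0.55, 0), (0.40, 1), (0.20, 0),
]

def precision_recall(threshold):
    """Compute precision and recall when flagging scores >= threshold as spam."""
    tp = sum(1 for s, y in samples if s >= threshold and y == 1)
    fp = sum(1 for s, y in samples if s >= threshold and y == 0)
    fn = sum(1 for s, y in samples if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A strict threshold rarely flags good email (high precision, low recall);
# a loose threshold catches nearly all spam (high recall, lower precision).
print(precision_recall(0.9))  # → (1.0, 0.4)
print(precision_recall(0.3))  # → (~0.71, 1.0)
```

A Flask API could expose the threshold as a configuration value so it can be tuned for the users' priorities without retraining the model.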
Good vs Bad metric values for Flask API model serving
Good metrics:
- Accuracy above 85% for balanced data
- Precision and recall both above 80%
- API latency under 200 milliseconds
Bad metrics:
- Accuracy near random guess (e.g., 50% for binary)
- Precision very low (e.g., 30%) causing many false positives
- Recall very low (e.g., 20%) missing many true cases
- API latency over 1 second causing slow user experience
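Latency is easy to measure per request with the standard library. This is a minimal sketch in which `predict` is a stand-in placeholder for a real model call and the 200 ms budget comes from the guideline above:

```python
import time

LATENCY_BUDGET_MS = 200  # target from the guidelines above

def predict(features):
    """Stand-in placeholder for a real model's predict call."""
    return sum(features) > 1.0

def timed_predict(features):
    """Run a prediction and report its latency in milliseconds."""
    start = time.perf_counter()
    result = predict(features)
    latency_ms = (time.perf_counter() - start) * 1000
    return result, latency_ms

result, latency_ms = timed_predict([0.4, 0.9])
print(f"prediction={result} latency={latency_ms:.2f} ms "
      f"(budget {LATENCY_BUDGET_MS} ms)")
```

In a real Flask endpoint, the same timing logic can wrap the model call so each response can be logged and compared against the latency budget.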
Common pitfalls in Flask API model serving metrics
- Accuracy paradox: High accuracy but poor recall or precision due to imbalanced data.
- Data leakage: Model trained on data similar to test data inflates metrics but fails in real API use.
- Overfitting: Model performs well on training but poorly on API requests from new data.
- Ignoring latency: A model with great accuracy but slow API response frustrates users.
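The accuracy paradox is easy to demonstrate with a hypothetical imbalanced dataset: a model that never flags fraud still scores 98% accuracy while its recall is zero.

```python
# Hypothetical imbalanced dataset: 980 legitimate cases, 20 fraud cases.
labels = [0] * 980 + [1] * 20

# A "model" that always predicts the majority class (never flags fraud).
predictions = [0] * 1000

tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
correct = sum(1 for y, p in zip(labels, predictions) if y == p)

accuracy = correct / len(labels)  # 0.98 -- looks great
recall = tp / (tp + fn)           # 0.0  -- catches no fraud at all
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

This is why accuracy alone is never enough for an imbalanced problem served behind an API: precision and recall must be monitored alongside it.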
Self-check question
Your Flask API model has 98% accuracy but only 12% recall on fraud detection. Is it good for production? Why or why not?
Answer: No, it is not good. The low recall means the model misses most fraud cases, which is dangerous. Even with high accuracy, missing fraud is costly. You should improve recall before production.
Key Result
For Flask API model serving, balance accuracy and latency, and carefully monitor precision and recall to ensure useful, timely predictions.