When you deploy a machine learning model through an API, the key metrics to watch are latency and throughput. Latency tells you how fast the model responds to a request, which is important for user experience. Throughput shows how many requests the API can handle per second, which matters for scaling. Besides these, traditional model metrics like accuracy, precision, and recall remain important to ensure the model predictions are good. But for deployment, speed and reliability are just as critical.
Metrics & Evaluation - API-based deployment
Which metric matters for API-based deployment and WHY
Confusion matrix example for model quality
|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
Example:
TP = 80, FP = 20, FN = 10, TN = 90
Total samples = 200
From this, you calculate:
- Precision = TP / (TP + FP) = 80 / (80 + 20) = 0.8
- Recall = TP / (TP + FN) = 80 / (80 + 10) = 0.89
- Accuracy = (TP + TN) / Total = (80 + 90) / 200 = 0.85
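The calculations above can be sketched in a few lines of Python, using the counts from the example:

```python
# Confusion-matrix counts from the example above
tp, fp, fn, tn = 80, 20, 10, 90
total = tp + fp + fn + tn  # 200

precision = tp / (tp + fp)    # 80 / 100 = 0.8
recall = tp / (tp + fn)       # 80 / 90 ≈ 0.89
accuracy = (tp + tn) / total  # 170 / 200 = 0.85

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```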
Precision vs Recall tradeoff with API deployment examples
Imagine your API is for spam detection:
- High precision means the API rarely marks good emails as spam. This avoids annoying users.
- High recall means the API catches most spam emails, even if some good emails get flagged.
For API deployment, model quality and speed interact: a highly precise model that responds slowly still frustrates users waiting for an answer, while a fast API that misses most spam (low recall) fails its purpose. So you balance model quality metrics with API speed.
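The precision/recall tradeoff often comes down to the decision threshold applied to the model's score. A minimal sketch with made-up spam scores (the labels and score values are toy data, not from any real model):

```python
# Toy spam scores (1 = spam, 0 = ham); values are made up for illustration.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]

def precision_recall(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A low threshold catches more spam (higher recall) but flags more good mail;
# a high threshold rarely flags good mail (higher precision) but misses spam.
print(precision_recall(0.5))  # (0.75, 0.75)
print(precision_recall(0.9))  # (1.0, 0.25)
```

Raising the threshold from 0.5 to 0.9 pushes precision to 1.0 but cuts recall to 0.25, which is the tradeoff described above.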
What "good" vs "bad" metric values look like for API-based deployment
Good:
- Latency under 200 milliseconds per request
- Throughput of hundreds of requests per second
- Precision and recall above 0.8 for the main task
- Consistent response times without spikes
Bad:
- Latency over 1 second causing user wait
- Throughput too low to handle peak traffic
- Precision or recall below 0.5, meaning many wrong predictions
- Unstable API causing errors or timeouts
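One way to check targets like these in practice is to record per-request timings and look at percentiles rather than the average, so spikes show up. A minimal sketch, where `predict` is a hypothetical stand-in for the real model call:

```python
import time
import statistics

def predict(payload):
    # Hypothetical stand-in for the real model/API call.
    time.sleep(0.01)  # simulate ~10 ms of model work
    return {"label": "ok"}

latencies_ms = []
start = time.perf_counter()
for i in range(50):
    t0 = time.perf_counter()
    predict({"id": i})
    latencies_ms.append((time.perf_counter() - t0) * 1000)
elapsed = time.perf_counter() - start

p50 = statistics.median(latencies_ms)
p95 = statistics.quantiles(latencies_ms, n=20)[-1]  # 95th percentile
throughput = 50 / elapsed  # requests per second from a single client

print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  throughput={throughput:.0f} req/s")
print("latency OK" if p95 < 200 else "latency too high")
```

Checking the 95th percentile against the 200 ms target catches the "consistent response times without spikes" criterion that an average alone would hide.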
Common pitfalls in metrics for API-based deployment
- Ignoring latency: A model with great accuracy but slow API response frustrates users.
- Data leakage: Training data leaking into test data inflates accuracy but fails in real API use.
- Overfitting: High training accuracy but poor API predictions on new data.
- Not monitoring API errors: Model might be good but API crashes or timeouts ruin experience.
- Using accuracy alone: For imbalanced data, accuracy can be misleading; precision and recall matter more.
Self-check: Your model has 98% accuracy but 12% recall on fraud. Is it good?
No, it is not good for fraud detection. The high accuracy likely comes from many normal cases correctly predicted. But the very low recall means the model misses 88% of fraud cases, which is dangerous. For fraud, catching as many frauds as possible (high recall) is critical, even if some false alarms happen. So this model needs improvement before deployment.
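You can reproduce how imbalance produces this pattern with made-up counts chosen to match the self-check numbers (5000 transactions, 100 actual frauds — illustrative values, not real data):

```python
# Made-up fraud-detection counts that reproduce the self-check numbers:
# 5000 transactions, of which 100 are actual frauds.
tp, fn = 12, 88      # model catches only 12 of the 100 frauds
tn, fp = 4888, 12    # almost every normal transaction is classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 4900 / 5000 = 0.98
recall = tp / (tp + fn)                     # 12 / 100 = 0.12

print(f"accuracy={accuracy:.0%} recall={recall:.0%}")  # accuracy=98% recall=12%
```

Because 98% of the data is normal traffic, the model earns high accuracy almost entirely from true negatives while missing 88 of 100 frauds.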
Key Result
In API-based deployment, balancing model quality (precision, recall) with API performance (latency, throughput) is key for success.