
Batch vs real-time inference in NLP - Metrics Comparison

Metrics & Evaluation - Batch vs real-time inference
Which metric matters for Batch vs Real-time inference and WHY

For batch inference, throughput is the most important metric because you process many inputs at once and want the whole job to finish quickly; the latency of any single prediction matters less as long as the batch completes within its deadline.

For real-time inference, latency is the most important metric because users expect fast responses.

Accuracy, precision, and recall still matter for model quality, but performance metrics like latency and throughput decide if the system meets user needs.
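The difference between the two performance metrics can be shown with a minimal timing sketch. The `predict` function below is a hypothetical stand-in for a real NLP model, assumed here for illustration only:

```python
import time

def predict(text):
    """Hypothetical stand-in for a real NLP model (illustration only)."""
    return len(text) % 2  # dummy binary label

texts = ["sample input text"] * 1000

# Batch inference: process all inputs at once, measure throughput.
start = time.perf_counter()
preds = [predict(t) for t in texts]
elapsed = time.perf_counter() - start
throughput = len(texts) / elapsed  # samples per second

# Real-time inference: measure the latency of a single request.
start = time.perf_counter()
predict("a single user request")
latency_ms = (time.perf_counter() - start) * 1000
```

Throughput answers "how many samples per second can the batch job sustain?", while latency answers "how long does one user wait?" — the two numbers can move independently, which is why each mode needs its own metric.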

Confusion matrix example (for classification quality)
      |                 | Predicted Positive  | Predicted Negative  |
      |-----------------|---------------------|---------------------|
      | Actual Positive | True Positive (TP)  | False Negative (FN) |
      | Actual Negative | False Positive (FP) | True Negative (TN)  |

      Example numbers for 1000 samples:
      TP=400, FP=100, FN=50, TN=450

      Total = TP + FP + FN + TN = 1000
    
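The standard quality metrics follow directly from these four counts. A short sketch using the example numbers above:

```python
# Example counts from the confusion matrix above (1000 samples).
TP, FP, FN, TN = 400, 100, 50, 450

accuracy  = (TP + TN) / (TP + FP + FN + TN)   # 0.85
precision = TP / (TP + FP)                    # 0.80
recall    = TP / (TP + FN)                    # ~0.889
f1        = 2 * precision * recall / (precision + recall)
```

Note that with these numbers precision (0.80) and recall (~0.889) differ noticeably even though accuracy looks reasonable, which is why all three are worth checking.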

This confusion matrix is the same for batch or real-time inference since it measures model correctness.

Tradeoff: Precision vs Recall in Batch vs Real-time

In batch inference, you can afford to tune for higher precision because you have time to review or reprocess results.

In real-time inference, you might prioritize recall to catch as many important cases as possible quickly, even if some false alarms happen.

Example: A spam filter in real-time should catch most spam (high recall) to protect users immediately.
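The precision/recall tradeoff is usually controlled by the decision threshold: lowering it catches more true positives (higher recall) at the cost of more false alarms. A minimal sketch with made-up scores and labels (all values here are illustration data, not from the text):

```python
# Hypothetical model scores and ground-truth labels (1 = actually spam).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    1,    0,    1,    0]

def recall_at(threshold):
    """Recall when predicting 'spam' for every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp / (tp + fn)
```

Here `recall_at(0.5)` gives 0.75 (the spam item scored 0.30 is missed), while lowering the threshold to `recall_at(0.25)` gives 1.0 but also flags the non-spam item scored 0.40.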

What "good" vs "bad" metric values look like

Batch inference: Good throughput (e.g., 1000 samples/sec), acceptable latency (e.g., minutes), and high accuracy (e.g., 95%).

Real-time inference: Low latency (e.g., under 100 ms per request), stable throughput (e.g., 10 requests/sec), and high recall (e.g., 90%) for critical cases.

Bad values mean slow responses (high latency) in real-time systems, low throughput in batch jobs, or poor model quality (low accuracy or recall) in either mode.

Common pitfalls in metrics for Batch vs Real-time inference
  • Ignoring latency in real-time systems leads to poor user experience.
  • Measuring only accuracy without considering latency or throughput.
  • Data leakage causing inflated accuracy in batch evaluation but poor real-time results.
  • Overfitting to batch data that does not represent real-time input distribution.
Self-check question

Your real-time model has 98% accuracy but only 12% recall on fraud cases. Is it good for production? Why or why not?

Answer: No, it is not good. Even though accuracy is high, the model misses most fraud cases (low recall). In fraud detection, missing fraud is very costly, so recall is more important.
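To see how 98% accuracy can coexist with 12% recall, consider one set of hypothetical counts consistent with those numbers (these counts are assumptions for illustration, not from the question):

```python
# Hypothetical: 10,000 transactions, 200 of them fraudulent.
TP, FN = 24, 176     # the model catches only 24 of 200 fraud cases
FP, TN = 24, 9776    # among 9,800 legitimate transactions

accuracy = (TP + TN) / (TP + FN + FP + TN)  # 0.98
recall   = TP / (TP + FN)                   # 0.12
```

Because fraud is rare, the model can label almost everything "not fraud" and still score high accuracy; recall exposes the failure.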

Key Result
Latency is key for real-time inference; throughput and accuracy matter more for batch inference.