Self-hosted LLMs (Llama, Mistral) in Prompt Engineering / GenAI - Model Metrics & Evaluation

For self-hosted large language models such as Llama and Mistral, key metrics include perplexity and accuracy on downstream tasks. Perplexity measures how well the model predicts the next token; lower values mean the model fits the language better. Accuracy on tasks like question answering or summarization reflects real-world usefulness. Together, these metrics tell us whether the model generates sensible, relevant text and performs well on the specific jobs we care about.
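Perplexity is just the exponential of the average negative log-likelihood per token. A minimal sketch, using hypothetical per-token log-probabilities (the values below are made up for illustration, not from a real model):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(average negative log-likelihood per token).
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical natural-log probabilities for a 4-token sequence.
print(round(perplexity([-1.2, -0.8, -2.0, -1.0]), 2))  # → 3.49
```

A model that assigned probability 1.0 to every token (log-prob 0) would have the minimum possible perplexity of 1.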
For language models, a confusion matrix is less common. Instead, we use perplexity and task-specific accuracy. For example, on a classification task, a confusion matrix might look like this:
|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
From this, we calculate precision, recall, and F1 score to understand model errors.
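These three metrics follow directly from the confusion-matrix counts. A small sketch with hypothetical counts:

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp)
    # Recall: of everything actually positive, how much did we catch?
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 TP, 2 FP, 2 FN.
p, r, f = precision_recall_f1(8, 2, 2)
print(p, r, round(f, 3))
```

With balanced errors like these, precision, recall, and F1 all land at 0.8; F1 diverges from accuracy only when the two error types are unequal.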
When using self-hosted LLMs for tasks like spam detection or content moderation, precision and recall tradeoffs matter:
- High Precision: The model rarely marks good content as spam. Useful when false alarms are costly.
- High Recall: The model catches most spam, even if some good content is flagged. Important when missing spam is risky.
Choosing which to prioritize depends on the use case. For example, in medical text analysis, high recall is critical so that no important information is missed.
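In practice, this tradeoff is often tuned via the classification threshold: raising it makes the model more conservative (higher precision, lower recall). A sketch with hypothetical spam scores and labels (1 = spam):

```python
def precision_recall(scores, labels, threshold):
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and ground-truth labels.
scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,    1,   1,   0,   1,    0,   0,   0]

print(precision_recall(scores, labels, 0.5))   # → (0.8, 1.0)
print(precision_recall(scores, labels, 0.85))  # → (1.0, 0.5)
```

Raising the threshold from 0.5 to 0.85 lifts precision from 0.8 to 1.0 but halves recall: the stricter model flags nothing wrongly, yet misses half the spam.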
Good metrics:
- Low perplexity (e.g., below 20 on a held-out corpus; exact thresholds depend on the tokenizer and domain), indicating strong language modeling.
- High accuracy (above 85%) on specific tasks like classification or summarization.
- Balanced precision and recall (both above 80%) for classification tasks.
Bad metrics:
- High perplexity (above 50), meaning the model struggles to predict text.
- Low accuracy (below 60%) on tasks, showing poor performance.
- Very low recall (below 50%) causing missed important cases.
Common pitfalls:
- Accuracy paradox: High accuracy can be misleading when data is imbalanced (e.g., a model that always predicts the majority class still scores well).
- Data leakage: Using test data during training inflates metrics falsely.
- Overfitting: Model performs well on training but poorly on new data, hiding true performance.
- Ignoring task-specific metrics: Using only perplexity without checking real task results can miss issues.
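The accuracy paradox is easy to demonstrate. Below is a sketch on a hypothetical imbalanced dataset where a trivial "always predict normal" model looks excellent by accuracy alone:

```python
# Hypothetical imbalanced dataset: 990 normal (0) vs 10 spam (1) examples.
labels = [0] * 990 + [1] * 10
preds = [0] * 1000  # a trivial model that always predicts "normal"

accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
recall = tp / (tp + fn)

print(accuracy, recall)  # → 0.99 0.0
```

99% accuracy, yet the model catches zero spam; this is why recall on the minority class must be checked alongside accuracy.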
Your self-hosted LLM has 98% accuracy on a classification task but only 12% recall on the important class. Is it good for production? Why or why not?
Answer: No. With only 12% recall, the model misses almost 90% of the important cases, which can be unacceptable depending on the task. The 98% accuracy is misleading because the data is almost certainly imbalanced: a model can score high accuracy simply by getting the majority class right while ignoring the key class.
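The scenario above can be reproduced with a hypothetical confusion matrix (these counts are chosen for illustration to match the stated 98%/12% figures on a 10,000-example set):

```python
# Hypothetical counts: 100 positives (12 caught), 9,900 negatives.
tp, fn, fp, tn = 12, 88, 112, 9788

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

print(accuracy, recall)  # → 0.98 0.12
```

98% of predictions are correct overall, yet 88 of the 100 important cases slip through: exactly the failure mode the question is probing.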