Chat completions endpoint in Prompt Engineering / GenAI - Model Metrics & Evaluation

For chat completions, the key quality metrics are response relevance and coherence. These are commonly approximated with perplexity and with BLEU or ROUGE scores, which measure how well the model predicts, or overlaps with, expected reference responses. User-experience metrics such as engagement rate and response time also matter, so the chat feels natural and fast.
Chat completions don't use a classic confusion matrix because outputs are text, not simple classes. Instead, evaluation uses metrics like:
Perplexity = exp(-(1/N) * sum_i log P(word_i)), the exponentiated average negative log-probability the model assigns to each token
BLEU = clipped n-gram precision of the generated text against reference text (with a brevity penalty)
ROUGE = recall of overlapping n-grams or subsequences between generated and reference text
These measure how well the model predicts or matches expected responses.
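For intuition, here is a toy sketch of all three metrics using whitespace tokenization. Real evaluations use tokenizer-aware libraries (e.g. sacrebleu, rouge-score); the function names and inputs below are illustrative assumptions, not a standard API.

```python
import math
from collections import Counter

def perplexity(log_probs):
    """Exponentiated negative mean log-probability over the tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))

def bleu1_precision(candidate, reference):
    """Unigram precision: fraction of candidate tokens found in the reference (clipped counts)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def rouge1_recall(candidate, reference):
    """Unigram recall: fraction of reference tokens recovered by the candidate."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(round(perplexity([-0.1, -0.2, -0.05]), 3))           # near 1.0 = very confident model
print(bleu1_precision("the cat sat", "the cat sat down"))  # 1.0: every generated token matches
print(rouge1_recall("the cat sat", "the cat sat down"))    # 0.75: one reference token missed
```

Note how the same candidate/reference pair scores differently on precision (BLEU) versus recall (ROUGE): a short, fully correct answer maximizes the former while penalizing the latter.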
Applied to chat completions, precision means the model's answers are accurate and on-topic; recall means the model covers all the important points the conversation calls for.
Example: a model with high precision but low recall gives correct but very short answers, leaving some user questions unaddressed. A model with high recall but low precision says a lot but includes irrelevant or wrong information.
Good chat models balance precision and recall to be both relevant and complete.
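One way to make this trade-off concrete is to score an answer against a checklist of expected key points. This framing, and all names below, are hypothetical illustrations rather than a standard evaluation method:

```python
def answer_precision_recall(answer_points, expected_points):
    """Precision: share of the answer's points that were expected.
    Recall: share of the expected points the answer covered."""
    covered = set(answer_points) & set(expected_points)
    precision = len(covered) / len(answer_points) if answer_points else 0.0
    recall = len(covered) / len(expected_points) if expected_points else 0.0
    return precision, recall

# Precise but incomplete: everything said is correct, but half the points are missing.
print(answer_precision_recall(["refund policy"], ["refund policy", "shipping time"]))
# → (1.0, 0.5)
```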
- Good: low perplexity (roughly 10 or below), BLEU/ROUGE scores above 0.5, response time under 1 second, and high user engagement.
- Bad: high perplexity (above 50), BLEU/ROUGE below 0.2, responses slower than 3 seconds, and low user satisfaction or frequent fallback answers.
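These rules of thumb can be wired into a simple release gate. The cutoffs and field names below mirror the thresholds above but are assumptions, not a standard:

```python
def passes_quality_gate(metrics):
    """Return True only if the model clears the rule-of-thumb thresholds."""
    return (metrics["perplexity"] <= 10
            and metrics["bleu"] >= 0.5
            and metrics["response_seconds"] < 1.0)

print(passes_quality_gate({"perplexity": 8.2, "bleu": 0.61, "response_seconds": 0.4}))   # True
print(passes_quality_gate({"perplexity": 55.0, "bleu": 0.18, "response_seconds": 3.5}))  # False
```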
- Accuracy paradox: a high BLEU score does not guarantee good chat quality, because n-gram overlap ignores valid paraphrases, creativity, and conversational context.
- Data leakage: Testing on data the model saw during training inflates scores falsely.
- Overfitting: Model memorizes training responses but fails on new questions, showing low real-world performance.
- Ignoring user experience: Metrics like speed and engagement are as important as text quality.
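A first defense against the data-leakage pitfall is to check for test prompts that also appear in the training set. Real pipelines use fuzzy or near-duplicate matching; the exact-match normalization sketched here is a simplifying assumption:

```python
def leaked_examples(train_prompts, test_prompts):
    """Return test prompts that exactly match a training prompt (case/whitespace-insensitive)."""
    train = {p.strip().lower() for p in train_prompts}
    return [p for p in test_prompts if p.strip().lower() in train]

train = ["What is your refund policy?", "How do I reset my password?"]
test = ["what is your refund policy?", "Where is my order?"]
print(leaked_examples(train, test))  # ['what is your refund policy?']
```

Any non-empty result means scores on those test items are inflated and should be discarded or re-sampled.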
Your chat model has 98% accuracy on a test set but users report many irrelevant answers and slow responses. Is it good for production? Why or why not?
Answer: No. A single accuracy number may not reflect real chat quality: the model might be overfitting, or the test set may be easy or leaked from training. For chat models, user-experience metrics like relevance and response speed are just as crucial as offline scores.