First interaction with GenAI APIs - Model Metrics & Evaluation
When using GenAI APIs for the first time, the key metric to focus on is response relevance: how well the AI's answers match what was asked. Because GenAI produces free-form text, exact correctness is hard to measure, so you judge how useful and accurate the responses are instead. The other important metric is latency, how fast the API responds, because quick answers improve the user experience.
For GenAI text generation, a confusion matrix is not typical. Instead, you can think of evaluation like this:
User Query: "What is the capital of France?"
Possible AI Responses:
- Correct: "Paris"
- Incorrect: "Berlin"
Evaluation:
- True Positive (TP): AI gives "Paris" when asked about France's capital.
- False Positive (FP): AI gives "Paris" when asked about Germany's capital.
- False Negative (FN): AI fails to say "Paris" when asked about France.
- True Negative (TN): AI correctly does not say "Paris" for unrelated questions.
This helps understand when the AI is right or wrong in context.
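The contextual TP/FP/FN/TN labelling above can be sketched as a small function. The helper name `label_response`, the `expected`/`answer_given` parameters, and the exact-match comparison are illustrative assumptions; real grading of free-form text needs fuzzier matching.

```python
# Minimal sketch of the contextual TP/FP/FN/TN labelling described above.
# `expected` is the reference answer for the query (empty string = no answer
# expected); all names here are assumptions for illustration.

def label_response(expected: str, response: str, answer_given: bool = True) -> str:
    """Classify one AI response against the reference answer."""
    if not answer_given:
        # Model declined to answer: wrong only if an answer was expected.
        return "FN" if expected else "TN"
    if response.strip().lower() == expected.strip().lower():
        return "TP"
    # Gave an answer, but not the expected one.
    return "FP"

print(label_response("Paris", "Paris"))                    # TP
print(label_response("Paris", "Berlin"))                   # FP
print(label_response("Paris", "", answer_given=False))     # FN
```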
In GenAI APIs, precision means: of the answers the AI gives, what fraction are correct. Recall means: of all the questions that should be answered, what fraction the AI actually answers correctly.
Example: If you ask many questions and the AI answers only some of them, it may show high precision (the answers it gives are mostly right) but low recall (it misses many questions).
For a chatbot, you want a balance: good precision so answers are reliable, and good recall so it answers most questions.
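These definitions can be computed directly from a batch of graded responses. The `graded` records below are hypothetical, as is the assumption that every question in the batch should be answered.

```python
# Sketch: precision and recall over a small batch of graded chatbot
# responses. Each record notes whether the model answered at all, and
# whether the answer was correct (hypothetical data).

graded = [
    {"answered": True,  "correct": True},
    {"answered": True,  "correct": False},
    {"answered": False, "correct": False},  # question the model skipped
    {"answered": True,  "correct": True},
]

answered = [g for g in graded if g["answered"]]
correct = [g for g in answered if g["correct"]]

precision = len(correct) / len(answered)  # correct among answers given
recall = len(correct) / len(graded)       # correct among all questions asked

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the model is right in 2 of its 3 answers (precision 0.67) but resolves only 2 of the 4 questions asked (recall 0.50), matching the imbalance described above.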
Good: The AI answers 90% of questions correctly (high precision) and responds to 85% of questions asked (high recall). Response time is under 1 second.
Bad: The AI answers only 50% of questions correctly and skips many questions (low recall). Responses take over 5 seconds, frustrating users.
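The good/bad targets above can double as an automated quality gate before deployment. The function name and threshold values simply mirror the example; they are illustrative, not industry standards.

```python
# Sketch of a release gate built from the "good" targets above:
# precision >= 90%, recall >= 85%, response time under 1 second.
# Thresholds are illustrative assumptions taken from the example.

def meets_targets(precision: float, recall: float, latency_s: float) -> bool:
    return precision >= 0.90 and recall >= 0.85 and latency_s < 1.0

print(meets_targets(0.92, 0.88, 0.7))  # the "good" scenario
print(meets_targets(0.50, 0.40, 5.2))  # the "bad" scenario
```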
- Accuracy paradox: If most incoming queries are out of scope, a model that always answers "I don't know" can score high on accuracy while being useless to users.
- Data leakage: Testing on questions the AI was trained on can inflate performance.
- Overfitting: AI might memorize answers instead of understanding, failing on new questions.
- Ignoring latency: Accuracy alone is not enough; a correct answer that arrives after the user has given up delivers no value.
Your GenAI model answers 98% of questions with 98% accuracy but only responds to 12% of questions asked. Is it good for production? Why or why not?
Answer: No, because the model rarely answers questions (low recall). Even if answers are mostly correct, users will be frustrated by many unanswered queries.
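Working the exercise's numbers makes the imbalance concrete. The batch size of 100 is an assumption for easy arithmetic.

```python
# Working the exercise's numbers: assume 100 questions, of which the
# model answers 12 (12% response rate) and gets 98% of those right.

total = 100
answered = 12
correct = answered * 0.98

precision = correct / answered  # answers given are highly reliable
recall = correct / total        # but almost all questions go unanswered

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is 0.98 but recall is only about 0.12, which is exactly the low-recall failure mode the answer describes.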