
Message roles (system, user, assistant) in Prompt Engineering / GenAI - Model Metrics & Evaluation

Which metric matters for Message roles and WHY

When working with message roles like system, user, and assistant in AI chat models, the key metric is role-classification accuracy: whether each message is attributed to the correct speaker. The system role sets instructions, the user role asks questions, and the assistant role replies. Correct role recognition is what lets the model behave as expected.
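To make the three roles concrete, here is a minimal sketch of the message format used by many chat-completion APIs (the exact field names follow the common OpenAI-style convention; a specific API may differ):

```python
# A minimal chat payload sketch: each message carries a role that
# tells the model who is speaking.
messages = [
    {"role": "system", "content": "You are a concise math tutor."},      # sets instructions
    {"role": "user", "content": "What is 2 + 2?"},                        # asks a question
    {"role": "assistant", "content": "2 + 2 = 4."},                       # the model's reply
]

# Role classification means recovering this sequence correctly.
roles = [m["role"] for m in messages]
print(roles)  # ['system', 'user', 'assistant']
```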

Confusion matrix for role classification
      | Actual \ Predicted | System | User | Assistant |
      |--------------------|--------|------|-----------|
      | System             | 90     | 5    | 5         |
      | User               | 3      | 92   | 5         |
      | Assistant          | 2      | 4    | 94        |

      Total samples = 300

      Precision and recall per role:
      - System Precision = 90 / (90 + 3 + 2) = 90 / 95 = 0.947
      - System Recall = 90 / (90 + 5 + 5) = 90 / 100 = 0.9
      - User Precision = 92 / (5 + 92 + 4) = 92 / 101 = 0.910
      - User Recall = 92 / (3 + 92 + 5) = 92 / 100 = 0.92
      - Assistant Precision = 94 / (5 + 5 + 94) = 94 / 104 = 0.904
      - Assistant Recall = 94 / (2 + 4 + 94) = 94 / 100 = 0.94
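The per-role numbers above can be recomputed directly from the matrix. In this sketch, rows hold the actual role and columns the predicted role, so precision divides by a column sum and recall by a row sum:

```python
# Confusion matrix from the table: rows = actual role, columns = predicted role.
roles = ["system", "user", "assistant"]
cm = [
    [90, 5, 5],   # actual system
    [3, 92, 5],   # actual user
    [2, 4, 94],   # actual assistant
]

def precision(cm, i):
    # precision = TP / everything predicted as role i (column sum)
    return cm[i][i] / sum(row[i] for row in cm)

def recall(cm, i):
    # recall = TP / everything that actually was role i (row sum)
    return cm[i][i] / sum(cm[i])

for i, role in enumerate(roles):
    print(f"{role}: precision={precision(cm, i):.3f}, recall={recall(cm, i):.3f}")
```

Running this reproduces the figures above, e.g. system precision 90/95 ≈ 0.947 and system recall 90/100 = 0.9.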
    
Precision vs Recall tradeoff with examples

In message role classification, precision measures how often a predicted role label is correct, while recall measures what fraction of the actual messages of a role were identified.

If precision is low, the model often assigns a role to messages that do not belong to it, causing wrong responses. If recall is low, many genuine messages of a role are missed or misclassified, leading to ignored instructions or questions.

Example: If the system role is confused with user role, the AI might ignore important instructions. Here, high recall for system role is critical to catch all instructions.

Example: If user messages are misclassified as assistant, the AI might respond to itself, causing confusion. High precision for user role avoids this.

What good vs bad metric values look like

Good: Precision and recall above 90% for all roles means the model correctly identifies who is speaking most of the time.

Bad: Precision or recall below 70% means many messages are misclassified. For example, if system role recall is 50%, half of instructions are missed, causing poor AI behavior.

Common pitfalls in metrics
  • Accuracy paradox: If one role is very common, high accuracy can hide poor performance on rare roles.
  • Data leakage: Evaluation data that overlaps with the training data (e.g., later turns of the same conversations) artificially inflates metrics.
  • Overfitting: Model memorizes training roles but fails on new conversations.
  • Ignoring role context: Metrics without considering conversation flow can mislead about real performance.
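The accuracy paradox from the first bullet is easy to demonstrate with invented numbers: if user messages dominate a test set, a model that almost always predicts "user" scores high accuracy while missing most system messages.

```python
# Accuracy-paradox sketch with a made-up, imbalanced test set:
# 10 system messages, 90 user messages.
actual = ["system"] * 10 + ["user"] * 90
# The model finds only 1 of the 10 system messages and labels
# everything else "user".
predicted = ["system"] * 1 + ["user"] * 99

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
system_recall = sum(
    a == p == "system" for a, p in zip(actual, predicted)
) / actual.count("system")

print(accuracy)       # 0.91 -- looks healthy
print(system_recall)  # 0.1  -- 9 of 10 instructions missed
```

Overall accuracy of 91% hides the fact that 90% of the rare (and critical) system role is misclassified, which is why per-role precision and recall matter more than headline accuracy.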
Self-check question

Your model has 98% accuracy but only 12% recall on the system role. Is it good for production? Why or why not?

Answer: No, it is not good. Even though overall accuracy is high, the model misses 88% of system messages (instructions). This means important instructions are ignored, causing the AI to behave incorrectly. High recall on system role is critical.

Key Result
High precision and recall for each message role ensure the AI correctly understands and responds to system, user, and assistant messages.