Sequence-to-sequence basics in TensorFlow - Model Metrics & Evaluation

In sequence-to-sequence tasks such as translation or summarization, two metrics matter most: token accuracy and BLEU. Token accuracy measures how often the model predicts the exact next token correctly. BLEU measures how close the whole predicted sequence is to the reference sequence, even when the match is not exact. Together they show whether the model is learning to generate meaningful sequences.
For sequence-to-sequence models, confusion matrices are less common because the outputs are sequences rather than single labels. Instead, we look at token-level accuracy or sequence-level metrics.
Token-level example:
True sequence: [I, am, happy]
Predicted: [I, am, sad]
Token accuracy = 2 correct tokens / 3 total tokens ≈ 66.7%
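The arithmetic above can be sketched as a small helper. This is illustrative pure Python, not a TensorFlow API; the name `token_accuracy` is my own:

```python
def token_accuracy(true_tokens, pred_tokens):
    """Fraction of positions where the predicted token matches the
    true token. Positions missing from a shorter prediction count
    as wrong, since we divide by the true sequence length."""
    matches = sum(t == p for t, p in zip(true_tokens, pred_tokens))
    return matches / len(true_tokens)

true_seq = ["I", "am", "happy"]
pred_seq = ["I", "am", "sad"]
print(round(token_accuracy(true_seq, pred_seq), 3))  # → 0.667
```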
BLEU score compares n-gram overlaps between predicted and true sequences.
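To make the n-gram overlap idea concrete, here is a minimal BLEU-style sketch: the geometric mean of clipped n-gram precisions times a brevity penalty. It assumes a single reference; a real implementation such as `nltk.translate.bleu_score.sentence_bleu` also handles multiple references and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, candidate, max_n=2):
    """Simplified single-reference BLEU, for illustration only."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clipped overlap: each candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any empty n-gram overlap zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

# Same example as above: unigram precision 2/3, bigram precision 1/2.
print(simple_bleu(["I", "am", "happy"], ["I", "am", "sad"]))
```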
In sequence generation, precision is the fraction of predicted tokens that are correct, while recall is the fraction of true tokens that the prediction actually covers. In chatbot replies, for example, high precision means the reply consists mostly of correct words; high recall means the reply covers most of the expected content.
Sometimes a model generates safe but short replies (high precision, low recall); other times it generates longer replies that cover more content but include errors (higher recall, lower precision). Balancing the two makes replies both accurate and informative.
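This trade-off can be illustrated with bag-of-tokens precision and recall (order ignored). The helper name is my own, not a library function:

```python
from collections import Counter

def token_precision_recall(true_tokens, pred_tokens):
    """Bag-of-tokens precision and recall, ignoring word order.
    precision = correct predicted tokens / all predicted tokens
    recall    = correct predicted tokens / all true tokens"""
    overlap = sum((Counter(true_tokens) & Counter(pred_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(true_tokens) if true_tokens else 0.0
    return precision, recall

# A short "safe" reply: every word it says is right, but it says little.
p, r = token_precision_recall(
    ["see", "you", "at", "noon", "tomorrow"],  # expected reply
    ["see", "you"],                            # generated reply
)
print(p, r)  # → 1.0 0.4 (high precision, low recall)
```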
Good: Token accuracy above 80% and a BLEU score above 0.5 usually mean the model generates sequences close to the target and has learned the task well, though exact thresholds vary by dataset and domain.
Bad: Token accuracy below 50% and a BLEU score near 0 mean the predictions are mostly wrong or random; the model may be guessing rather than learning the sequence patterns.
- Ignoring sequence length: Short predictions can have high accuracy but miss important content.
- Overfitting: High training accuracy but low validation BLEU means the model memorizes training sequences.
- Data leakage: If test sequences appear in training, metrics look falsely good.
- Using only accuracy: It misses sequence quality; BLEU or ROUGE give better insight.
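A common guard against the first pitfall is to mask padding positions when computing accuracy, so short, heavily padded sequences do not inflate the score. A minimal pure-Python sketch of what a masked Keras metric computes, assuming padding id 0:

```python
PAD = 0  # assumed padding token id; real vocabularies vary

def masked_token_accuracy(true_ids, pred_ids):
    """Token accuracy computed only over non-padding positions of the
    true sequence. Without the mask, trailing padding that the model
    trivially predicts would inflate the score."""
    pairs = [(t, p) for t, p in zip(true_ids, pred_ids) if t != PAD]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)

# Unmasked accuracy would be 4/5 here; masked accuracy is 2/3,
# because the two matching pad positions are excluded.
print(masked_token_accuracy([5, 7, 9, 0, 0], [5, 7, 2, 0, 0]))
```

In TensorFlow/Keras the same effect is typically achieved by passing a sample-weight mask to the metric rather than writing the loop by hand.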
Your sequence-to-sequence model has 98% token accuracy but BLEU score of 0.1 on test data. Is it good for production?
Answer: No. High token accuracy with a very low BLEU score means the model predicts many individual tokens correctly but fails to produce coherent full sequences. This gap often arises because token accuracy is measured with teacher forcing (the model sees the correct previous tokens), while BLEU is computed on free-running generation, where early mistakes compound. The model may be producing common words without the right order or context; improve sequence-level learning before deploying to production.