Text generation with RNN in TensorFlow - Model Metrics & Evaluation
For text generation with RNNs, the standard metrics are cross-entropy loss and perplexity. Both measure how well the model predicts the next word or character in a sequence. Lower perplexity means the model assigns higher probability to the true next token, which generally translates into more natural, coherent text. Token-level accuracy can be reported, but it is less informative: predicting the exact next word is hard, and many different continuations can be valid.
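To make the relationship between the two metrics concrete: cross-entropy is the average negative log-probability the model assigns to the true next token, and perplexity is simply its exponential. A minimal sketch in plain Python (the probabilities are made-up values for illustration, not real model outputs):

```python
import math

# Hypothetical probabilities the model assigned to the TRUE next token
# at each step of a short sequence (made-up numbers for illustration).
true_token_probs = [0.30, 0.05, 0.60, 0.10]

# Cross-entropy: average negative log-probability of the true tokens.
cross_entropy = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

# Perplexity: exponential of cross-entropy.
# A perplexity of 1.0 would mean every true token was predicted with probability 1.
perplexity = math.exp(cross_entropy)

print(f"cross-entropy = {cross_entropy:.3f}, perplexity = {perplexity:.3f}")
```

The same quantities come out of a framework loss function (e.g. sparse categorical cross-entropy in TensorFlow); the pure-Python version just makes the arithmetic visible.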
Text generation is a sequence prediction task with an open-ended output space, so confusion matrices are not practical. Instead, evaluation focuses on loss curves and perplexity scores over training epochs.
Epoch | Training Loss | Validation Loss | Perplexity
------|---------------|-----------------|-----------
  1   |      2.5      |       2.7       |   14.9
  2   |      2.1      |       2.3       |    9.9
  3   |      1.8      |       2.0       |    7.4
  4   |      1.6      |       1.9       |    6.7
  5   |      1.5      |       1.8       |    6.1
Both loss and perplexity decrease steadily, and the gap between training and validation loss stays small, so the model is improving without obvious overfitting.
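As a sanity check, perplexity is just the exponential of the cross-entropy loss, so the last column can be reproduced from the validation loss. Taking epoch 4 as an example:

```python
import math

val_loss_epoch_4 = 1.9              # validation loss from the epoch-4 row
perplexity = math.exp(val_loss_epoch_4)
print(round(perplexity, 1))         # agrees with the 6.7 reported in the table
```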
Precision and recall are less relevant for text generation because the output is open-ended: there is no single correct answer to score against. Instead, there is a tradeoff between creativity and coherence. For example:
- High creativity: The model generates surprising and diverse text but may produce errors or nonsense.
- High coherence: The model produces safe, predictable text but may be boring or repetitive.
This tradeoff can be controlled at sampling time, most commonly with the temperature parameter: temperatures above 1 favor creativity, temperatures below 1 favor coherence.
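The effect of temperature can be seen directly on the output distribution: the model's logits are divided by the temperature before the softmax, so higher temperatures flatten the distribution and lower temperatures sharpen it. A self-contained sketch with made-up logits over a tiny vocabulary:

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng=random):
    """Draw one token index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up logits over a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1]

sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked: coherent
flat = softmax_with_temperature(logits, temperature=2.0)   # more uniform: creative

print(f"T=0.5: {[round(p, 3) for p in sharp]}")
print(f"T=2.0: {[round(p, 3) for p in flat]}")
```

At low temperature the most likely token dominates and sampling becomes nearly greedy; at high temperature probability mass spreads toward the rarer tokens, which is where both diversity and nonsense come from.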
Good:
- Low cross-entropy loss (e.g., below 1.5 on validation data)
- Low perplexity (close to 1 means perfect prediction)
- Generated text is fluent, relevant, and context-aware
Bad:
- High loss and perplexity (e.g., above 3 or 4)
- Generated text is random, repetitive, or nonsensical
- Model overfits training data and fails on new prompts
Common pitfalls:
- Overfitting: Low training loss but high validation loss means the model memorizes training text and can't generalize.
- Ignoring diversity: Only optimizing for loss can lead to dull, repetitive text.
- Using accuracy: Accuracy is misleading because many next words can be correct; it doesn't capture quality well.
- Data leakage: If test data overlaps with training, metrics will be unrealistically good.
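The first pitfall, overfitting, is usually caught by watching the gap between training and validation loss across epochs. A simple early-warning check might look like this (the loss histories below are hypothetical, and the function name is just illustrative):

```python
def overfitting_warning(train_losses, val_losses, patience=2):
    """Flag likely overfitting if validation loss has risen for `patience`
    consecutive epochs while training loss kept falling."""
    rising = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] > val_losses[i - 1] and train_losses[i] < train_losses[i - 1]:
            rising += 1
            if rising >= patience:
                return True
        else:
            rising = 0
    return False

# Hypothetical run: training loss keeps dropping, validation loss turns around.
train = [2.5, 2.1, 1.8, 1.4, 1.1, 0.8]
val = [2.7, 2.3, 2.0, 2.1, 2.4, 2.8]

print(overfitting_warning(train, val))  # validation rose while training fell
```

In practice the same idea is what built-in early-stopping callbacks implement: stop training once validation loss stops improving.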
Your RNN text generation model has a validation accuracy of 85% but a perplexity of 50. Is this model good for generating text? Why or why not?
Answer: No, this model is not good for generation. The 85% accuracy is misleading: accuracy only rewards matching one reference token, and in open-ended text many alternatives are valid, so it says little about quality. The perplexity of 50 is the more telling number: the model is, on average, as uncertain as if it were choosing uniformly among 50 tokens, so generated text will likely be poor quality and incoherent.
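To make the answer quantitative: since perplexity is the exponential of cross-entropy, a perplexity of 50 implies a cross-entropy of about ln(50) ≈ 3.9, squarely in the "above 3 or 4" range flagged as bad earlier:

```python
import math

perplexity = 50
implied_cross_entropy = math.log(perplexity)  # ~3.91, in the bad range above 3-4
print(round(implied_cross_entropy, 2))
```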