In language generation, we want to measure how well a model produces text that is both coherent and diverse. Metrics like perplexity quantify how well the model predicts the next token, but they do not capture creativity or variety.
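As a minimal sketch of the idea, perplexity is the exponential of the average negative log-probability the model assigned to each observed token; the per-token probabilities below are made-up illustration values, not output of a real model.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability
    of the tokens actually observed. Lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities a model assigned to each token in a sentence.
probs = [0.25, 0.5, 0.1, 0.4]
print(round(perplexity(probs), 3))
```

A model that assigned probability 1.0 to every token would score a perplexity of 1, the theoretical minimum.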
Instead, we complement them with diversity metrics such as distinct-n (the ratio of unique n-grams to total n-grams in the output) and with human evaluation of fluency and relevance. Decoding settings such as temperature and the sampling method control randomness in token choice, affecting both diversity and quality.
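Distinct-n is simple to compute directly; a minimal sketch (the example sentence is illustrative):

```python
def distinct_n(tokens, n):
    """Ratio of unique n-grams to total n-grams; higher means more diverse."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

text = "the cat sat on the mat the cat sat".split()
print(distinct_n(text, 1))  # unigram diversity
print(distinct_n(text, 2))  # bigram diversity
```

Repetitive text drives the score toward 0, while text with no repeated n-grams scores 1.0.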
So the key axes are diversity (to avoid bland, repetitive text) and coherence (to keep the text meaningful), and we balance them by adjusting the temperature and the sampling method.
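Temperature implements this trade-off by rescaling the model's logits before sampling; a minimal sketch with made-up logits (not tied to any real model):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Divide logits by T, apply softmax, then sample one index.
    T < 1 sharpens the distribution (more coherent, less diverse);
    T > 1 flattens it (more diverse, riskier)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a 3-token vocabulary.
logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, temperature=0.5))
```

At very low temperature this approaches greedy decoding (always the highest logit); at high temperature it approaches uniform sampling over the vocabulary.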