For models like GPT (text generation), DALL-E and Stable Diffusion (image generation), the key metrics differ because their tasks differ.
GPT: We look at perplexity, the exponential of the average negative log-likelihood the model assigns to each token of held-out text. Lower perplexity means the model predicts the text better.
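As a rough illustration of how perplexity is calculated, here is a minimal sketch. It assumes we already have the probability the model assigned to each correct token; the function name `perplexity` and the example probabilities are made up for illustration and do not come from any GPT API.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token).

    token_probs: hypothetical list of probabilities the model assigned
    to each token that actually occurred in the text.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Illustrative values only: a model that is confident about the right tokens
# versus one that is mostly guessing.
confident = perplexity([0.9, 0.8, 0.95])
uncertain = perplexity([0.2, 0.1, 0.25])
print(confident < uncertain)
```

A model that assigns probability 0.25 to every correct token has a perplexity of 4, which can be read as "as uncertain as choosing uniformly among 4 options".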
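DALL-E and Stable Diffusion: We use FID (Fréchet Inception Distance) and IS (Inception Score). FID measures how far the distribution of generated images is from that of real images in the feature space of an Inception network, so it reflects both quality and diversity. IS checks that each generated image is classified confidently and that the predicted classes vary across images. Lower FID and higher IS mean better images.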
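To make the Inception Score more concrete, here is a small sketch of its formula, IS = exp(mean over images of KL(p(y|x) || p(y))). It assumes the per-image class probabilities have already been produced by an Inception classifier; the function name `inception_score` and the toy two-class inputs are hypothetical. FID is left out because it additionally needs feature means, covariances, and a matrix square root.

```python
import math

def inception_score(class_probs):
    """Inception Score from per-image class probabilities.

    class_probs: list of probability vectors p(y|x), one per generated image
    (assumed to come from an Inception classifier; toy values are used below).
    """
    n = len(class_probs)
    num_classes = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[k] for p in class_probs) / n for k in range(num_classes)]
    # Average KL divergence between each p(y|x) and the marginal p(y).
    mean_kl = 0.0
    for p in class_probs:
        mean_kl += sum(pk * math.log(pk / marginal[k])
                       for k, pk in enumerate(p) if pk > 0)
    mean_kl /= n
    return math.exp(mean_kl)

# Confident and varied predictions score higher than uninformative ones.
sharp_and_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])
print(sharp_and_diverse > uninformative)
```

In this toy case the score ranges from 1 (every image gets the same flat prediction) up to the number of classes, here 2 (each image is classified with certainty and the classes are evenly covered).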
These metrics give us a quantitative way to check whether a model creates realistic and useful outputs, and to compare one model or checkpoint against another.