
Why generative models create data in PyTorch - Why Metrics Matter

Which metric matters for this concept and WHY

For generative models, which create data rather than classify it, we want to know how well the model mimics the real data distribution. Key metrics include likelihood (how probable the model thinks real data is), Fréchet Inception Distance (FID, how close generated images are to real ones in feature space), and Inception Score (IS, how clear and diverse generated images are). Together these tell us whether generated data looks real and varied: likelihood measures fit to the data, while FID and IS measure the quality and diversity of generated samples.
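The Fréchet distance behind FID can be sketched directly from feature means and covariances. This is a minimal toy version: it uses random vectors in place of the Inception-v3 features a real FID pipeline would extract (an assumption for illustration only).

```python
# Toy sketch of the Frechet distance used by FID:
#   FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 * sqrt(C_r @ C_f))
# Rows of `real` and `fake` stand in for Inception features (illustrative).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real, fake):
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_f = np.cov(fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 4))
close = rng.normal(0.0, 1.0, size=(2000, 4))  # same distribution -> FID near 0
far = rng.normal(3.0, 1.0, size=(2000, 4))    # shifted mean -> large FID

print(frechet_distance(real, close))  # small, close to 0
print(frechet_distance(real, far))    # large (mean shift dominates)
```

Matching samples drive both the mean and covariance terms toward zero, which is why "lower is better" for FID.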

Confusion matrix or equivalent visualization (ASCII)
    Real data vs. generated data, as seen by a hypothetical
    real-vs-fake classifier (conceptual):

                     | Classified as real | Classified as generated
    -----------------|--------------------|------------------------
    Real data        | TP (real kept      | FN (real mistaken
                     | as real)           | for generated)
    -----------------|--------------------|------------------------
    Generated data   | FP (generated      | TN (generated
                     | passes as real)    | caught as fake)

    Note: This is a conceptual analogy. Generative models are not evaluated with confusion matrices directly, but metrics like FID capture a similar idea of real/generated similarity.
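The analogy can be made concrete by tallying the four cells from the verdicts of a hypothetical real-vs-fake classifier; the verdict arrays below are made-up inputs for illustration.

```python
# Tally the conceptual confusion matrix, assuming a hypothetical
# discriminator that labels each sample "real" (1) or "fake" (0).
import numpy as np

def confusion_counts(verdicts_on_real, verdicts_on_fake):
    tp = int(np.sum(verdicts_on_real == 1))  # real classified as real
    fn = int(np.sum(verdicts_on_real == 0))  # real classified as generated
    fp = int(np.sum(verdicts_on_fake == 1))  # generated passes as real
    tn = int(np.sum(verdicts_on_fake == 0))  # generated caught as fake
    return tp, fn, fp, tn

verdicts_on_real = np.array([1, 1, 1, 0])
verdicts_on_fake = np.array([1, 0, 0, 0])
print(confusion_counts(verdicts_on_real, verdicts_on_fake))  # (3, 1, 1, 3)
```

A generator that fools the classifier pushes FP up and TN down, which is exactly the "generated data looks real" behavior the metrics above try to quantify.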
    
Precision vs Recall (or equivalent tradeoff) with concrete examples

In generative modeling, precision means the fraction of generated samples that look real (quality), and recall means how much of the real data's variety the model covers (diversity).

Example: A model that creates only one perfect image has high precision but low recall (no variety). A model that creates many varied images but some look fake has high recall but lower precision.

Good generative models balance both: they create many different images that all look real.
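This tradeoff can be sketched with a simple distance-threshold test, inspired by (but much simpler than) published precision/recall metrics for generative models: a generated sample is "precise" if it lands near some real sample, and a real sample is "covered" if some generated sample lands near it. The `eps` threshold and 2-D toy data are illustrative assumptions.

```python
# Toy precision (quality) vs. recall (coverage) for sample sets.
import numpy as np

def precision_recall(real, fake, eps):
    # Pairwise distances: rows are fake samples, columns are real samples.
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    precision = float(np.mean(d.min(axis=1) <= eps))  # fake near some real
    recall = float(np.mean(d.min(axis=0) <= eps))     # real near some fake
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 2))
# Mode-collapsed generator: realistic-looking samples, all near one point.
collapsed = rng.normal(0.0, 0.05, size=(500, 2))

p, r = precision_recall(real, collapsed, eps=0.5)
print(p, r)  # high precision, low recall
```

This is the "one perfect image" failure from the example above: every generated sample sits near real data (high precision), but most of the real distribution goes uncovered (low recall).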

What "good" vs "bad" metric values look like for this use case
  • Likelihood: Higher is better. Good models assign high probability to real data.
  • FID: Lower is better. Good models have a low FID (near 0 in the ideal case), meaning the feature statistics of generated data closely match those of real data.
  • Inception Score: Higher is better. Good models produce clear, diverse images.

Bad example: High FID (e.g., 100+) means generated data looks very different from real data. Low Inception Score means images are blurry or repetitive.

Metrics pitfalls (accuracy paradox, data leakage, overfitting indicators)
  • Mode collapse: Model generates limited variety (low recall) but metrics like likelihood may still look okay.
  • Overfitting: Model memorizes training data, so generated samples look good but model fails to generalize.
  • Data leakage: If test data leaks into training, metrics falsely improve.
  • Misleading likelihood: Some models have intractable likelihood, so surrogate metrics are needed.
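The mode-collapse pitfall can be illustrated with a toy 1-D mixture (an illustrative setup, not a real generator): samples stuck on a single mode still score a healthy average log-likelihood under the data density, even though all diversity is gone.

```python
# Mode collapse can hide behind likelihood: collapsed samples that sit on
# one mode of the data density score almost as well as faithful samples.
import torch

torch.manual_seed(0)
# "Real" data density: two well-separated modes at -3 and +3.
mix = torch.distributions.MixtureSameFamily(
    torch.distributions.Categorical(torch.tensor([0.5, 0.5])),
    torch.distributions.Normal(torch.tensor([-3.0, 3.0]), torch.tensor([1.0, 1.0])),
)
faithful = mix.sample((5000,))        # covers both modes
collapsed = torch.randn(5000) + 3.0   # stuck entirely on the +3 mode

print(mix.log_prob(faithful).mean())   # baseline average log-likelihood
print(mix.log_prob(collapsed).mean())  # nearly as high, despite zero diversity
```

The two averages are almost identical, which is why likelihood alone cannot flag mode collapse; a coverage-style metric (recall) is needed alongside it.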
Self-check: Your model has 98% accuracy but 12% recall on fraud. Is it good?

This question is from classification but helps understand tradeoffs. A model with 98% accuracy but only 12% recall on fraud misses most fraud cases. For fraud detection, recall is critical because missing fraud is costly. So, despite high accuracy, this model is not good for production.

Similarly, for generative models, high quality (precision) alone is not enough if diversity (recall) is very low.

Key Result
Generative models need metrics that measure both quality (precision) and diversity (recall) of created data to ensure realistic and varied outputs.