
Classification reports in TensorFlow - Model Metrics & Evaluation

Which metric matters for Classification Reports and WHY

A classification report summarizes key metrics: precision, recall, and F1-score for each class. These metrics help us understand how well the model predicts each category.

Precision tells us how many predicted positives are actually correct. Recall tells us how many actual positives the model found. F1-score balances precision and recall into a single number: their harmonic mean.

Using these metrics together helps us see if the model is making too many false alarms (low precision) or missing important cases (low recall).
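In practice, such a report is often produced with scikit-learn's `classification_report` on a model's predictions. A minimal sketch (the labels and predictions below are made up for illustration):

```python
from sklearn.metrics import classification_report

# Hypothetical ground-truth labels and model predictions for a binary task
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Prints precision, recall, F1-score, and support for each class
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```

The same call works for predictions from a TensorFlow model once its probability outputs are converted to class labels (e.g. by thresholding or `argmax`).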

Confusion Matrix Example
      Actual \ Predicted | Positive | Negative
      -------------------|----------|---------
      Positive           |    70    |   30
      Negative           |    10    |   90
    

Here, TP=70, FN=30, FP=10, TN=90. Total samples = 70+30+10+90 = 200.

From this, precision = 70 / (70 + 10) = 0.875, recall = 70 / (70 + 30) = 0.7, F1-score = 2 * (0.875 * 0.7) / (0.875 + 0.7) ≈ 0.778.
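The arithmetic above can be checked directly in a few lines of plain Python:

```python
# Worked example from the confusion matrix above (TP=70, FN=30, FP=10, TN=90)
tp, fn, fp, tn = 70, 30, 10, 90

precision = tp / (tp + fp)          # 70 / 80  = 0.875
recall    = tp / (tp + fn)          # 70 / 100 = 0.7
f1        = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))  # 0.875 0.7 0.778
```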

Precision vs Recall Tradeoff with Examples

Imagine a spam email filter:

  • High precision means most emails marked as spam really are spam. This avoids losing important emails.
  • High recall means the filter catches most spam, but pushing recall up may flag some good emails as spam (lowering precision).

For a cancer detector:

  • High recall is critical to catch as many cancer cases as possible.
  • Precision can be lower, because a false alarm can be ruled out with follow-up testing.

Classification reports help balance these needs by showing precision and recall per class.
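One common way this tradeoff plays out is in choosing the decision threshold. A minimal sketch with hypothetical model scores: raising the threshold makes the model more conservative, which tends to raise precision and lower recall.

```python
# Hypothetical model scores and true labels for eight examples
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def precision_recall(threshold):
    # Predict positive whenever the score clears the threshold
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.5, 0.85):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

With these made-up numbers, threshold 0.5 gives precision 0.60 and recall 0.75, while threshold 0.85 gives precision 1.00 and recall 0.50: the spam-filter and cancer-detector scenarios above correspond to picking different points on this curve.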

What "Good" vs "Bad" Metric Values Look Like

Good classification report values:

  • Precision and recall above 0.8 for important classes.
  • F1-score close to 1 means balanced and strong performance.

Bad values:

  • Precision or recall below 0.5 means many mistakes.
  • Large difference between precision and recall indicates imbalance (e.g., many false positives or false negatives).

Always check metrics per class, especially if classes are imbalanced.

Common Pitfalls in Using Classification Reports
  • Accuracy paradox: High accuracy can be misleading if classes are imbalanced.
  • Data leakage: If test data leaks into training, metrics look unrealistically good.
  • Overfitting: Very high training metrics but poor test metrics show the model memorizes data.
  • Ignoring per-class metrics hides poor performance on minority classes.
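The accuracy paradox is easy to demonstrate with a toy imbalanced dataset: a model that always predicts the majority class scores high accuracy yet has zero recall on the minority class.

```python
# Toy imbalanced dataset: 95 negatives, 5 positives;
# a degenerate "model" that predicts negative for everything
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall_positive = tp / (tp + fn)

print(accuracy, recall_positive)  # 0.95 0.0
```

Accuracy alone looks strong (0.95), but the per-class recall of 0.0 in a classification report exposes that the positive class is never detected.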
Self-Check Question

Your model has 98% accuracy but only 12% recall on the fraud class. Is it good for production?

Answer: No. The model misses 88% of fraud cases (low recall), which is dangerous. High accuracy is misleading because fraud is rare. You need to improve recall to catch more fraud.

Key Result
Classification reports provide precision, recall, and F1-score per class to evaluate model performance beyond accuracy.