
Collaborative filtering in ML Python - Model Metrics & Evaluation

Which metric matters for Collaborative filtering and WHY

Collaborative filtering predicts user preferences, so we want metrics that show how close predictions are to actual user choices.

Common metrics include:

  • Root Mean Squared Error (RMSE): Measures average prediction error size. Lower is better.
  • Mean Absolute Error (MAE): Average absolute difference between predicted and actual ratings.
  • Precision and Recall: For top-N recommendations, precision shows how many recommended items were liked, recall shows how many liked items were recommended.
  • F1 Score: Balances precision and recall for recommendation relevance.
  • Mean Average Precision (MAP): Measures ranking quality of recommended items.

We choose metrics based on the goal: rating prediction accuracy or recommendation relevance.
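The rating-accuracy metrics above can be sketched in a few lines. This is a minimal illustration; the ratings are made up for the example:

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean Absolute Error: average size of the rating error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual vs. predicted ratings on a 5-star scale
actual    = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

Note that RMSE (0.75) exceeds MAE (0.625) here because squaring weights the two 1-star misses more heavily than the small ones.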

Confusion matrix or equivalent visualization

For top-N recommendation evaluation, we can use a confusion matrix based on whether items were recommended and liked:

      |-----------------|-------|-----------|
      |                 | Liked | Not Liked |
      |-----------------|-------|-----------|
      | Recommended     |  TP   |    FP     |
      | Not Recommended |  FN   |    TN     |
      |-----------------|-------|-----------|

Where:

  • TP (True Positive): Recommended and liked items.
  • FP (False Positive): Recommended but not liked items.
  • FN (False Negative): Liked but not recommended items.
  • TN (True Negative): Not recommended and not liked items.

This matrix helps calculate precision, recall, and F1 for recommendations.
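From the four cells of the matrix, precision, recall, and F1 follow directly. A minimal sketch with hypothetical counts (8 recommended-and-liked, 2 recommended-but-not-liked, 4 liked-but-not-recommended):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(p, r, f)  # precision 0.8, recall ~0.667, F1 ~0.727
```

Note that TN never appears: with huge item catalogs, "correctly not recommended" items dominate and carry little signal, which is also why plain accuracy is avoided.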

Precision vs Recall tradeoff with examples

In collaborative filtering, precision and recall often trade off:

  • High Precision, Low Recall: Recommending fewer items, but mostly ones the user likes. Good if you want to avoid annoying users with bad suggestions.
  • High Recall, Low Precision: Recommending many items, covering most of the liked ones but also many disliked ones. Good if you want to show many options and don't mind some misses.

Example: A movie app wants to avoid bad recommendations (high precision). It recommends 5 movies, of which 4 are liked (precision = 0.8), but it misses many liked movies (low recall).

Another app wants to surface as many liked movies as possible (high recall). It recommends 20 movies, of which 10 are liked (precision = 0.5), but it covers most of the liked movies (high recall).
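The two apps above can be compared with a small sketch. The item IDs are made up; we assume the user likes 12 movies in total, so app A reaches 4 of them and app B reaches 10:

```python
def precision_recall(recommended, liked):
    """Precision and recall for one top-N recommendation list."""
    hits = len(set(recommended) & set(liked))
    return hits / len(recommended), hits / len(liked)

# Hypothetical item IDs; the user likes 12 movies in total
liked = set(range(12))
app_a = [0, 1, 2, 3, 99]                         # 5 recs, 4 liked
app_b = list(range(10)) + list(range(100, 110))  # 20 recs, 10 liked

print(precision_recall(app_a, liked))  # (0.8, ~0.33): precise, low coverage
print(precision_recall(app_b, liked))  # (0.5, ~0.83): broad, less precise
```

Growing the list from 5 to 20 items raises recall but dilutes precision, which is the tradeoff in miniature.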

What "good" vs "bad" metric values look like for Collaborative filtering
  • RMSE/MAE: Good models have low values (e.g., RMSE < 1 on a 5-star scale). High values mean poor rating predictions.
  • Precision: Good if above 0.7 for top recommendations. Below 0.4 means many wrong suggestions.
  • Recall: Good if above 0.6, meaning most liked items are recommended. Below 0.3 means many liked items missed.
  • F1 Score: Good if above 0.65, balancing precision and recall well.
  • MAP: Higher values (closer to 1) mean better ranking of relevant items.

Good values depend on dataset size and user behavior, but these ranges are typical starting points.
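MAP is the mean, over all users, of a per-user Average Precision score that rewards placing relevant items near the top of the ranked list. A minimal sketch of the per-user computation, with made-up item IDs:

```python
def average_precision(ranked, relevant):
    """Average Precision for one user's ranked recommendation list."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # precision at this cut-off position
    return score / len(relevant) if relevant else 0.0

# Relevant items {A, B, C}; hits land at ranks 1, 3, and 5
ap = average_precision(["A", "X", "B", "Y", "C"], {"A", "B", "C"})
print(ap)  # (1/1 + 2/3 + 3/5) / 3 ~ 0.756
```

A perfect ranking (all relevant items first) scores 1.0; pushing relevant items down the list lowers the score even when recall is unchanged.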

Common pitfalls in Collaborative filtering metrics
  • Ignoring data sparsity: Many users rate few items, so metrics can be misleading if not averaged properly.
  • Overfitting: Model performs well on training data but poorly on new users or items.
  • Popularity bias: Recommending only popular items can inflate precision but reduce diversity.
  • Cold start problem: New users/items have no data, metrics may be poor or undefined.
  • Using accuracy alone: Accuracy is not meaningful for recommendation because most items are not interacted with.
  • Data leakage: Testing on data that the model has seen during training inflates metrics falsely.
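A common guard against the data-leakage pitfall is a per-user leave-one-out split: hold out one interaction per user for testing so evaluation items never appear in training. A minimal sketch (the helper and the interaction data are hypothetical):

```python
import random

def leave_one_out_split(user_items, seed=0):
    """Hold out one interaction per user; the rest form the training set."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = {}, {}
    for user, items in user_items.items():
        held_out = rng.choice(list(items))
        test[user] = held_out
        train[user] = [i for i in items if i != held_out]
    return train, test

interactions = {"u1": ["a", "b", "c"], "u2": ["b", "d"]}
train, test = leave_one_out_split(interactions)
# Each user's test item is absent from that user's training list
```

This also surfaces the cold-start issue: a user with a single interaction would end up with an empty training list, so such users are usually filtered out or handled separately.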
Self-check question

Your collaborative filtering model has 98% accuracy but only 12% recall on liked items. Is it good for production? Why or why not?

Answer: No, it is not good. The high accuracy is misleading because most items are not liked, so predicting "not liked" often is easy. The low recall means the model misses most liked items, so users get poor recommendations.
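The self-check scenario can be reproduced with a small hypothetical dataset: when only a few items are liked, a model that almost always predicts "not liked" scores high accuracy yet terrible recall. The numbers below are illustrative (roughly 98% accuracy, 12.5% recall):

```python
def accuracy_and_recall(y_true, y_pred):
    """Accuracy and recall for binary liked (1) / not-liked (0) labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return correct / len(y_true), tp / (tp + fn)

# 400 items, only 8 liked; the model finds just 1 of the 8
y_true = [1] * 8 + [0] * 392
y_pred = [1] + [0] * 399

acc, rec = accuracy_and_recall(y_true, y_pred)
print(acc, rec)  # ~0.98 accuracy, 0.125 recall
```

The 392 easy "not liked" predictions dominate the accuracy score, which is exactly why accuracy alone is listed among the pitfalls above.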

Key Result
Precision, recall, and RMSE are key metrics to evaluate collaborative filtering models, balancing recommendation relevance and prediction accuracy.