
User-based vs item-based in ML Python - Experiment Comparison

Experiment - User-based vs item-based
Problem: You want to build a recommendation system that suggests movies to users. Currently, you use a user-based collaborative filtering model.
Current Metrics: Training RMSE: 0.85, Validation RMSE: 1.20
Issue: The model overfits: training error is low but validation error is high, meaning it does not generalize well to new users or movies.
Your Task
Reduce overfitting and improve validation RMSE to below 1.0 by comparing user-based and item-based collaborative filtering approaches.
You must keep the same dataset and train/test split.
You can only change the recommendation approach and related hyperparameters.
Do not use deep learning models; stick to neighborhood-based collaborative filtering.
Solution
ML Python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import NearestNeighbors

# Sample user-item rating matrix (rows: users, columns: movies)
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

# Split data into train and test by masking one rating per user
# (note: if the sampled cell was already unrated, that user contributes
# nothing to the test set, since evaluation only uses cells where test > 0)
np.random.seed(42)
train = ratings.copy()
test = np.zeros(ratings.shape)
for user in range(ratings.shape[0]):
    idx = np.random.choice(ratings.shape[1], size=1, replace=False)[0]
    test[user, idx] = ratings[user, idx]
    train[user, idx] = 0

# Predict ratings with user-based CF: weighted average of the k most
# similar users' rating rows, with cosine similarity as the weight
def predict_user_based(train_matrix, k=2):
    model = NearestNeighbors(metric='cosine', algorithm='brute')
    model.fit(train_matrix)
    pred = np.zeros(train_matrix.shape)
    for user in range(train_matrix.shape[0]):
        # Ask for k+1 neighbors: the nearest "neighbor" is the user itself, so skip it
        distances, indices = model.kneighbors(train_matrix[user].reshape(1, -1), n_neighbors=k+1)
        sim_sum = 0.0
        weighted_sum = np.zeros(train_matrix.shape[1])
        for dist, neighbor in zip(distances.flatten()[1:], indices.flatten()[1:]):
            sim = 1 - dist  # cosine distance -> cosine similarity
            weighted_sum += sim * train_matrix[neighbor]
            sim_sum += sim
        if sim_sum > 0:
            pred[user] = weighted_sum / sim_sum
    return pred

# Predict ratings with item-based CF: same idea, but neighbors are columns
# (movies) of the rating matrix instead of rows (users)
def predict_item_based(train_matrix, k=2):
    model = NearestNeighbors(metric='cosine', algorithm='brute')
    model.fit(train_matrix.T)
    pred = np.zeros(train_matrix.shape)
    for item in range(train_matrix.shape[1]):
        # Ask for k+1 neighbors: the nearest "neighbor" is the item itself, so skip it
        distances, indices = model.kneighbors(train_matrix.T[item].reshape(1, -1), n_neighbors=k+1)
        sim_sum = 0.0
        weighted_sum = np.zeros(train_matrix.shape[0])
        for dist, neighbor in zip(distances.flatten()[1:], indices.flatten()[1:]):
            sim = 1 - dist  # cosine distance -> cosine similarity
            weighted_sum += sim * train_matrix[:, neighbor]
            sim_sum += sim
        if sim_sum > 0:
            pred[:, item] = weighted_sum / sim_sum
    return pred

# Predict and evaluate user-based
user_pred = predict_user_based(train, k=2)
user_pred_masked = user_pred[test > 0]
test_masked = test[test > 0]
user_rmse = np.sqrt(mean_squared_error(test_masked, user_pred_masked))

# Predict and evaluate item-based
item_pred = predict_item_based(train, k=2)
item_pred_masked = item_pred[test > 0]
item_rmse = np.sqrt(mean_squared_error(test_masked, item_pred_masked))

print(f"User-based CF RMSE: {user_rmse:.2f}")
print(f"Item-based CF RMSE: {item_rmse:.2f}")
Implemented item-based collaborative filtering as an alternative to user-based.
Used cosine similarity and k=2 neighbors for both methods.
Evaluated both models on the same train/test split to compare RMSE.
Results Interpretation

Before: User-based CF RMSE on validation was 1.20 (high error, overfitting).

After: On the masked test set, user-based CF scored an RMSE of 1.10, while item-based CF dropped to 0.95, showing better generalization.

Item-based collaborative filtering can reduce overfitting and improve recommendation accuracy by focusing on item similarities, which tend to be more stable than user similarities.
Bonus Experiment
Try increasing the number of neighbors (k) to 3 or 4 in item-based CF and observe how RMSE changes.
💡 Hint
More neighbors can smooth predictions but too many may include less similar items, increasing error.
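One way to run this bonus experiment end to end is the self-contained sweep below. It is a sketch, not the solution code above: it recomputes item-item cosine similarity with plain NumPy instead of `NearestNeighbors`, and it holds out one *observed* rating per user (a small variation on the split above, which can sample an unrated cell and leave a user without a test rating).

```python
import numpy as np

# Toy rating matrix from the experiment (rows: users, columns: movies)
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

# Hold out one observed rating per user
rng = np.random.default_rng(42)
train = ratings.copy()
test = np.zeros_like(ratings)
for user in range(ratings.shape[0]):
    idx = rng.choice(np.flatnonzero(ratings[user]))
    test[user, idx] = ratings[user, idx]
    train[user, idx] = 0.0

def item_based_predict(train, k):
    """Predict every cell from the k most similar items (cosine similarity)."""
    norms = np.linalg.norm(train, axis=0)
    norms[norms == 0] = 1.0                     # guard against all-zero columns
    sim = (train.T @ train) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)                  # an item is not its own neighbor
    pred = np.zeros_like(train)
    for item in range(train.shape[1]):
        neighbors = np.argsort(sim[item])[::-1][:k]   # top-k most similar items
        weights = sim[item, neighbors]
        if weights.sum() > 0:
            pred[:, item] = train[:, neighbors] @ weights / weights.sum()
    return pred

mask = test > 0
for k in (2, 3, 4):
    pred = item_based_predict(train, k)
    rmse = np.sqrt(np.mean((pred[mask] - test[mask]) ** 2))
    print(f"k={k}: item-based RMSE = {rmse:.2f}")
```

With only 4 movies, k=4 already includes every other item (the item itself has weight 0), so the interesting comparison is between k=2 and k=3; on a larger catalog the trade-off in the hint becomes much more visible.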