import numpy as np
from sklearn.metrics import mean_squared_error
class MatrixFactorization:
    """Matrix factorization trained with SGD on the observed entries of R.

    Learns user factors P (num_users x K) and item factors Q (num_items x K)
    such that P @ Q.T approximates R on cells where R > 0; zeros in R are
    treated as missing ratings and are skipped during training.
    """

    def __init__(self, R, K, alpha, beta, iterations):
        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K                    # number of latent factors
        self.alpha = alpha            # learning rate
        self.beta = beta              # L2 regularization strength
        self.iterations = iterations  # number of SGD epochs

    def train(self):
        """Run `iterations` SGD epochs over all observed (non-zero) cells.

        Initializes self.P and self.Q from a normal distribution, then applies
        per-cell gradient updates with L2 regularization.
        """
        # Initialize user and item latent feature matrices.
        self.P = np.random.normal(scale=1.0 / self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1.0 / self.K, size=(self.num_items, self.K))
        for _ in range(self.iterations):
            for i in range(self.num_users):
                for j in range(self.num_items):
                    if self.R[i, j] > 0:  # zeros mark missing ratings
                        # Prediction error for this observed cell.
                        # (No .T needed: both operands are 1-D vectors.)
                        eij = self.R[i, j] - self.P[i, :].dot(self.Q[j, :])
                        # BUG FIX: the original updated P[i] in place and then
                        # used the already-updated P[i] in the Q[j] gradient.
                        # Both gradients must use the pre-update factors, so
                        # snapshot P[i] before modifying it.
                        Pi = self.P[i, :].copy()
                        self.P[i, :] += self.alpha * (eij * self.Q[j, :] - self.beta * Pi)
                        self.Q[j, :] += self.alpha * (eij * Pi - self.beta * self.Q[j, :])

    def predict(self):
        """Return the full reconstructed rating matrix, P @ Q.T."""
        return self.P.dot(self.Q.T)
# Small example rating matrix; zeros denote missing (unrated) entries.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
    [0, 1, 5, 4],
], dtype=float)

# BUG FIX: the original train_mask was False only at cells where R was
# already 0 (missing), so the "validation set" contained no known ratings
# and the validation RMSE was computed entirely against zeros. Instead,
# hold out a few *known* ratings for validation, and never train or
# evaluate on truly-missing cells.
observed = R > 0
val_mask = np.zeros_like(observed)
val_mask[0, 1] = True  # known rating 3
val_mask[2, 3] = True  # known rating 5
val_mask[4, 2] = True  # known rating 5
val_mask &= observed   # safety: validation cells must be observed
train_mask = observed & ~val_mask

R_train = R * train_mask  # held-out and missing cells become 0 (ignored by train)

np.random.seed(42)  # reproducible factor initialization
mf = MatrixFactorization(R_train, K=2, alpha=0.01, beta=0.1, iterations=30)
mf.train()
R_pred = mf.predict()


def _rmse(truth, preds):
    """Root-mean-squared error via numpy (no sklearn version quirks)."""
    return float(np.sqrt(np.mean((np.asarray(truth) - np.asarray(preds)) ** 2)))


# RMSE on the ratings used for training vs. the held-out known ratings.
train_rmse = _rmse(R[train_mask], R_pred[train_mask])
val_rmse = _rmse(R[val_mask], R_pred[val_mask])
print(f"Training RMSE: {train_rmse:.2f}")
print(f"Validation RMSE: {val_rmse:.2f}")