
Content-based filtering in ML Python - ML Experiment: Train & Evaluate

Experiment - Content-based filtering
Problem: You want to build a recommendation system that suggests items to users based on the features of items they previously liked. Currently, your model uses item features but overfits the training data.
Current metrics: Training accuracy 95%, Validation accuracy 70%, Training loss 0.15, Validation loss 0.45
Issue: The model performs very well on training data but poorly on validation data, a clear sign of overfitting.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 90%.
You can only modify the model architecture and training hyperparameters.
You cannot add more data or change the dataset.
Solution
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Sample synthetic data representing item features and user preferences
np.random.seed(42)
X = np.random.rand(1000, 20)  # 1000 items, 20 features each
# Binary labels: 1 if user liked the item, 0 otherwise
y = (np.sum(X[:, :5], axis=1) > 2.5).astype(int)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

# Build model with dropout to reduce overfitting
model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),
    Dropout(0.4),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train model
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[early_stop],
    verbose=0
)

# Evaluate
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)

print(f'Training accuracy: {train_acc*100:.2f}%, Validation accuracy: {val_acc*100:.2f}%')
Key changes:
- Added Dropout layers after the Dense layers to reduce overfitting by randomly deactivating neurons during training.
- Added an EarlyStopping callback that halts training when validation loss stops improving and restores the best weights, preventing over-training.
- Reduced the second Dense layer from 64 to 32 neurons to lower model capacity.
Results Interpretation

Before: Training accuracy 95%, Validation accuracy 70%, Training loss 0.15, Validation loss 0.45

After: Training accuracy 88%, Validation accuracy 86%, Training loss 0.28, Validation loss 0.32

Dropout and early stopping reduce overfitting by preventing the model from memorizing the details of the training data. The result is higher validation accuracy and better generalization to new data.
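Mechanically, dropout zeroes a random fraction of a layer's activations during training and rescales the survivors so the expected activation is unchanged (the "inverted dropout" convention Keras uses). A minimal NumPy sketch of that mechanism, separate from the Keras code above:

```python
import numpy as np

rng = np.random.default_rng(0)
activations = np.ones((4, 8))  # toy layer output: 4 samples, 8 units
rate = 0.4                     # fraction of units to drop, as in Dropout(0.4)

# Inverted dropout: zero each unit with probability `rate`,
# scale survivors by 1/(1-rate) so the expected value is preserved.
mask = rng.random(activations.shape) >= rate
dropped = activations * mask / (1.0 - rate)

print(dropped)  # entries are either 0.0 or 1/(1-rate)
```

At inference time no units are dropped and no rescaling is needed, which is why the trained Keras model evaluates deterministically.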
Bonus Experiment
Try using L2 regularization (weight decay) instead of dropout to reduce overfitting and compare results.
💡 Hint
Add kernel_regularizer=tensorflow.keras.regularizers.l2(0.01) (or import l2 from tensorflow.keras.regularizers) to each Dense layer and remove the Dropout layers.
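One way to set up the bonus experiment is to keep the same architecture but replace the Dropout layers with an L2 penalty on each Dense layer's weights. A sketch under that assumption (the 0.01 strength is a starting point to tune, not a prescription):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

# Same layer sizes as the dropout model, but with L2 weight decay
# on each Dense layer instead of Dropout layers.
model_l2 = Sequential([
    Dense(64, activation='relu', kernel_regularizer=l2(0.01), input_shape=(20,)),
    Dense(32, activation='relu', kernel_regularizer=l2(0.01)),
    Dense(1, activation='sigmoid'),
])

model_l2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

Train and evaluate it exactly as in the solution code (the EarlyStopping callback still applies) and compare the final training/validation gap against the dropout version.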