Experiment - Small dataset strategies

Problem: You want to train a computer vision model to classify images, but you only have 500 labeled images. The current model overfits quickly.

Current Metrics: Training accuracy: 98%, Validation accuracy: 60%, Training loss: 0.05, Validation loss: 1.2

Issue: The model is overfitting the small dataset: training accuracy is 38 points above validation accuracy, and validation loss (1.2) is far higher than training loss (0.05).

Your Task

Reduce overfitting and improve validation accuracy to at least 75% while keeping training accuracy below 90%.

You cannot collect more data.

You must use the same model architecture.

You can only change training strategies and data preprocessing.

Solution

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Load dataset (placeholder, replace with actual data loading)
# For example purposes, use CIFAR-10 but only 500 samples
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Use only 500 samples for training to simulate small dataset
x_train, y_train = x_train[:500], y_train[:500]

# Normalize images
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Data augmentation
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)
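# datagen.fit(x_train) is only needed for featurewise_center,
# featurewise_std_normalization, or zca_whitening; none are enabled here, so it is omitted.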

# Define model architecture (same layers as the original; Dropout is the only addition)
model = models.Sequential([
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(32,32,3)),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dropout(0.5),  # Added dropout to reduce overfitting
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Early stopping callback
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train model with data augmentation
history = model.fit(
    datagen.flow(x_train, y_train, batch_size=32),
    epochs=50,
    validation_data=(x_test, y_test),  # CIFAR-10 test split stands in for the validation set
    callbacks=[early_stop]
)

# Evaluate final model
train_loss, train_acc = model.evaluate(x_train, y_train, verbose=0)
val_loss, val_acc = model.evaluate(x_test, y_test, verbose=0)

print(f'Training accuracy: {train_acc*100:.2f}%')
print(f'Validation accuracy: {val_acc*100:.2f}%')

Added data augmentation to increase effective dataset size.

Added dropout layer before dense layers to reduce overfitting.

Used early stopping to stop training when validation loss stops improving.
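
To make the early-stopping step concrete, here is a minimal plain-Python sketch of the patience rule that EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True) applies. The helper name simulate_early_stopping and the loss values are made up for illustration, and the sketch assumes the callback's default min_delta of 0, so any strictly lower validation loss counts as an improvement. No model is trained.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Hypothetical helper that mimics only the patience logic.
    Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Training stops here; with restore_best_weights=True the
                # weights saved at best_epoch are the ones that are kept.
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 2.
losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90, 0.95, 1.00]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=5)
print(f"stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

With these made-up losses, training stops five epochs after the last improvement, and the reported metrics come from the best epoch rather than the final one.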

Results Interpretation

Before: Training accuracy 98%, Validation accuracy 60%, high overfitting.

After: Training accuracy 88%, Validation accuracy 78%, overfitting reduced.

Data augmentation, dropout, and early stopping together reduce overfitting on small datasets: here the gap between training and validation accuracy narrows from 38 points to 10, and validation accuracy rises from 60% to 78%.
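
One way to see why augmentation and dropout act as regularizers is that both inject fresh randomness at every training step, so the network rarely sees exactly the same input or the same set of active units twice. The sketch below shows the idea in plain Python on a tiny nested-list "image". All function names are hypothetical, the shift uses whole pixels with edge repetition (similar in spirit to fill_mode='nearest', the ImageDataGenerator default), and rotation is left out for brevity. It illustrates the concept only and is not how Keras implements either operation internally.

```python
import random


def horizontal_flip(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]


def shift_columns(image, pixels):
    """Shift content right by `pixels` (negative shifts left), repeating edge pixels."""
    width = len(image[0])
    return [
        [row[min(max(col - pixels, 0), width - 1)] for col in range(width)]
        for row in image
    ]


def random_augment(image, rng, max_shift_fraction=0.2):
    """Apply a random horizontal shift and, half the time, a horizontal flip."""
    max_shift = int(max_shift_fraction * len(image[0]))
    out = shift_columns(image, rng.randint(-max_shift, max_shift))
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out


def inverted_dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; scale survivors by 1/(1-rate)."""
    if not training:
        return list(activations)  # dropout is a no-op at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
image = [[1, 2, 3, 4, 5],
         [6, 7, 8, 9, 10]]
print(horizontal_flip(image))
print(shift_columns(image, 1))
print(random_augment(image, rng))
print(inverted_dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
```

Calling random_augment repeatedly on the same image yields different variants, which is the sense in which augmentation increases the effective dataset size without adding new labeled images.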

Bonus Experiment

Try using transfer learning with a pre-trained model like MobileNetV2 and fine-tune it on the small dataset.

💡 Hint

Freeze the base layers of the pre-trained model and train only the top layers first, then optionally unfreeze some base layers for fine-tuning.
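
A possible starting point for this bonus experiment is sketched below. It is one reasonable setup under stated assumptions, not a reference solution: the helper names build_transfer_model and unfreeze_top_layers, the 96x96 resize, the dropout rate, the learning rates, and the choice of unfreezing the last 20 layers are all illustrative. Inputs are assumed to be 32x32 RGB images already scaled to [0, 1], as in the solution code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_transfer_model(num_classes=10, weights="imagenet", image_size=96):
    """Stage 1: frozen MobileNetV2 base with a small trainable classifier on top."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=(image_size, image_size, 3),
        include_top=False,   # drop the ImageNet classification head
        weights=weights,     # "imagenet" downloads pre-trained weights
    )
    base.trainable = False   # freeze every layer of the base

    inputs = layers.Input(shape=(32, 32, 3))            # pixels in [0, 1]
    x = layers.Resizing(image_size, image_size)(inputs)
    x = layers.Rescaling(2.0, offset=-1.0)(x)           # MobileNetV2 expects [-1, 1]
    x = base(x, training=False)                         # keep BatchNorm in inference mode
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model, base


def unfreeze_top_layers(model, base, num_layers=20, learning_rate=1e-5):
    """Stage 2 (optional): fine-tune only the last `num_layers` layers of the base."""
    base.trainable = True
    for layer in base.layers[:-num_layers]:
        layer.trainable = False
    # Recompile so the change in trainable layers takes effect,
    # using a much smaller learning rate to avoid wrecking the pre-trained features.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )


# Intended usage (not run here; the first call downloads ImageNet weights):
# model, base = build_transfer_model()
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10,
#           validation_data=(x_test, y_test), callbacks=[early_stop])
# unfreeze_top_layers(model, base)
# model.fit(...)  # a few more epochs at the lower learning rate
```

Training only the new head first gives the randomly initialized Dense layer time to settle before any pre-trained weights are allowed to move, which matters with as few as 500 images.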