Prompt Engineering / GenAI · ~20 mins

Parent-child document retrieval in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Parent-child document retrieval
Problem: You want to build a model that retrieves child documents based on their parent documents in a dataset. The current model retrieves child documents but often misses relevant ones or retrieves irrelevant children.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Validation loss: 0.85
Issue: The model is overfitting. It performs very well on training data but poorly on validation data, indicating it does not generalize well to new parent-child pairs.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
You cannot change the dataset or add more data.
You must keep the parent-child retrieval architecture but can adjust model hyperparameters and add regularization.
Solution
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

# Sample parent-child retrieval model
input_parent = layers.Input(shape=(100,), name='parent_input')
input_child = layers.Input(shape=(100,), name='child_input')

# Shared embedding layer
embedding = layers.Dense(64, activation='relu')
parent_emb = embedding(input_parent)
child_emb = embedding(input_child)

# Add dropout to reduce overfitting
parent_emb = layers.Dropout(0.3)(parent_emb)
child_emb = layers.Dropout(0.3)(child_emb)

# Combine embeddings
combined = layers.concatenate([parent_emb, child_emb])

# Smaller dense layers
x = layers.Dense(32, activation='relu')(combined)
x = layers.Dropout(0.3)(x)
output = layers.Dense(1, activation='sigmoid')(x)

model = models.Model(inputs=[input_parent, input_child], outputs=output)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Early stopping callback
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Assuming X_train_parent, X_train_child, y_train, X_val_parent, X_val_child, y_val are defined
# model.fit([X_train_parent, X_train_child], y_train, epochs=50, batch_size=32, validation_data=([X_val_parent, X_val_child], y_val), callbacks=[early_stop])
What changed:
Added dropout layers after the embedding and dense layers to reduce overfitting.
Reduced the dense layer size from 64 to 32 units to simplify the model.
Lowered the learning rate from 0.001 to 0.0005 for smoother training.
Added early stopping to halt training when validation loss stops improving.
Results Interpretation

Before: Training accuracy was 95%, validation accuracy was 70%, showing overfitting.

After: Training accuracy dropped to 90%, validation accuracy improved to 87%, and validation loss decreased, indicating better generalization.

Adding dropout, reducing model complexity, lowering learning rate, and using early stopping help reduce overfitting and improve validation accuracy in parent-child document retrieval models.
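Since the task allows adding regularization, an L2 weight penalty on the shared embedding layer is another option alongside dropout. A minimal sketch of the same retrieval model with weight decay added (the 1e-4 penalty strength is an illustrative value, not taken from the experiment):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

input_parent = layers.Input(shape=(100,), name='parent_input')
input_child = layers.Input(shape=(100,), name='child_input')

# Shared embedding layer with an L2 penalty on its weights;
# the penalty is added to the training loss automatically
embedding = layers.Dense(64, activation='relu',
                         kernel_regularizer=regularizers.l2(1e-4))
parent_emb = embedding(input_parent)
child_emb = embedding(input_child)

combined = layers.concatenate([parent_emb, child_emb])
output = layers.Dense(1, activation='sigmoid')(combined)

model = models.Model(inputs=[input_parent, input_child], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```

Like dropout, the penalty discourages the model from memorizing the training pairs; unlike dropout, it acts on the weights themselves rather than the activations, so the two can be combined.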
Bonus Experiment
Try using a contrastive loss function instead of binary crossentropy to better learn the relationship between parent and child documents.
💡 Hint
Contrastive loss encourages the model to bring related parent-child pairs closer in embedding space and push unrelated pairs apart, which can improve retrieval accuracy.
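One way to sketch this is a siamese variant of the model above whose output is the Euclidean distance between the parent and child embeddings, trained with a hand-written contrastive loss (the margin of 1.0 and the distance-based formulation are illustrative choices, not prescribed by the exercise):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def contrastive_loss(margin=1.0):
    # y_true: 1 for related parent-child pairs, 0 for unrelated ones
    # y_pred: Euclidean distance between the two embeddings
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        positive = y_true * tf.square(y_pred)                                # pull related pairs together
        negative = (1.0 - y_true) * tf.square(tf.maximum(margin - y_pred, 0.0))  # push unrelated pairs apart
        return tf.reduce_mean(positive + negative)
    return loss

input_parent = layers.Input(shape=(100,), name='parent_input')
input_child = layers.Input(shape=(100,), name='child_input')

# Same shared embedding layer as in the solution
embedding = layers.Dense(64, activation='relu')
parent_emb = embedding(input_parent)
child_emb = embedding(input_child)

# The model outputs a distance rather than a sigmoid score
distance = layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]),
                                    axis=1, keepdims=True) + 1e-9)
)([parent_emb, child_emb])

model = models.Model(inputs=[input_parent, input_child], outputs=distance)
model.compile(optimizer='adam', loss=contrastive_loss(margin=1.0))
```

At retrieval time, children can then be ranked by their distance to the parent embedding, with smaller distances indicating stronger parent-child relationships.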