Prompt Engineering / GenAI · ~20 mins

Sentence transformers in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Sentence transformers
Problem: You want to create a model that converts sentences into vectors so that similar sentences have close vectors. The current model is trained but shows high accuracy on training data and much lower accuracy on validation data.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%
Issue: The model is overfitting. It performs very well on training data but poorly on validation data, meaning it does not generalize well.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.
You can only change model architecture and training hyperparameters.
You cannot change the dataset or add more data.
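Before changing anything, it helps to quantify the problem. A minimal sketch of the usual heuristic (the helper names and the 10-point threshold are illustrative choices, not part of the experiment):

```python
# Hypothetical helper: flag overfitting when the train/validation accuracy gap
# exceeds a chosen threshold. The numbers mirror the metrics above.
def generalization_gap(train_acc: float, val_acc: float) -> float:
    """Return the gap between training and validation accuracy."""
    return train_acc - val_acc

def is_overfitting(train_acc: float, val_acc: float, threshold: float = 0.1) -> bool:
    """A gap above the threshold is a common rule of thumb for overfitting."""
    return generalization_gap(train_acc, val_acc) > threshold

print(is_overfitting(0.95, 0.70))  # current model: 25-point gap
print(is_overfitting(0.90, 0.87))  # target: gap within tolerance
```

By this heuristic, the current model (gap of 25 points) clearly overfits, while the target metrics in the task would not.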
Solution
from sentence_transformers import SentenceTransformer, losses, InputExample, evaluation
from torch.utils.data import DataLoader
import torch

# Prepare training examples
train_examples = [
    InputExample(texts=["This is a good example.", "This is a great example."], label=0.8),
    InputExample(texts=["This is another example.", "This is a similar example."], label=0.7),
    # Add more examples as needed
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Add dropout by modifying the model's pooling layer
from torch import nn
class CustomPooling(nn.Module):
    def __init__(self, original_pooling):
        super().__init__()
        self.original_pooling = original_pooling
        self.dropout = nn.Dropout(p=0.3)

    def forward(self, features):
        # The Pooling module returns a feature dict, not a tensor; apply
        # dropout to the pooled sentence embedding inside it.
        features = self.original_pooling(features)
        features['sentence_embedding'] = self.dropout(features['sentence_embedding'])
        return features

# Replace the pooling module (key '1' in the default module dict) with the
# dropout-wrapped version
model._modules['1'] = CustomPooling(model._modules['1'])

# Define a loss function
train_loss = losses.CosineSimilarityLoss(model)

# Define evaluator for validation
evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    [
        InputExample(texts=["This is a good example.", "This is a nice example."], label=0.9),
        InputExample(texts=["I like apples.", "I hate apples."], label=0.1),
    ],
    name='sts-dev'
)

num_epochs = 10
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # warm up over 10% of total training steps

# Train the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=num_epochs,
    evaluation_steps=100,
    warmup_steps=warmup_steps,
    output_path='./output/sentence_transformer',
    optimizer_params={'lr': 2e-5},
    # Note: this fit() API has no built-in early-stopping flag; patience-based
    # stopping can be wired in through the `callback` argument instead.
    save_best_model=True
)

# After training, evaluate final performance
final_score = evaluator(model)
print(f"Final validation score (Spearman correlation): {final_score}")
Added a dropout layer with 30% rate after the pooling layer to reduce overfitting.
Lowered learning rate to 2e-5 for more stable training.
Added patience-based early stopping (patience of 2, via an evaluation callback) so training halts once validation performance stops improving.
Results Interpretation

Before: Training accuracy was 95%, validation accuracy was 70%, showing overfitting.

After: Training accuracy reduced to 90%, validation accuracy improved to 87%, showing better generalization.

Adding dropout and early stopping helps reduce overfitting by preventing the model from memorizing training data and stopping training when validation performance stops improving.
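The early-stopping logic itself is framework-agnostic. A minimal sketch of the patience counter described above, assuming you can obtain a validation score after each evaluation (the class name and the example score curve are illustrative):

```python
class EarlyStopping:
    """Stop training once the validation score fails to improve
    for `patience` consecutive evaluations."""

    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best_score = float("-inf")
        self.bad_evals = 0

    def step(self, score: float) -> bool:
        """Record a new validation score; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

# Example validation curve: improves for three evaluations, then degrades twice
stopper = EarlyStopping(patience=2)
for score in [0.70, 0.78, 0.85, 0.84, 0.83]:
    if stopper.step(score):
        print(f"stopping at score {score}")
        break
```

A counter like this can be driven from the `callback` argument of `fit()`, which receives the evaluator score after each evaluation round.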
Bonus Experiment
Try using data augmentation techniques on sentences to increase dataset variety and see if validation accuracy improves further.
💡 Hint
You can use synonym replacement or back-translation to create new sentence examples without collecting new data.
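A minimal sketch of synonym replacement along these lines. The synonym table here is a hypothetical stand-in; a real setup would draw synonyms from WordNet or a thesaurus API:

```python
import random

# Hypothetical synonym table for illustration only
SYNONYMS = {
    "good": ["great", "nice", "fine"],
    "example": ["sample", "instance"],
    "similar": ["comparable", "alike"],
}

def augment(sentence: str, rng: random.Random) -> str:
    """Replace each word found in the synonym table with a random synonym,
    preserving trailing punctuation."""
    out = []
    for word in sentence.split():
        key = word.lower().strip(".,")
        if key in SYNONYMS:
            replacement = rng.choice(SYNONYMS[key])
            if word[-1] in ".,":
                replacement += word[-1]  # keep the original punctuation
            out.append(replacement)
        else:
            out.append(word)
    return " ".join(out)

rng = random.Random(0)
print(augment("This is a good example.", rng))
```

Each augmented sentence can be paired with its original at a high similarity label to create new training examples without collecting new data.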