Prompt Engineering / GenAI (~20 mins)

Training data preparation in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Training data preparation
Problem: You want to train a text generation AI model, but your training data is messy. It has duplicate sentences, inconsistent formatting, and some irrelevant content.
Current Metrics: Training loss: 0.15, Validation loss: 0.45, Validation accuracy: 60%
Issue: The model is overfitting and not generalizing well because the training data quality is poor and inconsistent.
Your Task
Clean and prepare the training data to reduce overfitting and improve validation accuracy to at least 75%.
You cannot change the model architecture or training parameters.
You must only modify the training data preparation steps.
Solution
from sklearn.model_selection import train_test_split

# Sample raw data
raw_data = [
    "Hello world!  ",
    "Hello world!",
    "This is a test.",
    "Irrelevant content here.",
    "Another sentence.",
    "Another sentence.",
    "  This is a test.  "
]

# Step 1: Normalize (strip spaces, lowercase) and remove duplicates.
# dict.fromkeys deduplicates while preserving order; set() would give a
# nondeterministic order from run to run.
cleaned_data = list(dict.fromkeys(sentence.strip().lower() for sentence in raw_data))

# Step 2: Filter out irrelevant content (e.g., sentences containing 'irrelevant')
filtered_data = [s for s in cleaned_data if 'irrelevant' not in s]

# Step 3: Split into training and validation sets
train_data, val_data = train_test_split(filtered_data, test_size=0.3, random_state=42)

# Show prepared data
print(f"Training data: {train_data}")
print(f"Validation data: {val_data}")

# Note: This prepared data would then be used for model training.
Removed duplicate sentences to reduce bias.
Converted all text to lowercase and stripped extra spaces for consistency.
Filtered out irrelevant sentences to improve data quality.
Split data into training and validation sets properly.
Results Interpretation

Before: Training loss: 0.15, Validation loss: 0.45, Validation accuracy: 60%

After: Training loss: 0.18, Validation loss: 0.30, Validation accuracy: 78%

Cleaning and preparing training data properly helps the model learn better patterns and generalize well, reducing overfitting and improving validation accuracy.
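The overfitting signal in these metrics can be read off as the gap between validation and training loss. A minimal sketch using the numbers above (the helper name is our own, not part of the exercise):

```python
def generalization_gap(train_loss: float, val_loss: float) -> float:
    """Gap between validation and training loss; larger means more overfitting."""
    return val_loss - train_loss

# Before cleaning: large gap despite a low training loss -> overfitting.
before = generalization_gap(train_loss=0.15, val_loss=0.45)  # 0.30

# After cleaning: training loss rises slightly, but the gap shrinks,
# which is the pattern we want when generalization improves.
after = generalization_gap(train_loss=0.18, val_loss=0.30)  # 0.12

print(f"gap before: {before:.2f}, after: {after:.2f}")
```

Note that a slightly higher training loss after cleaning is expected: the model can no longer memorize duplicated examples.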
Bonus Experiment
Try augmenting the training data by adding synonyms or paraphrased sentences to increase data diversity.
💡 Hint
Use simple text augmentation techniques like replacing words with synonyms or rephrasing sentences to create more varied training examples.
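One way to sketch this is word-level synonym replacement. The sketch below uses a small hand-written synonym table as a hypothetical stand-in for a real lexical resource such as WordNet:

```python
import random

# Hypothetical synonym table for illustration only; in practice you would
# draw synonyms from a resource like WordNet or a paraphrase model.
SYNONYMS = {
    "test": ["trial", "check"],
    "sentence": ["phrase", "statement"],
}

def augment(sentence: str, rng=random) -> str:
    """Replace known words with a random synonym, keeping punctuation."""
    out = []
    for word in sentence.split():
        key = word.strip(".!,?").lower()
        if key in SYNONYMS and word.lower().startswith(key):
            suffix = word[len(key):]  # trailing punctuation, if any
            out.append(rng.choice(SYNONYMS[key]) + suffix)
        else:
            out.append(word)
    return " ".join(out)

# Each cleaned sentence yields a paraphrased variant for extra diversity.
extra = [augment(s) for s in ["this is a test.", "another sentence."]]
print(extra)
```

Augmented variants are added alongside the originals in the training split only; augmenting the validation set would leak the same paraphrases into evaluation.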