
Training data preparation in Prompt Engineering / GenAI - Model Pipeline Trace

Model Pipeline - Training data preparation

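This pipeline shows how raw data is cleaned and organized before it is used to train a machine learning model. Each stage removes a specific obstacle to learning: missing values, irrelevant columns, non-numeric categories, and features on very different scales.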

Data Flow - 6 Stages
1. Raw data collection
Input: 1000 rows x 10 columns
Step: Collect data from various sources with missing and noisy values
Output: 1000 rows x 10 columns
Row example: {"age": 25, "income": null, "gender": "M", ...}
2. Data cleaning
Input: 1000 rows x 10 columns
Step: Fill missing values and remove duplicates
Output: 980 rows x 10 columns
Row example: {"age": 25, "income": 50000, "gender": "M", ...}
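
The cleaning stage can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's actual code: the source does not say which fill strategy is used, so the median is an assumption, as are the helper name `clean_rows` and the sample rows. Duplicates are dropped first here so that repeated rows do not skew the fill value.

```python
from statistics import median

def clean_rows(rows, column="income"):
    """Drop exact duplicate rows, then fill missing values in one column."""
    # Remove exact duplicates while preserving the original order.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # Assumption: fill missing values with the median of the observed ones.
    observed = [r[column] for r in unique if r[column] is not None]
    fill_value = median(observed)
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in unique
    ]

# Hypothetical sample rows, shaped like the row example above.
rows = [
    {"age": 25, "income": None, "gender": "M"},
    {"age": 40, "income": 60000, "gender": "F"},
    {"age": 31, "income": 40000, "gender": "M"},
    {"age": 40, "income": 60000, "gender": "F"},  # exact duplicate
]
cleaned = clean_rows(rows)
print(len(cleaned), cleaned[0]["income"])
```

With these sample rows, one duplicate is removed and the missing income is filled from the two remaining observed values.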
3. Feature selection
Input: 980 rows x 10 columns
Step: Keep only relevant columns for prediction
Output: 980 rows x 6 columns
Columns kept: age, income, gender, education, hours_worked, target
4. Data encoding
Input: 980 rows x 6 columns
Step: Convert categorical data to numbers
Output: 980 rows x 8 columns
Encoding: gender 'M' -> 1, 'F' -> 0; education is one-hot encoded, replacing one column with three indicator columns (6 - 1 + 3 = 8)
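
A rough sketch of this encoding step, assuming three education levels so that the column count goes from 6 to 8. The level names, the helper `encode_row`, and the sample values for `hours_worked` and `target` are hypothetical; only the gender mapping comes from the stage description.

```python
GENDER_CODES = {"M": 1, "F": 0}  # mapping given in the stage description

def encode_row(row, education_levels):
    """Binary-encode gender and one-hot encode education for a single row."""
    encoded = {k: v for k, v in row.items() if k not in ("gender", "education")}
    encoded["gender"] = GENDER_CODES[row["gender"]]
    # One indicator column per level: 1 for the row's level, 0 otherwise.
    for level in education_levels:
        encoded[f"education_{level}"] = 1 if row["education"] == level else 0
    return encoded

# Hypothetical level names; the source only implies that there are three.
levels = ["high_school", "bachelor", "master"]
row = {
    "age": 25, "income": 50000, "gender": "M",
    "education": "bachelor", "hours_worked": 40, "target": 1,
}
encoded = encode_row(row, levels)
print(len(encoded))  # 8
```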
5. Data normalization
Input: 980 rows x 8 columns
Step: Scale numeric features to the 0-1 range
Output: 980 rows x 8 columns
Example: age and income are each rescaled so that their values lie between 0 and 1
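
Scaling to the 0-1 range is commonly done with min-max scaling, (x - min) / (max - min). The source does not name the method, so treat the sketch below as one plausible reading; `min_max_scale` and the sample ages are illustrative.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 35, 50, 70]  # hypothetical raw ages
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

The youngest age maps to 0 and the oldest to 1, with the others spaced proportionally in between.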
6. Train/test split
Input: 980 rows x 8 columns
Step: Split data into training and testing sets (80/20)
Output: 784 rows x 8 columns (train), 196 rows x 8 columns (test)
Training set example row: {"age": 0.3, "income": 0.5, "gender": 1, ...}
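
The 784/196 split corresponds to holding out 20% of the 980 rows. A minimal sketch, assuming a shuffled split with a fixed seed (the source does not say whether rows are shuffled); the function name and parameters are illustrative.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(980))  # stand-ins for the 980 prepared rows
train, test = train_test_split(rows)
print(len(train), len(test))  # 784 196
```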
Training Trace - Epoch by Epoch

Loss plot: loss (y-axis, 0.3 to 0.7) against epoch (x-axis, 1 to 5); loss falls with each epoch.
Epoch  Loss ↓  Accuracy ↑  Observation
1      0.65    0.60        Model starts learning with moderate loss and accuracy
2      0.50    0.72        Loss decreases and accuracy improves as model learns
3      0.40    0.80        Model continues to improve with lower loss and higher accuracy
4      0.35    0.85        Training converges with good accuracy and low loss
5      0.32    0.87        Final epoch shows stable loss and accuracy
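
To make the two metrics concrete, the sketch below computes them for a handful of predictions. The source does not name the loss function; binary cross-entropy is assumed here because the output layer uses a sigmoid. The labels and probabilities are made up for illustration and are not taken from the trace.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    correct = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) == bool(y))
    return correct / len(y_true)

# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7]
loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(round(loss, 2), acc)  # 0.51 0.75
```

Loss rewards confident correct predictions and penalizes confident wrong ones, while accuracy only counts which side of the threshold each prediction falls on, which is why the two columns move together but not in lockstep.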
Prediction Trace - 3 Layers
Layer 1: Input features
Layer 2: Hidden layer with ReLU activation
Layer 3: Output layer with sigmoid activation
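
A single prediction through these three layers might look like the sketch below. The weights, biases, and hidden-layer size are invented for illustration, and only three input features are used to keep it short; the actual model's dimensions and parameters are not given in the source.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input -> hidden layer (ReLU) -> single output unit (sigmoid)."""
    hidden = [
        relu(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    logit = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return sigmoid(logit)

# Layer 1: input features (age, income, gender from the training-set example row).
features = [0.3, 0.5, 1.0]
# Hypothetical parameters for a hidden layer with two units.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, -0.5]]
hidden_biases = [0.0, 0.1]
output_weights = [1.0, -1.0]
output_bias = 0.0

probability = forward(features, hidden_weights, hidden_biases, output_weights, output_bias)
print(round(probability, 3))  # 0.537
```

The second hidden unit receives a negative input and is zeroed by ReLU, so only the first unit contributes to the output. The sigmoid then maps the result to a value between 0 and 1 that can be read as the probability of the positive class.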
Model Quiz - 3 Questions
Test your understanding
What happens to missing values during data cleaning?
A. They are filled or removed
B. They are left as is
C. They are converted to zeros
D. They are duplicated
Key Insight
Careful preparation (cleaning, feature selection, encoding, and normalization) gives the model complete, numeric, comparably scaled inputs. That generally makes training converge more easily, which shows up as lower loss and higher accuracy, as in the epoch trace above.