Training data preparation in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Training data preparation

Which Metric Matters for Training Data Preparation and Why

When preparing training data, the main thing to measure is data quality: how clean, balanced, and relevant the data is. Good data helps the model learn the right patterns and make correct predictions.

Two concrete measures are class balance (how evenly the classes are represented) and missing value rate (the fraction of values that are absent). If the data is messy or biased, the model's accuracy, precision, and recall will usually suffer.

Confusion Matrix or Equivalent Visualization

Confusion Matrix Example (after training with good data):

                 Predicted
                 Pos   Neg
Actual   Pos      90    10
         Neg      15    85

- Total samples = 90 + 10 + 15 + 85 = 200
- Accuracy = (90 + 85) / 200 = 0.875
- Precision = 90 / (90 + 15) ≈ 0.857
- Recall = 90 / (90 + 10) = 0.9

If the training data is poor, these numbers typically drop, a sign that the model has learned the wrong patterns.
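
The arithmetic for the matrix above can be reproduced in a few lines. This is a minimal sketch in plain Python; the names `tp`, `fn`, `fp`, and `tn` are labels chosen here for the four cells and do not come from the original text.

```python
# Cells of the example confusion matrix (rows = actual, columns = predicted).
tp, fn = 90, 10   # actual Pos: predicted Pos / predicted Neg
fp, tn = 15, 85   # actual Neg: predicted Pos / predicted Neg

total = tp + fn + fp + tn
accuracy = (tp + tn) / total        # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are real
recall = tp / (tp + fn)             # share of real positives that were found

print(f"total={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f}")
```

Swapping in the counts from your own evaluation run gives the same three metrics for your model.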

Precision vs Recall Tradeoff with Examples

Good training data helps balance precision and recall. For example:

- Spam filter: High precision means few good emails marked as spam. Training data must include many examples of real spam and real emails.
- Medical diagnosis: High recall means catching most sick patients. Training data must have enough positive cases to teach the model.

If training data is biased or missing classes, the model may have high precision but low recall, or vice versa.

What "Good" vs "Bad" Metric Values Look Like for Training Data Preparation

Good training data:

- Balanced classes (e.g., 50% positive, 50% negative)
- Low missing data (<5%)
- Clear, correct labels
- Typically results in model metrics such as accuracy > 85%, with precision and recall both > 80% (rules of thumb; real targets depend on the task)

Bad training data:

- Highly imbalanced classes (e.g., 95% negative, 5% positive)
- Lots of missing or noisy data (>20%)
- Incorrect or inconsistent labels
- Typically results in high accuracy but very low recall or precision (e.g., recall < 50%)
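
The two data-side checks in these lists, class balance and missing value rate, are easy to compute before any training happens. The sketch below uses only the standard library; the helper names `class_balance` and `missing_value_rate` and the toy dataset are hypothetical, and it assumes missing values are stored as `None` and that the inputs are non-empty.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the labels as a fraction."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in counts.items()}

def missing_value_rate(rows):
    """Return the fraction of cells that are None across all rows."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)

# Hypothetical toy dataset: (age, income) features with a binary label.
rows = [(34, 52000), (41, None), (29, 48000), (None, 61000), (50, 75000)]
labels = ["neg", "neg", "neg", "neg", "pos"]

balance = class_balance(labels)
missing = missing_value_rate(rows)

print(balance)
print(f"missing value rate: {missing:.0%}")
```

On this toy data the classes split 80% to 20% and 20% of the cells are missing, so it falls short of the "good" thresholds listed above on both counts.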

Common Metrics Pitfalls in Training Data Preparation

- Accuracy paradox: High accuracy can be misleading when classes are imbalanced. For example, on a dataset that is 95% negative, a model that always predicts the majority class scores 95% accuracy while finding no positives.
- Data leakage: When test data leaks into the training set, metrics look near perfect but the model fails in real use.
- Overfitting indicators: Very high training accuracy with low test accuracy suggests the model memorized the training data (including its noise) instead of learning patterns that generalize.
- Ignoring class balance: Leads to poor recall or precision on minority classes.
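
The accuracy paradox is easiest to see with a baseline that never looks at its input. The following sketch assumes a hypothetical test set of 95 negatives and 5 positives and a "model" that always predicts the majority class; none of these values come from a real dataset.

```python
# Hypothetical test set: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# Baseline "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Comparing a trained model against this kind of baseline is a quick way to tell whether a high accuracy figure reflects learning or just the class distribution.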

Self-Check Question

Your model has 98% accuracy but only 12% recall on fraud cases. Is it good for production? Why or why not?

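Answer: No, it is not ready for production. With 12% recall the model misses 88% of fraud cases, which is dangerous. The 98% accuracy is misleading because fraud is rare (class imbalance), so correctly labeling the legitimate majority dominates the score. Improving recall requires better training data, for example more correctly labeled fraud cases.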
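
To see how 98% accuracy and 12% recall can coexist, it helps to write out one set of counts that produces both. The numbers below are hypothetical, chosen only to match the two percentages in the question: 10,000 transactions, 200 of them fraud.

```python
# Hypothetical counts consistent with 98% accuracy and 12% recall.
tp, fn = 24, 176     # fraud cases caught / fraud cases missed
fp, tn = 24, 9776    # legitimate flagged as fraud / legitimate passed

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)
missed_share = fn / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} missed={missed_share:.2f}")
```

Almost all of the accuracy comes from the 9,776 legitimate transactions, which is why the score stays high even though 176 of the 200 fraud cases slip through.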

Key Result

Good training data leads to balanced precision and recall, avoiding misleading high accuracy from poor data.

Practice

1. What is the main purpose of training data preparation in machine learning? (easy)

A. To clean and organize data for better model learning
B. To create the final model architecture
C. To deploy the model to production
D. To write the code for model training

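Solution

Step 1: Understand the role of training data preparation

Training data preparation cleans and organizes raw data so the model can learn from it well.

Step 2: Differentiate from other steps in machine learning

Designing the model architecture (B), deploying to production (C), and writing the training code (D) are separate stages of a machine learning project.

Final Answer: A. To clean and organize data for better model learning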
