
Bag of Words and TF-IDF in ML Python - Model Pipeline Trace


Model Pipeline - Bag of Words and TF-IDF

This pipeline converts text into numbers using Bag of Words and TF-IDF methods, then trains a simple model to classify text. It shows how raw words become useful numbers for learning.

Data Flow - 5 Stages

1. Raw Text Input

5 rows x 1 column → Collect raw text sentences → 5 rows x 1 column

["I love apples", "Apples are tasty", "I hate bananas", "Bananas are yellow", "I love yellow apples"]

↓

2. Bag of Words Vectorization

5 rows x 1 column → Convert text to word count vectors → 5 rows x 8 columns

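Vocabulary (columns, in order of first appearance): i, love, apples, are, tasty, hate, bananas, yellow

[[1,1,1,0,0,0,0,0], [0,0,1,1,1,0,0,0], [1,0,0,0,0,1,1,0], [0,0,0,1,0,0,1,1], [1,1,1,0,0,0,0,1]]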
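The counting step can be sketched in plain Python. This is a minimal illustration, not the page's own implementation: it assumes lowercase whitespace tokenization and assigns columns in order of first appearance, and the helper names (`tokenize`, `build_vocabulary`, `count_vector`) are hypothetical.

```python
docs = [
    "I love apples",
    "Apples are tasty",
    "I hate bananas",
    "Bananas are yellow",
    "I love yellow apples",
]

def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace (no punctuation handling).
    return text.lower().split()

def build_vocabulary(documents):
    # Map each distinct token to a column index, in order of first appearance.
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def count_vector(text, vocab):
    # One count per vocabulary column; tokens outside the vocabulary are ignored.
    vec = [0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

vocab = build_vocabulary(docs)
bow = [count_vector(doc, vocab) for doc in docs]

print(list(vocab))
for row in bow:
    print(row)
```

Library vectorizers may tokenize differently. For instance, scikit-learn's `CountVectorizer` with default settings drops single-character tokens such as "i" and sorts the vocabulary alphabetically, so it would yield 7 columns for these sentences rather than 8.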

↓

3. TF-IDF Transformation

5 rows x 8 columns → Scale word counts by importance across documents → 5 rows x 8 columns

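Values assume a smoothed IDF, ln((1 + N) / (1 + df)) + 1, with each row scaled to unit length; columns: i, love, apples, are, tasty, hate, bananas, yellow

[[0.54,0.65,0.54,0,0,0,0,0], [0,0,0.46,0.56,0.69,0,0,0], [0.46,0,0,0,0,0.69,0.56,0], [0,0,0,0.58,0,0,0.58,0.58], [0.45,0.54,0.45,0,0,0,0,0.54]]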
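One way to carry out this scaling is sketched below. The page does not say which TF-IDF variant it uses, so the formula here is an assumption: raw counts as term frequency, a smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The count matrix is written out for the five example sentences with columns i, love, apples, are, tasty, hate, bananas, yellow.

```python
import math

# Word counts for the five sentences
# (columns: i, love, apples, are, tasty, hate, bananas, yellow).
counts = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # "I love apples"
    [0, 0, 1, 1, 1, 0, 0, 0],  # "Apples are tasty"
    [1, 0, 0, 0, 0, 1, 1, 0],  # "I hate bananas"
    [0, 0, 0, 1, 0, 0, 1, 1],  # "Bananas are yellow"
    [1, 1, 1, 0, 0, 0, 0, 1],  # "I love yellow apples"
]

def tfidf(matrix):
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF (assumed convention): ln((1 + N) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in matrix:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2-normalize so every document vector has length 1.
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

weights = tfidf(counts)
for row in weights:
    print([round(v, 2) for v in row])
```

Under this convention a word found in only one sentence ("tasty", "hate") receives a larger weight than a word found in three ("i", "apples"), which is the rescaling the stage describes.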

↓

4. Train/Test Split

5 rows x 8 columns → Split data into training and testing sets → 3 rows x 8 columns (train), 2 rows x 8 columns (test)

Train: 3 samples, Test: 2 samples
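A 3/2 split like this can be sketched with a seeded shuffle. The function below is a hypothetical helper, not a library call, and the labels (1 = positive, 0 = negative) are assumed for illustration because the page does not list them. The feature rows are placeholders standing in for the 8-column TF-IDF vectors.

```python
import random

def train_test_split(rows, labels, test_size=2, seed=0):
    # Shuffle row indices with a fixed seed so the split is reproducible.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Placeholder 8-column feature rows, one per sentence.
rows = [[float(i)] * 8 for i in range(5)]
# Hypothetical sentiment labels; not given in the source.
labels = [1, 1, 0, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(rows, labels)
print(len(X_train), "train rows,", len(X_test), "test rows")
```

In this pipeline the TF-IDF weights are computed before the split, so the IDF values are influenced by the test sentences. A common alternative is to split first and fit the vectorizer on the training rows only.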

↓

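5. Model Training

3 rows x 8 columns → Train a logistic regression classifier → Trained model

The model learns one weight per vocabulary column so that positive and negative sentences fall on opposite sides of a decision boundary. This requires a sentiment label for each training sentence.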
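The training step can be sketched as logistic regression fitted by full-batch gradient descent. Everything specific here is an assumption made for illustration: the three training rows (rounded TF-IDF-style vectors for three of the sentences), the labels, the learning rate, and the epoch count. Library implementations typically use other solvers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    # Linear score: dot product of weights and features, plus bias.
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_logistic_regression(X, y, epochs=50, lr=1.0):
    # Weights and bias start at zero; each epoch is one full-batch gradient step.
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi  # d(log loss)/d(score)
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Assumed training rows (columns: i, love, apples, are, tasty, hate, bananas, yellow).
X_train = [
    [0.54, 0.65, 0.54, 0.0, 0.0, 0.0, 0.0, 0.0],   # "I love apples"
    [0.46, 0.0, 0.0, 0.0, 0.0, 0.69, 0.56, 0.0],   # "I hate bananas"
    [0.45, 0.54, 0.45, 0.0, 0.0, 0.0, 0.0, 0.54],  # "I love yellow apples"
]
y_train = [1, 0, 1]  # hypothetical labels: 1 = positive, 0 = negative

w, b = train_logistic_regression(X_train, y_train)
probs = [sigmoid(score(w, b, x)) for x in X_train]
print([round(p, 2) for p in probs])
```

After training, the weight on "hate" is negative and the weight on "love" is positive, which is what lets the model separate the two classes on this tiny set.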

Training Trace - Epoch by Epoch

[Chart: training loss (y-axis, 0.3 to 0.7) plotted against epochs 1 to 5 (x-axis)]

Epoch	Loss ↓	Accuracy ↑	Observation
1	0.65	0.60	Model starts learning with moderate loss and accuracy
2	0.48	0.75	Loss decreases and accuracy improves
3	0.35	0.85	Loss keeps falling and accuracy continues to improve
4	0.30	0.90	Loss lowers further, accuracy near optimal
5	0.28	0.92	Training stabilizes with high accuracy
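The two metric columns in a trace like this are usually computed from the model's predicted probabilities. The sketch below assumes binary log loss (cross-entropy) and accuracy at a 0.5 threshold; the labels and probability values are made up to show an early and a late epoch, and do not reproduce the table.

```python
import math

def log_loss(y_true, probs, eps=1e-12):
    # Mean binary cross-entropy; eps keeps log() away from zero.
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, probs, threshold=0.5):
    # Fraction of samples whose thresholded prediction matches the label.
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)

y_true = [1, 0, 1]                 # hypothetical labels
probs_early = [0.55, 0.52, 0.58]   # made-up probabilities, early in training
probs_late = [0.85, 0.20, 0.90]    # made-up probabilities, late in training

for name, probs in [("early", probs_early), ("late", probs_late)]:
    print(name, round(log_loss(y_true, probs), 3), round(accuracy(y_true, probs), 2))
```

Loss responds to every change in the probabilities, while accuracy moves only when a prediction crosses the threshold. With three training samples, accuracy can only be 0, 1/3, 2/3, or 1, so the finer-grained values in the table are best read as illustrative.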

Prediction Trace - 5 Layers

Layer 1: Input Text

Layer 2: Bag of Words Vectorization

Layer 3: TF-IDF Transformation

Layer 4: Model Prediction

Layer 5: Thresholding
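The five layers can be followed for a single new sentence with the sketch below. The fitted values are assumptions: the IDF list corresponds to a smoothed IDF over the five example sentences, rounded to two decimals, and the model weights and bias are invented purely for illustration.

```python
import math

# Hypothetical fitted artifacts (illustrative values only).
vocab = {"i": 0, "love": 1, "apples": 2, "are": 3,
         "tasty": 4, "hate": 5, "bananas": 6, "yellow": 7}
idf = [1.41, 1.69, 1.41, 1.69, 2.10, 2.10, 1.69, 1.69]
weights = [0.1, 1.2, 0.6, 0.0, 0.8, -1.5, -0.7, 0.1]
bias = 0.0

def predict(text, threshold=0.5):
    # Layer 1: input text, lowercased and split into tokens.
    tokens = text.lower().split()
    # Layer 2: bag-of-words counts; unseen words are ignored.
    counts = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            counts[vocab[token]] += 1
    # Layer 3: TF-IDF weighting followed by L2 normalization.
    weighted = [c * w for c, w in zip(counts, idf)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
    features = [v / norm for v in weighted]
    # Layer 4: model prediction as a probability of the positive class.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    prob = 1.0 / (1.0 + math.exp(-z))
    # Layer 5: thresholding turns the probability into a label.
    label = "positive" if prob >= threshold else "negative"
    return prob, label

for sentence in ["I love tasty apples", "I hate yellow bananas"]:
    prob, label = predict(sentence)
    print(sentence, "->", round(prob, 2), label)
```

The vocabulary and IDF values are reused from training rather than recomputed, so a word the model never saw contributes nothing to the prediction.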

Model Quiz - 3 Questions

Test your understanding

What does the TF-IDF step do to the word counts?

A. It changes words into random numbers

B. It removes all words that appear only once

C. It scales counts by how common words are across all texts

D. It sorts words alphabetically

Key Insight

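Bag of Words counts how often each word occurs. TF-IDF rescales those counts so that words appearing in many documents get less weight and words concentrated in a few documents get more. This often helps a classifier focus on the words that distinguish one class from another, though the gain in accuracy depends on the data.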