NLP · ~20 mins

TF-IDF (TfidfVectorizer) in NLP - ML Experiment: Train & Evaluate

Experiment - TF-IDF (TfidfVectorizer)
Problem: You want to convert a collection of text documents into numerical vectors that reflect how important each word is to each document, and you are using TF-IDF to do this.
Current Metrics: The current TF-IDF vectorizer uses default settings and produces very large feature vectors that include many uninformative words.
Issue: The model is slow and the vectors are too large because many common words that do not help distinguish documents are included.
Your Task
Improve the TF-IDF vectorizer by reducing the number of features while keeping important words, to make the vectors smaller and more meaningful.
You must use TfidfVectorizer from sklearn.
You cannot remove the TF-IDF method itself.
You should not reduce the dataset size.
Solution
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
texts = [
    "The cat sat on the mat.",
    "The dog ate my homework.",
    "Cats and dogs are great pets.",
    "I love my pet cat."
]

# Improved TF-IDF vectorizer with stop words removal and max features
vectorizer = TfidfVectorizer(stop_words='english', max_features=10)

# Fit and transform the texts
X = vectorizer.fit_transform(texts)

# Show feature names and the TF-IDF matrix shape
features = vectorizer.get_feature_names_out()
shape = X.shape

print(f"Features: {features}")
print(f"TF-IDF matrix shape: {shape}")
Added stop_words='english' to remove common words like 'the', 'and', 'is'.
Set max_features=10 to limit the number of features to the 10 most important words.
Results Interpretation

Before: TF-IDF matrix shape was (4, 17), with common words like 'the', 'my', and 'and' included.

After: TF-IDF matrix shape is (4, 10), with stop words removed and the least frequent remaining words dropped by max_features.

Removing stop words and limiting features helps create smaller, more meaningful TF-IDF vectors that improve model speed and focus on important words.
Bonus Experiment
Try using n-grams (such as bigrams) in the TF-IDF vectorizer to capture word pairs and see whether it improves the text representation.
💡 Hint
Set the 'ngram_range' parameter to (1, 2) in TfidfVectorizer to include single words and pairs of words.