Challenge - 5 Problems
❓ Predict Output
Difficulty: intermediate
Output of TF-IDF vectorization with simple corpus
What is the shape of the TF-IDF matrix produced by the following code?
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['apple orange banana', 'banana fruit apple', 'fruit orange']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)
💡 Hint
Remember, rows correspond to documents and columns to unique words.
Explanation
The corpus has 3 documents and 4 unique words: 'apple', 'orange', 'banana', 'fruit'. So the TF-IDF matrix shape is (3, 4).
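The count behind that shape can be checked without scikit-learn. The sketch below is a plain-Python approximation that splits on whitespace. It agrees with TfidfVectorizer's default tokenization for this corpus only because every token is a lowercase word of two or more letters, so treat the split as an assumption rather than the library's tokenizer.

```python
corpus = ['apple orange banana', 'banana fruit apple', 'fruit orange']

# Collect the distinct tokens across all documents. The whitespace split is
# an assumption that happens to match the default tokenizer on this corpus.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# Rows are documents, columns are unique words.
shape = (len(corpus), len(vocabulary))
print(vocabulary)
print(shape)
```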
🧠 Conceptual
Difficulty: intermediate
Understanding IDF effect on common words
Which statement best describes the effect of IDF (Inverse Document Frequency) in TF-IDF vectorization?
💡 Hint
Think about how common words like 'the' or 'and' should be treated.
Explanation
IDF reduces the weight of words that appear in many documents to highlight more unique words.
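To make the down-weighting concrete, here is a small sketch of the smoothed IDF formula that scikit-learn applies when smooth_idf=True (the default): idf = ln((1 + n) / (1 + df)) + 1. The helper name smoothed_idf and the example counts are illustrative assumptions, not part of the library.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Hypothetical helper: IDF with add-one smoothing, ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
idf_common = smoothed_idf(n_docs, doc_freq=3)  # term present in every document
idf_rare = smoothed_idf(n_docs, doc_freq=1)    # term present in a single document
print(idf_common, idf_rare)
```

A term found in every document gets the minimum weight, while a term found in only one document is weighted more heavily.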
❓ Hyperparameter
Difficulty: advanced
Effect of max_df parameter in TfidfVectorizer
What is the effect of setting max_df=0.5 in TfidfVectorizer?
💡 Hint
max_df controls filtering of very common words.
Explanation
max_df=0.5 means words appearing in more than half the documents are ignored as too common.
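The filtering rule can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's implementation: the function name filter_by_max_df, the whitespace tokenization, and the four-document corpus are all assumptions chosen for illustration. A float max_df is treated as a proportion of the corpus, and a term is dropped when its document frequency exceeds that proportion.

```python
def filter_by_max_df(corpus, max_df):
    """Hypothetical helper: keep terms whose document frequency is at most max_df * n_docs."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    threshold = max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if df <= threshold)

corpus = ['the cat sat', 'the dog ran', 'the bird flew', 'cat ran home']
kept = filter_by_max_df(corpus, max_df=0.5)
print(kept)
```

Here 'the' appears in 3 of 4 documents, which is above the 0.5 threshold, so it is filtered out. 'cat' and 'ran' appear in exactly half the documents and are kept.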
❓ Metrics
Difficulty: advanced
Interpreting cosine similarity with TF-IDF vectors
Given two TF-IDF vectors A and B, which cosine similarity value indicates the most similar documents?
💡 Hint
In general, cosine similarity ranges from -1 to 1. TF-IDF weights are non-negative, so for TF-IDF vectors it falls between 0 and 1.
Explanation
Cosine similarity of 1 means vectors point in the same direction, indicating maximum similarity.
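A minimal sketch of the computation, assuming non-zero vectors. The weights below are made-up illustrative values, not output from a real vectorizer, and cosine_similarity here is a hand-written helper rather than the scikit-learn function of the same name.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths (non-zero vectors assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = [0.5, 0.8, 0.0]            # illustrative TF-IDF weights
same_direction = [1.0, 1.6, 0.0]   # doc_a scaled by 2
no_shared_terms = [0.0, 0.0, 0.9]  # non-zero only where doc_a is zero

sim_same = cosine_similarity(doc_a, same_direction)
sim_disjoint = cosine_similarity(doc_a, no_shared_terms)
print(sim_same, sim_disjoint)
```

Scaling a vector does not change the result, so the first pair scores (up to floating-point rounding) 1, while documents with no terms in common score 0.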
🔧 Debug
Difficulty: expert
Identifying error in TF-IDF vectorization code
What error will the following code raise when executed?
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['data science', 'machine learning']
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())
print(vectorizer.stop_words_)
print(vectorizer.vocabulary_['science'])
💡 Hint
Check if 'science' is considered a stop word and if it exists in vocabulary.
Explanation
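No error occurs. stop_words='english' is a valid setting, and 'science' is not in the English stop word list, so it remains in the vocabulary and the lookup vocabulary_['science'] succeeds. The fitted stop_words_ attribute (where the installed scikit-learn version provides it) holds the terms dropped by max_df, min_df, or max_features rather than the 'english' list, so it is an empty set here.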