
TF-IDF (TfidfVectorizer) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of TF-IDF vectorization with simple corpus
What is the shape of the TF-IDF matrix produced by the following code?
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['apple orange banana', 'banana fruit apple', 'fruit orange']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)
A. (3, 4)
B. (4, 3)
C. (4, 4)
D. (3, 3)
💡 Hint
Remember, rows correspond to documents and columns to unique words.
🧠 Conceptual
intermediate
Understanding IDF effect on common words
Which statement best describes the effect of IDF (Inverse Document Frequency) in TF-IDF vectorization?
A. IDF only considers the length of each document.
B. IDF increases the weight of words that appear in many documents.
C. IDF decreases the weight of words that appear in many documents.
D. IDF assigns equal weight to all words regardless of frequency.
💡 Hint
Think about how common words like 'the' or 'and' should be treated.
Hyperparameter
advanced
Effect of max_df parameter in TfidfVectorizer
What is the effect of setting max_df=0.5 in TfidfVectorizer?
A. It normalizes the TF-IDF vectors to have max value 0.5.
B. It ignores words that appear in less than 50% of the documents.
C. It limits the maximum number of features to 50.
D. It ignores words that appear in more than 50% of the documents.
💡 Hint
max_df controls filtering of very common words.
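To see the filtering in action, the vocabulary can be compared with and without the parameter. This is a minimal sketch on a hypothetical four-document corpus. The helper kept_terms is an illustrative name, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: 'alpha' is in 3 of 4 documents, 'beta' in 2 of 4,
# and every other word in exactly 1 of 4.
docs = ["alpha beta", "alpha gamma", "alpha delta", "beta epsilon"]

def kept_terms(max_df):
    """Return the sorted vocabulary learned with the given max_df (illustrative helper)."""
    vectorizer = TfidfVectorizer(max_df=max_df)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)

print("max_df=1.0:", kept_terms(1.0))
print("max_df=0.5:", kept_terms(0.5))
```

A float max_df is read as a proportion of documents and an integer as an absolute document count, so max_df=0.5 and max_df=2 behave the same on this four-document corpus.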
Metrics
advanced
Interpreting cosine similarity with TF-IDF vectors
Given two TF-IDF vectors A and B, which cosine similarity value indicates the most similar documents?
A. 1.0
B. 0.0
C. -1.0
D. 0.5
💡 Hint
In general, cosine similarity ranges from -1 to 1. TF-IDF vectors have no negative entries, so for them it falls between 0 and 1.
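Pairwise similarities between TF-IDF vectors can be inspected with cosine_similarity from sklearn.metrics.pairwise. The sketch below uses a hypothetical corpus containing two identical documents, one partially overlapping document, and one with no shared words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: docs 0 and 1 are identical, doc 2 shares two words
# with them, and doc 3 shares none.
docs = [
    "machine learning model",
    "machine learning model",
    "machine learning course",
    "pasta recipe book",
]

X = TfidfVectorizer().fit_transform(docs)

# sims[i, j] is the cosine similarity between document i and document j.
sims = cosine_similarity(X)

for j in range(1, len(docs)):
    print(f"doc 0 vs doc {j}: {sims[0, j]:.3f}")
```

Which printed value corresponds to the identical pair, and which to the pair with no shared vocabulary?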
🔧 Debug
expert
Identifying error in TF-IDF vectorization code
What error will the following code raise when executed?
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['data science', 'machine learning']
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())
print(vectorizer.stop_words_)
print(vectorizer.vocabulary_['science'])
A. AttributeError because 'stop_words_' attribute does not exist
B. No error, code runs successfully
C. TypeError because stop_words parameter expects a list, not a string
D. KeyError because 'science' is removed as a stop word
💡 Hint
Check if 'science' is considered a stop word and if it exists in vocabulary.
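A quick way to follow this hint is to test membership in scikit-learn's built-in English stop word list, which is exposed as ENGLISH_STOP_WORDS and is the list used when stop_words='english'. The probe words below are arbitrary examples. Swap in any token you want to check.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Arbitrary probe words; replace them with the tokens from your own corpus.
probe_words = ["the", "and", "vector"]

# Map each word to whether the built-in English list would filter it out.
membership = {word: word in ENGLISH_STOP_WORDS for word in probe_words}

for word, is_stop_word in membership.items():
    print(f"{word!r} is a stop word: {is_stop_word}")
```

A token that is filtered out never reaches vocabulary_, so looking it up there raises KeyError. A token that survives maps to its column index.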