TF-IDF (TfidfVectorizer) in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of TF-IDF vectorization with simple corpus
What is the shape of the TF-IDF matrix produced by the following code?
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['apple orange banana', 'banana fruit apple', 'fruit orange']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)
A. (3, 4)
B. (4, 3)
C. (4, 4)
D. (3, 3)
💡 Hint
Remember, rows correspond to documents and columns to unique words.
🧠 Conceptual
intermediate
Understanding IDF effect on common words
Which statement best describes the effect of IDF (Inverse Document Frequency) in TF-IDF vectorization?
A. IDF only considers the length of each document.
B. IDF increases the weight of words that appear in many documents.
C. IDF decreases the weight of words that appear in many documents.
D. IDF assigns equal weight to all words regardless of frequency.
💡 Hint
Think about how common words like 'the' or 'and' should be treated.
Hyperparameter
advanced
Effect of max_df parameter in TfidfVectorizer
What is the effect of setting max_df=0.5 in TfidfVectorizer?
A. It normalizes the TF-IDF vectors to have max value 0.5.
B. It ignores words that appear in less than 50% of the documents.
C. It limits the maximum number of features to 50.
D. It ignores words that appear in more than 50% of the documents.
💡 Hint
max_df controls filtering of very common words.
Metrics
advanced
Interpreting cosine similarity with TF-IDF vectors
Given two TF-IDF vectors A and B, which cosine similarity value indicates the most similar documents?
A. 1.0
B. 0.0
C. -1.0
D. 0.5
💡 Hint
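In general, cosine similarity ranges from -1 to 1; because TF-IDF weights are non-negative, values for TF-IDF vectors fall between 0 and 1.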
🔧 Debug
expert
Identifying error in TF-IDF vectorization code
What error will the following code raise when executed?
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['data science', 'machine learning']
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())
print(vectorizer.stop_words_)
print(vectorizer.vocabulary_['science'])
A. AttributeError because 'stop_words_' attribute does not exist
B. No error, code runs successfully
C. TypeError because stop_words parameter expects a list, not a string
D. KeyError because 'science' is removed as a stop word
💡 Hint
Check if 'science' is considered a stop word and if it exists in vocabulary.

Practice

1. What does the TfidfVectorizer primarily do in text processing?
easy
A. It converts text into numbers reflecting word importance.
B. It translates text into another language.
C. It removes all punctuation from the text.
D. It counts the total number of characters in text.

Solution

  1. Step 1: Understand the purpose of TfidfVectorizer

    TfidfVectorizer transforms text data into numerical values that represent how important each word is in the text.
  2. Step 2: Compare options with this purpose

    Only option A, "It converts text into numbers reflecting word importance," matches this: TfidfVectorizer turns text into numeric weights. The other options describe translation, punctuation removal, and character counting.
  3. Final Answer:

    It converts text into numbers reflecting word importance. -> Option A
  4. Quick Check:

    TF-IDF = word importance numbers [OK]
Hint: TF-IDF = numbers showing word importance in text [OK]
Common Mistakes:
  • Confusing TF-IDF with translation or punctuation removal
  • Thinking TF-IDF counts characters instead of words
  • Assuming TF-IDF just counts word frequency without weighting
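To make the weighting concrete, here is a minimal sketch on a made-up three-document corpus (the documents and variable names are illustrative assumptions, not part of the exercise). It compares the weight of a word that occurs in every document with the weight of a word that occurs in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" appears in every document, "cat" in only one.
docs = ["the cat sat", "the dog ran", "the bird flew"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)      # sparse matrix, one row per document

vocab = vectorizer.vocabulary_          # maps each word to its column index
row0 = X[0].toarray().ravel()           # TF-IDF weights for the first document

weight_the = row0[vocab["the"]]
weight_cat = row0[vocab["cat"]]
print(f"the: {weight_the:.3f}, cat: {weight_cat:.3f}")
```

With scikit-learn's default settings, "the" should come out with a lower weight than "cat" in the first document, even though both occur once there, because "the" appears in all three documents.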
2. Which of the following is the correct way to import TfidfVectorizer from scikit-learn?
easy
A. from sklearn.feature_extraction.text import TfidfVectorizer
B. import TfidfVectorizer from sklearn.text
C. from sklearn.text import TfidfVectorizer
D. import TfidfVectorizer from sklearn.feature_extraction

Solution

  1. Step 1: Recall the correct module for TfidfVectorizer

    TfidfVectorizer is located in sklearn.feature_extraction.text module.
  2. Step 2: Match the correct import syntax

    The correct Python import syntax is: from sklearn.feature_extraction.text import TfidfVectorizer, which matches option A. Options B and D put import before from, which is not valid Python, and sklearn.text in option C is not a real module.
  3. Final Answer:

    from sklearn.feature_extraction.text import TfidfVectorizer -> Option A
  4. Quick Check:

    Correct import path = from sklearn.feature_extraction.text import TfidfVectorizer [OK]
Hint: Remember sklearn.feature_extraction.text for TfidfVectorizer import [OK]
Common Mistakes:
  • Using wrong module names like sklearn.text
  • Incorrect import syntax order
  • Trying to import from sklearn.feature_extraction without .text
3. What will be the shape of the output matrix after applying TfidfVectorizer on 3 documents with 5 unique words total?
medium
A. (5, 5)
B. (5, 3)
C. (3, 3)
D. (3, 5)

Solution

  1. Step 1: Understand TfidfVectorizer output shape

    The output is a matrix where rows represent documents and columns represent unique words (features).
  2. Step 2: Apply to given numbers

    With 3 documents and 5 unique words, the shape is (3, 5) -- 3 rows and 5 columns.
  3. Final Answer:

    (3, 5) -> Option D
  4. Quick Check:

    Output shape = (documents, unique words) = (3, 5) [OK]
Hint: Rows = documents, columns = unique words in TF-IDF matrix [OK]
Common Mistakes:
  • Swapping rows and columns in output shape
  • Confusing number of documents with number of words
  • Assuming square matrix regardless of input
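The shape rule can be checked directly. The sketch below uses a made-up corpus (an assumption for illustration) with exactly 3 documents and 5 distinct words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: 3 documents, 5 distinct words in total
# (red, green, blue, yellow, pink).
docs = ["red green blue", "green blue yellow", "yellow pink red"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

n_docs, n_terms = X.shape
print(X.shape)                       # (3, 5): rows are documents, columns are words
print(len(vectorizer.vocabulary_))   # 5: one column per distinct word
```

Note that the number of columns depends on the vocabulary the vectorizer actually learns, so options such as stop_words or min_df can make it smaller than the raw count of distinct words.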
4. Given this code snippet, what is the error?
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ['apple orange', 'orange banana', 'banana apple']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(X.shape)
print(vectorizer.get_feature_names())
medium
A. fit_transform() requires a list of integers, not strings
B. get_feature_names() is deprecated; should use get_feature_names_out()
C. TfidfVectorizer() needs a parameter specifying language
D. print(X.shape) will cause an error because X is not defined

Solution

  1. Step 1: Check method usage for feature names

    get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions this call raises an AttributeError. Use get_feature_names_out() instead.
  2. Step 2: Verify other code parts

    fit_transform() accepts list of strings, TfidfVectorizer() works without language parameter, and X is defined correctly.
  3. Final Answer:

    get_feature_names() is deprecated; should use get_feature_names_out() -> Option B
  4. Quick Check:

    Use get_feature_names_out() instead of deprecated get_feature_names() [OK]
Hint: Use get_feature_names_out() for feature names in new sklearn versions [OK]
Common Mistakes:
  • Using deprecated get_feature_names() causing warnings or errors
  • Thinking fit_transform() needs numeric input
  • Assuming language parameter is mandatory
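A corrected version of the snippet might look like the sketch below. The hasattr fallback is an assumption added here for code that must also run on old scikit-learn installations; on current versions only get_feature_names_out() is needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['apple orange', 'orange banana', 'banana apple']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(X.shape)

# Prefer the current method; fall back only on installations that predate it.
if hasattr(vectorizer, "get_feature_names_out"):
    names = list(vectorizer.get_feature_names_out())
else:
    names = list(vectorizer.get_feature_names())
print(names)
```

The feature names come back in column order, which is alphabetical for a vocabulary learned from the data.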
5. You want to ignore very common words like 'the' and 'is' when using TfidfVectorizer. Which parameter helps you do this effectively?
hard
A. lowercase=False
B. max_features=1000
C. stop_words='english'
D. norm=None

Solution

  1. Step 1: Identify parameter for ignoring common words

    The stop_words parameter removes common words (stop words) like 'the', 'is', 'and'. Setting stop_words='english' removes English stop words.
  2. Step 2: Check other parameters

    max_features caps the vocabulary size but does not target stop words; lowercase controls case folding; norm controls vector normalization. None of these removes stop words.
  3. Final Answer:

    stop_words='english' -> Option C
  4. Quick Check:

    stop_words='english' removes common words [OK]
Hint: Use stop_words='english' to skip common words [OK]
Common Mistakes:
  • Confusing max_features with stop words removal
  • Not using stop_words parameter at all
  • Thinking lowercase removes stop words
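The effect of stop_words='english' is easiest to see by comparing vocabularies with and without it. This is a minimal sketch on two made-up sentences (an assumption for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences containing common English function words.
docs = ["the cat is on the mat", "the dog is in the garden"]

plain = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words="english").fit(docs)

# Words present without filtering but dropped by the built-in stop word list.
dropped = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
print("kept:", sorted(filtered.vocabulary_))
print("dropped:", dropped)
```

The built-in English list is a fixed, general-purpose list, so for domain-specific text a custom list passed to stop_words, or a max_df threshold, may be a better fit.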