Text feature basics (CountVectorizer, TF-IDF) in ML Python - Cheat Sheet & Quick Revision

Recall & Review
beginner
What does CountVectorizer do in text processing?
CountVectorizer converts a collection of text documents into a matrix of token counts. It counts how many times each word appears in each document.
beginner
Explain TF-IDF in simple terms.
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document compared to all documents, giving higher scores to words that appear often in one document but rarely in others.
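One way to make the definition concrete is to compute a score by hand. The sketch below uses a plain textbook formulation (raw term count times ln(N / df)) on a made-up, pre-tokenized corpus; the function name tf_idf and the data are illustrative assumptions, and this is not the exact formula scikit-learn applies.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(term, doc):
    """Textbook TF-IDF: raw count in the document times ln(N / df)."""
    tf = doc.count(term)
    idf = math.log(n_docs / df[term])
    return tf * idf


print(tf_idf("the", docs[0]))  # in every document, so idf = ln(1) = 0
print(tf_idf("cat", docs[0]))  # in two of three documents
print(tf_idf("sat", docs[0]))  # in one document, so idf = ln(3)
```

scikit-learn's TfidfVectorizer uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row by default, so its numbers differ from this sketch even though the ranking idea is the same.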
intermediate
Why use TF-IDF instead of just counting words?
Because some words like 'the' or 'and' appear in almost every document, counting them doesn't help. TF-IDF reduces the weight of common words and highlights unique words that better describe the document.
beginner
What is the output format of CountVectorizer and TF-IDF Vectorizer?
Both output a matrix where rows represent documents and columns represent words (features). Each cell contains either the count of the word (CountVectorizer) or the TF-IDF score (TF-IDF Vectorizer).
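A short check of this claim, assuming the scikit-learn implementations: both vectorizers return SciPy sparse matrices of the same shape, differing only in what the cells hold. The three phrases are made up for the example.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus with four distinct words.
docs = ["red apple", "green apple", "red red wine"]

count_matrix = CountVectorizer().fit_transform(docs)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Both results are sparse, with shape (documents, distinct tokens).
print(issparse(count_matrix), issparse(tfidf_matrix))
print(count_matrix.shape, tfidf_matrix.shape)

# Counts are integers; TF-IDF scores are floats.
print(count_matrix.dtype, tfidf_matrix.dtype)
```

Call .toarray() on either result to get a dense NumPy array for inspection; on large corpora the sparse form is usually kept to save memory.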
intermediate
How does CountVectorizer handle different words like 'run' and 'running'?
By default, CountVectorizer treats 'run' and 'running' as different words. To group them, you can use techniques like stemming or lemmatization before vectorizing.
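The sketch below shows one possible way to plug normalization into the vectorizer: wrap the default analyzer and apply a stemmer to each token. The toy_stem function is a deliberately crude, hypothetical stand-in written only for this example; in practice a real stemmer or lemmatizer from an NLP library would take its place.

```python
from sklearn.feature_extraction.text import CountVectorizer


def toy_stem(word):
    """Hypothetical, very crude stemmer used only for illustration."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]  # "runn" -> "run"
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word


# Reuse the default lowercasing and tokenization, then stem each token.
base_analyzer = CountVectorizer().build_analyzer()


def stemmed_analyzer(doc):
    return [toy_stem(token) for token in base_analyzer(doc)]


docs = ["run fast", "running fast", "she runs"]

plain = CountVectorizer().fit(docs)
stemmed = CountVectorizer(analyzer=stemmed_analyzer).fit(docs)

print(sorted(plain.vocabulary_))    # ['fast', 'run', 'running', 'runs', 'she']
print(sorted(stemmed.vocabulary_))  # ['fast', 'run', 'she']
```

Passing a callable as analyzer replaces the built-in tokenization step, which is why the wrapper calls the default analyzer first.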
What does CountVectorizer count in text data?
A. Number of characters in each document
B. Importance of words across all documents
C. Number of times each word appears in each document
D. Number of sentences in each document
What does TF-IDF help to identify in text data?
A. Words that are important and unique to a document
B. Length of each sentence
C. Total number of words in a document
D. Common words that appear in all documents
Which of these is a limitation of CountVectorizer without preprocessing?
A. It normalizes word forms
B. It groups similar words automatically
C. It removes stop words by default
D. It ignores word order
What is the shape of the output matrix from CountVectorizer for 100 documents and 500 unique words?
A. 500 x 100
B. 100 x 500
C. 100 x 100
D. 500 x 500
Which step can improve CountVectorizer results by grouping word forms?
A. Stemming or Lemmatization
B. Tokenization
C. Stop word removal
D. Lowercasing
Describe how CountVectorizer transforms text data into numbers.
Think about counting words in each document and organizing them in a table.
Explain why TF-IDF is useful compared to simple word counts.
Consider how common words like 'the' are treated differently.

      Practice

      1. What does CountVectorizer do in text processing?
      easy
      A. Calculates the importance of words based on frequency and rarity
      B. Counts how many times each word appears in the text
      C. Removes stop words from the text
      D. Converts text into lowercase only

      Solution

      1. Step 1: Understand CountVectorizer's role

        CountVectorizer transforms text into a matrix of token counts, counting word occurrences.
      2. Step 2: Differentiate from TF-IDF

        Unlike TF-IDF, it does not weigh words by importance, only counts frequency.
      3. Final Answer:

        Counts how many times each word appears in the text -> Option B
      4. Quick Check:

        CountVectorizer = word counts [OK]
      Hint: CountVectorizer counts words, TF-IDF scores importance [OK]
      Common Mistakes:
      • Confusing CountVectorizer with TF-IDF
      • Thinking it removes stop words by default
      • Assuming it normalizes text only
      2. Which of the following is the correct way to import and create a CountVectorizer in Python?
      easy
      A. from sklearn.feature_extraction.text import CountVectorizer vectorizer = CountVectorizer()
      B. import CountVectorizer from sklearn.text vectorizer = CountVectorizer()
      C. from sklearn.text import CountVectorizer vectorizer = CountVectorizer()
      D. import CountVectorizer vectorizer = CountVectorizer()

      Solution

      1. Step 1: Recall correct sklearn import path

        CountVectorizer is in sklearn.feature_extraction.text module.
      2. Step 2: Check syntax correctness

        Option A uses the correct import statement, from sklearn.feature_extraction.text import CountVectorizer, and then instantiates the class with vectorizer = CountVectorizer().
      3. Final Answer:

        from sklearn.feature_extraction.text import CountVectorizer vectorizer = CountVectorizer() -> Option A
      4. Quick Check:

        Correct import path and syntax [OK]
      Hint: CountVectorizer is in sklearn.feature_extraction.text [OK]
      Common Mistakes:
      • Using wrong module path for import
      • Incorrect import syntax (like import ... from ...)
      • Forgetting to instantiate the class
      3. What will be the output shape of the matrix after applying CountVectorizer on these two sentences?
      from sklearn.feature_extraction.text import CountVectorizer
      sentences = ["I love cats", "Cats love me"]
      vectorizer = CountVectorizer()
      X = vectorizer.fit_transform(sentences)
      print(X.shape)
      medium
      A. (2, 4)
      B. (2, 3)
      C. (3, 2)
      D. (4, 2)

      Solution

      1. Step 1: Count unique tokens in the sentences

        After lowercasing, the words are 'i', 'love', 'cats', 'me'. The default token_pattern keeps only tokens of two or more characters, so 'i' is dropped, leaving 3 features: 'cats', 'love', 'me'.
      2. Step 2: Understand shape of output matrix

        There are 2 sentences (rows) and 3 features (columns), so shape is (2, 3).
      3. Final Answer:

        (2, 3) -> Option B
      4. Quick Check:

        Rows = sentences, columns = unique tokens kept [OK]
      Hint: Shape = (number of texts, unique tokens kept) [OK]
      Common Mistakes:
      • Mixing rows and columns in shape
      • Counting duplicate words multiple times
      • Ignoring case sensitivity (CountVectorizer lowercases by default)
      • Counting one-character tokens such as 'i', which the default token_pattern drops
      4. Identify the error in this TF-IDF code snippet:
      from sklearn.feature_extraction.text import TfidfVectorizer
      texts = ["apple banana apple", "banana fruit"]
      tfidf = TfidfVectorizer()
      X = tfidf.fit_transform(texts)
      print(tfidf.get_feature_names())
      medium
      A. fit_transform() should be called on texts as a string, not list
      B. TfidfVectorizer() requires stop_words parameter
      C. get_feature_names() is deprecated, should use get_feature_names_out()
      D. Import statement is incorrect

      Solution

      1. Step 1: Check method usage for feature names

        get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on current versions this call raises an AttributeError.
      2. Step 2: Use updated method

        Use get_feature_names_out() instead to get feature names without error.
      3. Final Answer:

        get_feature_names() is deprecated, should use get_feature_names_out() -> Option C
      4. Quick Check:

        Use get_feature_names_out() for TF-IDF features [OK]
      Hint: Use get_feature_names_out() with TF-IDF [OK]
      Common Mistakes:
      • Using deprecated get_feature_names() method
      • Passing wrong data type to fit_transform
      • Incorrect import paths
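A corrected version of the snippet might look like the sketch below. The hasattr fallback is an optional assumption for code that may still run on installations older than scikit-learn 1.0; on current versions only get_feature_names_out() exists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["apple banana apple", "banana fruit"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# get_feature_names_out() exists from scikit-learn 1.0 onward;
# fall back to the old name only on older installations.
if hasattr(tfidf, "get_feature_names_out"):
    names = list(tfidf.get_feature_names_out())
else:
    names = list(tfidf.get_feature_names())

print(names)    # ['apple', 'banana', 'fruit']
print(X.shape)  # (2, 3)
```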
      5. You want to transform text data so that common words like 'the' and 'is' have less impact, but rare important words have higher scores. Which method should you use?
      hard
      A. One-hot encoding of words
      B. CountVectorizer without stop words
      C. Raw word counts from CountVectorizer
      D. TF-IDF Vectorizer

      Solution

      1. Step 1: Understand the goal of reducing common word impact

        Common words appear frequently but carry less meaning, so their impact should be lowered.
      2. Step 2: Identify method that weighs words by importance

        TF-IDF gives a higher score to words that are frequent within one document but rare across the collection, which lowers the weight of words found in nearly every document.
      3. Final Answer:

        TF-IDF Vectorizer -> Option D
      4. Quick Check:

        TF-IDF = importance weighting [OK]
      Hint: Use TF-IDF to weigh rare words higher [OK]
      Common Mistakes:
      • Using raw counts which treat all words equally
      • Assuming stop words removal alone solves importance
      • Confusing one-hot encoding with frequency weighting