
Text feature basics (CountVectorizer, TF-IDF) in ML Python - Cheat Sheet & Quick Revision

Recall & Review
beginner
What does CountVectorizer do in text processing?
CountVectorizer converts a collection of text documents into a matrix of token counts. It counts how many times each word appears in each document.
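A minimal sketch of that counting step, assuming scikit-learn is installed. The two-sentence corpus is made up for illustration, and the default tokenizer settings are used, which lowercase the text and keep only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()
the_col = vectorizer.vocabulary_["the"]

print(vocab)
print(dense)
print(dense[0][the_col])  # how often "the" occurs in the first document
```

Each row of the printed array is one document, and each column is one token from the learned vocabulary.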
beginner
Explain TF-IDF in simple terms.
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document compared to all documents, giving higher scores to words that appear often in one document but rarely in others.
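The idea can be worked through by hand. The sketch below uses one common textbook definition, tf = count / document length and idf = log(N / document frequency). This is an assumption for illustration: libraries such as scikit-learn use smoothed and normalized variants by default, so their numbers will differ. The corpus and function names are made up.

```python
import math
from collections import Counter

# Toy corpus, already split into tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

def tf(term, doc):
    # Share of the document's tokens that are this term.
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # Assumes the term occurs in at least one document of the corpus.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

score_the = tf_idf("the", docs[0], docs)  # "the" is in every document
score_mat = tf_idf("mat", docs[0], docs)  # "mat" is only in the first one

print(score_the, score_mat)
```

Under this definition a word found in every document gets idf = log(1) = 0, so "the" scores zero even though it is the most frequent word in the first document.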
intermediate
Why use TF-IDF instead of just counting words?
Words like 'the' or 'and' appear in almost every document, so their raw counts say little about what a particular document is about. TF-IDF lowers the weight of such common words and raises the weight of rarer words that better describe the document.
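A small comparison, assuming scikit-learn and its default TfidfVectorizer settings. In the made-up corpus below, 'the' and 'cat' both occur once in the first document, so their counts tie, but 'the' occurs in every document and therefore ends up with the lower TF-IDF weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus: "the" is in every document, the other words in one each.
docs = ["the cat purrs", "the dog barks", "the bird sings"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()

# Look up the column of each word in each fitted vocabulary.
the_c, cat_c = count_vec.vocabulary_["the"], count_vec.vocabulary_["cat"]
the_w, cat_w = tfidf_vec.vocabulary_["the"], tfidf_vec.vocabulary_["cat"]

print("counts :", counts[0, the_c], counts[0, cat_c])
print("tf-idf :", weights[0, the_w], weights[0, cat_w])
```

The down-weighting is relative: a common word repeated many times within one document can still receive a sizeable score, which is one reason stop word removal is often combined with TF-IDF.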
beginner
What is the output format of CountVectorizer and TF-IDF Vectorizer?
Both output a matrix (sparse in scikit-learn) where rows represent documents and columns represent words (features). Each cell contains either the count of the word (CountVectorizer) or the TF-IDF score (TF-IDF Vectorizer).
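To check the layout in practice, the sketch below fits both vectorizers on a made-up three-document corpus and inspects the result. It assumes scikit-learn and SciPy are installed and uses default settings.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus with four distinct words: apple, green, red, wine.
docs = ["red apple", "green apple", "red red wine"]

count_matrix = CountVectorizer().fit_transform(docs)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

print(count_matrix.shape, tfidf_matrix.shape)  # (documents, vocabulary size)
print(issparse(count_matrix), issparse(tfidf_matrix))
print(count_matrix.dtype, tfidf_matrix.dtype)  # integer counts vs. float weights
print(count_matrix.toarray())
```

Calling toarray() is fine for a toy corpus; on a real corpus the dense form can be very large, so the sparse matrix is usually passed to the model directly.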
intermediate
How does CountVectorizer handle different words like 'run' and 'running'?
By default, CountVectorizer treats 'run' and 'running' as different words. To group them, you can use techniques like stemming or lemmatization before vectorizing.
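The sketch below shows the effect, assuming scikit-learn is installed. The helper toy_stem is a hypothetical, deliberately crude suffix stripper written only for this example; it is not part of any library. It is plugged in through CountVectorizer's analyzer parameter, which accepts a callable that turns a raw document into a list of tokens.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I run daily", "She is running now", "He runs fast"]

def toy_stem(token):
    # Hypothetical stemmer for illustration: strips "-ing" or a final "-s".
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Undo consonant doubling left behind, e.g. "runn" -> "run".
        if token[-1] == token[-2] and token[-1] not in "aeiou":
            token = token[:-1]
    elif token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

def stemming_analyzer(doc):
    # Lowercase, keep tokens of 2+ word characters, then stem each one.
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return [toy_stem(t) for t in tokens]

plain = CountVectorizer().fit(docs)
stemmed = CountVectorizer(analyzer=stemming_analyzer).fit(docs)

print(sorted(plain.vocabulary_))    # "run", "running", "runs" are separate
print(sorted(stemmed.vocabulary_))  # all three collapse to "run"

run_col = stemmed.vocabulary_["run"]
run_counts = stemmed.transform(docs).toarray()[:, run_col]
print(run_counts)
```

In practice a library stemmer or lemmatizer (for example NLTK's PorterStemmer) would take the place of toy_stem.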
What does CountVectorizer count in text data?
A. Number of characters in each document
B. Importance of words across all documents
C. Number of times each word appears in each document
D. Number of sentences in each document
What does TF-IDF help to identify in text data?
A. Words that are important and unique to a document
B. Length of each sentence
C. Total number of words in a document
D. Common words that appear in all documents
Which of these is a limitation of CountVectorizer without preprocessing?
A. It normalizes word forms
B. It groups similar words automatically
C. It removes stop words by default
D. It ignores word order
What is the shape of the output matrix from CountVectorizer for 100 documents and 500 unique words?
A. 500 x 100
B. 100 x 500
C. 100 x 100
D. 500 x 500
Which step can improve CountVectorizer results by grouping word forms?
A. Stemming or Lemmatization
B. Tokenization
C. Stop word removal
D. Lowercasing
Describe how CountVectorizer transforms text data into numbers.
Think about counting words in each document and organizing them in a table.
Explain why TF-IDF is useful compared to simple word counts.
Consider how common words like 'the' are treated differently.