Recall & Review
beginner
What does CountVectorizer do in text processing?
CountVectorizer converts a collection of text documents into a matrix of token counts. It counts how many times each word appears in each document.
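As a minimal sketch of the idea above, assuming scikit-learn is installed and using two made-up sentences as the documents, the counts can be inspected like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (illustrative data, not from any real corpus).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # one row per document

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
table = counts.toarray()

for doc, row in zip(docs, table):
    print(doc, "->", dict(zip(vocab, row.tolist())))
```

Each printed dictionary is one row of the matrix: the word 'the' has a count of 2 in both documents, while 'mat' is counted only in the first.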
beginner
Explain TF-IDF in simple terms.
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document compared to all documents, giving higher scores to words that appear often in one document but rarely in others.
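The score can be worked through by hand. The sketch below uses the plain textbook form, tf × log(N / df), on three tiny made-up documents; the helper names `tf`, `idf`, and `tf_idf` are illustrative only. Library implementations often differ in detail (scikit-learn, for instance, smooths the idf term and normalizes each row by default), so treat the numbers as an illustration rather than what any particular library returns.

```python
import math

# Three toy documents, already split into words (illustrative data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    """Term frequency: share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tf_idf("the", docs[0], docs)  # in every document, so idf = log(1) = 0
score_cat = tf_idf("cat", docs[0], docs)  # in two of three documents
score_sat = tf_idf("sat", docs[0], docs)  # only in this document

print(score_the, score_cat, score_sat)
```

The rarer a word is across the collection, the higher its score in the document where it does occur: 'sat' outranks 'cat', and 'the' drops to zero.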
intermediate
Why use TF-IDF instead of just counting words?
Words like 'the' or 'and' appear in almost every document, so their raw counts say little about what a particular document is about. TF-IDF reduces the weight of such common words and highlights the distinctive words that better describe the document.
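To see the difference in practice, here is a small sketch that assumes scikit-learn with its default settings and three made-up sentences. In the first document 'the' and 'cat' have the same raw count, but TF-IDF gives 'cat' the larger weight because 'the' occurs in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents (illustrative): 'the' appears once in each, 'cat' only in the first.
docs = [
    "the cat sat quietly",
    "the dog barked loudly",
    "the bird sang sweetly",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()

# Look up each word's column in the respective vocabulary.
count_the = counts[0][count_vec.vocabulary_["the"]]
count_cat = counts[0][count_vec.vocabulary_["cat"]]
weight_the = weights[0][tfidf_vec.vocabulary_["the"]]
weight_cat = weights[0][tfidf_vec.vocabulary_["cat"]]

print("raw counts:", count_the, count_cat)
print("tf-idf weights:", round(weight_the, 3), round(weight_cat, 3))
```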
beginner
What is the output format of CountVectorizer and TF-IDF Vectorizer?
Both output a sparse matrix in which rows represent documents and columns represent words (features). Each cell holds either the word's count (CountVectorizer) or its TF-IDF score (TF-IDF Vectorizer).
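The output format can be checked directly. This sketch assumes scikit-learn and SciPy are installed and uses three made-up documents; both vectorizers return a SciPy sparse matrix of the same shape, and only the cell values differ.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents (illustrative): 3 documents, 4 distinct words.
docs = ["red apple", "green apple", "red red wine"]

count_matrix = CountVectorizer().fit_transform(docs)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

print(count_matrix.shape, tfidf_matrix.shape)   # (documents, unique words)
print(count_matrix.dtype, tfidf_matrix.dtype)   # integer counts vs. float scores
print(count_matrix.toarray())                   # dense view, fine for small data
```

Calling `.toarray()` is convenient for inspection, but on a large corpus the dense array can be far bigger than the sparse matrix, since most cells are zero.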
intermediate
How does CountVectorizer handle different words like 'run' and 'running'?
By default, CountVectorizer treats 'run' and 'running' as different words. To group them, you can use techniques like stemming or lemmatization before vectorizing.
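The effect of stemming before vectorizing can be sketched as follows. The `crude_stem` function is a hypothetical toy written only for this example (it handles little more than 'running' and 'runs'); a real project would more likely use an established stemmer or lemmatizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

def crude_stem(word):
    """Toy stemmer for illustration only: strips 'ing' or a final 's'."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if word[-1] == word[-2]:  # undo a doubled consonant: "runn" -> "run"
            word = word[:-1]
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word

docs = ["I run daily", "She is running now", "He runs fast"]

# Stem every word first, then hand the rewritten text to CountVectorizer.
stemmed_docs = [
    " ".join(crude_stem(w) for w in doc.lower().split()) for doc in docs
]

plain = CountVectorizer().fit(docs)
stemmed = CountVectorizer().fit(stemmed_docs)

print(sorted(plain.vocabulary_))    # 'run', 'running', 'runs' are separate columns
print(sorted(stemmed.vocabulary_))  # all three collapse into 'run'
```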
What does CountVectorizer count in text data?
CountVectorizer counts how many times each word appears in each document, creating a matrix of word counts.
What does TF-IDF help to identify in text data?
TF-IDF highlights words that are important and unique to a document by reducing the weight of common words.
Which of these is a limitation of CountVectorizer without preprocessing?
CountVectorizer ignores word order and, unless preprocessing such as stemming is applied, treats each word form as a separate feature.
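The loss of word order is easy to demonstrate. In this sketch (assuming scikit-learn with default settings), two made-up sentences with opposite meanings contain the same words and therefore map to identical count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, different order, different meaning (illustrative data).
docs = ["dog bites man", "man bites dog"]

rows = CountVectorizer().fit_transform(docs).toarray()

# Compare the two document vectors column by column.
same_vector = bool((rows[0] == rows[1]).all())
print(rows)
print("identical vectors:", same_vector)
```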
What is the shape of the output matrix from CountVectorizer for 100 documents and 500 unique words?
The output matrix has one row per document and one column per unique word, so its shape is (100, 500).
Which step can improve CountVectorizer results by grouping word forms?
Stemming or lemmatization groups different forms of a word (like 'run' and 'running') into one base form.
Describe how CountVectorizer transforms text data into numbers.
Think about counting words in each document and organizing them in a table.
Explain why TF-IDF is useful compared to simple word counts.
Consider how common words like 'the' are treated differently.