Recall & Review

beginner

What does CountVectorizer do in text processing?

CountVectorizer converts a collection of text documents into a matrix of token counts. It counts how many times each word appears in each document.
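The counting described above can be sketched in plain Python. This is a minimal illustration of the idea, not scikit-learn's implementation: the helper names `tokenize` and `count_matrix` are hypothetical, and the tokenizer only approximates CountVectorizer's default behavior (lowercasing and keeping tokens of two or more word characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (an approximation of CountVectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())

def count_matrix(docs):
    # One Counter per document: word -> number of occurrences.
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is every distinct word, in sorted order.
    vocab = sorted(set().union(*counts))
    # One row per document, one column per vocabulary word.
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = count_matrix(docs)
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column is one vocabulary word, so the second row records that 'the' appears twice in the second document.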

beginner

Explain TF-IDF in simple terms.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to one document relative to the whole collection, giving higher scores to words that appear often in that document but rarely in the others.

intermediate

Why use TF-IDF instead of just counting words?

Words like 'the' or 'and' appear in almost every document, so their raw counts say little about what a document is about. TF-IDF reduces the weight of such common words and highlights the distinctive words that better describe each document.
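To see the down-weighting concretely, here is a small standard-library sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and skips the row normalization that scikit-learn's TfidfVectorizer applies by default, so the numbers will not match that class's output exactly. The helper names `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog barked", "the bird sang"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word occurs in.
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency (assumed formula, see lead-in).
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    # Term frequency (raw count) multiplied by the word's IDF weight.
    tf = Counter(tokens)
    return {word: count * idf(word) for word, count in tf.items()}

scores = tfidf(tokenized[0])
for word, score in scores.items():
    print(word, round(score, 3))
# the 1.0
# cat 1.693
# sat 1.693
```

'the' occurs in all three documents and gets the lowest possible weight, while 'cat' and 'sat' occur in only one document and score higher, even though every word appears exactly once in the first document.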

beginner

What is the output format of CountVectorizer and TfidfVectorizer?

Both output a matrix where rows represent documents and columns represent words (features). Each cell contains either the count of the word (CountVectorizer) or its TF-IDF score (TfidfVectorizer). In scikit-learn, both return this matrix in sparse format.

intermediate

How does CountVectorizer handle different forms of a word, like 'run' and 'running'?

By default, CountVectorizer treats 'run' and 'running' as different words. To group them, apply stemming or lemmatization to the text before vectorizing.
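The effect of stemming before counting can be illustrated with a toy example. The `crude_stem` function below is a hypothetical, deliberately simplistic suffix stripper written only for this sketch; it will mangle many real words, and in practice an established stemmer or lemmatizer (for example NLTK's PorterStemmer) would take its place.

```python
from collections import Counter

def crude_stem(word):
    # Toy rule for illustration only: strip a few common suffixes,
    # keeping at least three characters of the word.
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "run running runs".split()

# Without stemming, each form is counted as a separate word.
raw = Counter(words)
print(raw)      # Counter({'run': 1, 'running': 1, 'runs': 1})

# With stemming first, all three forms collapse into one feature.
stemmed = Counter(crude_stem(w) for w in words)
print(stemmed)  # Counter({'run': 3})
```

After stemming, the three forms share one column in the count matrix instead of three.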

What does CountVectorizer count in text data?

A. Number of characters in each document

B. Importance of words across all documents

C. Number of times each word appears in each document

D. Number of sentences in each document

What does TF-IDF help to identify in text data?

A. Words that are important and unique to a document

B. Length of each sentence

C. Total number of words in a document

D. Common words that appear in all documents

Which of these is a limitation of CountVectorizer without preprocessing?

A. It normalizes word forms

B. It groups similar words automatically

C. It removes stop words by default

D. It ignores word order

What is the shape of the output matrix from CountVectorizer for 100 documents and 500 unique words?

A. 500 x 100

B. 100 x 500

C. 100 x 100

D. 500 x 500

Which step can improve CountVectorizer results by grouping word forms?

A. Stemming or Lemmatization

B. Tokenization

C. Stop word removal

D. Lowercasing

Describe how CountVectorizer transforms text data into numbers.

Explain why TF-IDF is useful compared to simple word counts.

Practice

1. What does CountVectorizer do in text processing?

easy

A. Calculates the importance of words based on frequency and rarity

B. Counts how many times each word appears in the text

C. Removes stop words from the text

D. Converts text into lowercase only

Text feature basics (CountVectorizer, TF-IDF) in ML Python - Cheat Sheet & Quick Revision

Solution

Step 1: Understand CountVectorizer's role

Step 2: Differentiate from TF-IDF

Final Answer:

Quick Check:

Solution

Step 1: Recall correct sklearn import path

Step 2: Check syntax correctness

Final Answer:

Quick Check:

Solution

Step 1: Count unique words in sentences

Step 2: Understand shape of output matrix

Final Answer:

Quick Check:

Solution

Step 1: Check method usage for feature names

Step 2: Use updated method

Final Answer:

Quick Check:

Solution

Step 1: Understand the goal of reducing common word impact

Step 2: Identify method that weighs words by importance

Final Answer:

Quick Check: