Practice


1. What does a document-term matrix represent in natural language processing?

easy

A. The length of each document

B. The order of words in a sentence

C. The meaning of each word

D. Counts of words in each document

Solution

Step 1: Understand the purpose of a document-term matrix
A document-term matrix has one row per document and one column per word; each cell counts how many times that word appears in that document.
Step 2: Compare options with this definition
Only "Counts of words in each document" describes this counting process. Document length, word order, and word meaning are not stored in the matrix.
Final Answer:
Counts of words in each document -> Option D
Quick Check:
Document-term matrix = word counts [OK]

Hint: Remember: matrix counts words per document [OK]

Common Mistakes:

Confusing word order with counts
Thinking it shows word meanings
Assuming it measures document length
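
To make the definition concrete, here is a minimal sketch that builds a document-term matrix by hand with the standard library only. The example documents and the names `docs`, `vocab`, and `dtm` are illustrative assumptions, not part of the quiz; splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

# Hypothetical example documents (not from the quiz)
docs = ["red apple", "apple apple pie"]

# Vocabulary: every distinct word, sorted so the column order is predictable
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[word] for word in vocab])

print(vocab)  # ['apple', 'pie', 'red']
print(dtm)    # [[1, 0, 1], [2, 1, 0]]
```

Each row describes one document and each column one word, which is why the matrix records counts rather than word order or meaning.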

2. Which Python library provides the CountVectorizer class to create a document-term matrix?

easy

A. numpy

B. pandas

C. scikit-learn

D. matplotlib

Solution

Step 1: Recall the library for text feature extraction
CountVectorizer is part of scikit-learn, a popular machine learning library.
Step 2: Verify other options
numpy is for arrays, pandas for data frames, matplotlib for plotting, so they don't provide CountVectorizer.
Final Answer:
scikit-learn -> Option C
Quick Check:
CountVectorizer = scikit-learn [OK]

Hint: CountVectorizer is from scikit-learn, not numpy [OK]

Common Mistakes:

Choosing numpy because it handles arrays
Confusing pandas with text vectorization
Selecting matplotlib for visualization

3. What is the output of this Python code snippet?

from sklearn.feature_extraction.text import CountVectorizer
texts = ['cat dog', 'dog dog cat']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())

medium

A. [[1 1] [1 2]]

B. [[1 1] [2 1]]

C. [[2 1] [1 2]]

D. [[1 2] [1 1]]

Solution

Step 1: Identify the vocabulary and word counts
The texts are 'cat dog' and 'dog dog cat'. Vocabulary sorted alphabetically is ['cat', 'dog']. First document has 1 'cat' and 1 'dog'. Second document has 1 'cat' and 2 'dog's.
Step 2: Form the document-term matrix
Matrix rows correspond to documents, columns to words: [[1 1] [1 2]].
Final Answer:
[[1 1] [1 2]] -> Option A
Quick Check:
Word counts match matrix [OK]

Hint: Count words per document in alphabetical order [OK]

Common Mistakes:

Mixing order of words in vocabulary
Counting wrong number of word occurrences
Confusing rows and columns
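
The column order assumed in the solution can be checked by inspecting the fitted vocabulary. This is a small sketch assuming scikit-learn is installed; the names `columns` and `matrix` are illustrative, and the `vocabulary_` attribute maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['cat dog', 'dog dog cat']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# vocabulary_ maps each word to its column index; sort words by that index
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Convert the sparse matrix to a plain nested list for easy reading
matrix = X.toarray().tolist()

print(columns)  # ['cat', 'dog']
print(matrix)   # [[1, 1], [1, 2]]
```

Reading the columns alongside the rows confirms that the second document has one 'cat' and two 'dog'.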

4. Identify the error in this code that tries to create a document-term matrix:

from sklearn.feature_extraction.text import CountVectorizer
texts = ['apple orange', 'orange apple apple']
vectorizer = CountVectorizer()
X = vectorizer.transform(texts)
print(X.toarray())

medium

A. toarray() is not a method of X

B. Missing fit() before transform()

C. texts should be a single string, not a list

D. CountVectorizer() should be CountVector()

Solution

Step 1: Understand CountVectorizer usage
CountVectorizer must learn its vocabulary through fit() or fit_transform() before transform() can be called.
Step 2: Check the code sequence
The code calls transform() directly on a new vectorizer, so no vocabulary has been learned and scikit-learn raises a NotFittedError.
Final Answer:
Missing fit() before transform() -> Option B
Quick Check:
fit() needed before transform() [OK]

Hint: Always fit before transform with CountVectorizer [OK]

Common Mistakes:

Skipping fit() step
Using wrong class name
Passing wrong data type to vectorizer
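
One possible fix is sketched below, assuming scikit-learn is installed. It first reproduces the failure, then replaces transform() with fit_transform(). The names `unfitted`, `raised`, and `fixed` are illustrative and not part of the quiz.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

texts = ['apple orange', 'orange apple apple']

# Reproduce the problem: transform() on a vectorizer that was never fitted
unfitted = CountVectorizer()
try:
    unfitted.transform(texts)
    raised = False
except NotFittedError:
    raised = True

# Fix: learn the vocabulary and build the matrix in one step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
fixed = X.toarray().tolist()

print(raised)  # True
print(fixed)   # [[1, 1], [2, 1]]
```

Calling fit(texts) followed by transform(texts) gives the same matrix; fit_transform() is simply the shorter form when the same texts are used for both steps.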

5. You have three documents: ['sun moon', 'moon moon sun', 'star sun moon']. Using CountVectorizer, what is the shape of the document-term matrix and which word has the highest total count across all documents?

hard

A. Shape (3, 3), 'moon' has highest count

B. Shape (3, 4), 'sun' has highest count

C. Shape (3, 3), 'sun' has highest count

D. Shape (3, 4), 'moon' has highest count

Solution

Step 1: Identify unique words and matrix shape
Unique words are 'sun', 'moon', 'star' -> 3 words. There are 3 documents, so shape is (3, 3).
Step 2: Count total occurrences of each word
'sun': appears 1 + 1 + 1 = 3 times
'moon': appears 1 + 2 + 1 = 4 times
'star': appears 0 + 0 + 1 = 1 time
Highest count is 'moon' with 4.
Final Answer:
Shape (3, 3), 'moon' has highest count -> Option A
Quick Check:
3 docs x 3 words, moon count highest [OK]

Hint: Count unique words for shape, sum counts for highest word [OK]

Common Mistakes:

Counting duplicate words as unique
Mixing up shape dimensions
Incorrectly summing word counts
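
The shape and the per-word totals can also be computed rather than counted by hand. This is a minimal sketch assuming scikit-learn is installed; `totals_by_word` and `top_word` are illustrative names, and the matrix is converted to a dense array before summing, which is reasonable only for small examples like this one.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['sun moon', 'moon moon sun', 'star sun moon']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Column labels in matrix order (words sorted by their column index)
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Sum each column over all documents to get the total count per word
totals = X.toarray().sum(axis=0).tolist()
totals_by_word = dict(zip(columns, totals))
top_word = max(totals_by_word, key=totals_by_word.get)

print(X.shape)         # (3, 3)
print(totals_by_word)  # {'moon': 4, 'star': 1, 'sun': 3}
print(top_word)        # moon
```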
