Document-term matrix in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of Document-Term Matrix Creation
What is the output of the following code that creates a document-term matrix from two simple documents?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['apple orange apple', 'orange banana orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
A
[[1 2 0]
 [0 1 2]]
B
[[1 1 1]
 [1 1 1]]
C
[[2 0 1]
 [0 1 2]]
D
[[2 1 0]
 [0 2 1]]
💡 Hint
CountVectorizer counts how many times each word appears in each document.
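To make the hint concrete, here is a minimal pure-Python sketch of the kind of counting involved. It is an illustration under simplifying assumptions (whitespace tokenization, lowercasing, columns in alphabetical order of the vocabulary), not scikit-learn's implementation. The helper name `build_dtm` and the sample corpus are made up for this example.

```python
from collections import Counter


def build_dtm(corpus):
    """Return (vocabulary, matrix) for a list of documents.

    Hypothetical helper: splits on whitespace and lowercases, which is
    simpler than CountVectorizer's default regex-based tokenizer.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    # Assumption: columns follow the alphabetically sorted vocabulary.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = build_dtm(["red green red", "green blue blue blue"])
print(vocabulary)  # ['blue', 'green', 'red']
print(matrix)      # [[0, 1, 2], [3, 1, 0]]
```

Each row is one document and each column is one vocabulary term, so reading a row left to right gives that document's counts in vocabulary order.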
🧠 Conceptual
intermediate
Understanding Document-Term Matrix Dimensions
If you create a document-term matrix from 5 documents containing a total of 100 unique words, what will be the shape (rows, columns) of the matrix?
A. 100 rows and 5 columns
B. 100 rows and 100 columns
C. 5 rows and 5 columns
D. 5 rows and 100 columns
💡 Hint
Rows represent documents, columns represent unique words.
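A small sketch of how the shape follows from that rule, using a made-up three-document corpus and plain whitespace tokenization (both are assumptions for illustration only):

```python
docs = ["the cat sat", "the dog sat down", "a cat and a dog"]

# Tokenize each document on whitespace.
tokenized = [doc.split() for doc in docs]

# The vocabulary is the set of unique words across all documents.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term.
matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

n_rows, n_cols = len(matrix), len(matrix[0])
print(n_rows, n_cols)  # 3 7
```

The row count tracks the number of documents and the column count tracks the vocabulary size, independent of how long each document is.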
Metrics
advanced
Choosing the Right Metric for Document-Term Matrix Similarity
Which metric is most appropriate to measure similarity between two document vectors from a document-term matrix when the goal is to find documents with similar topics regardless of length?
A. Cosine similarity
B. Euclidean distance
C. Manhattan distance
D. Jaccard index
💡 Hint
Consider a metric that ignores vector length and focuses on direction.
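The difference between a direction-based and a magnitude-based measure can be checked by hand. The sketch below uses only the standard library; the count vectors are invented for illustration, with `long_doc` being `short_doc` repeated three times.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


short_doc = [2, 0, 1]
long_doc = [6, 0, 3]   # same term proportions, three times the length
other_doc = [0, 3, 1]  # mostly different terms

cos_same = cosine_similarity(short_doc, long_doc)
dist_same = euclidean_distance(short_doc, long_doc)
cos_other = cosine_similarity(short_doc, other_doc)

print(round(cos_same, 3), round(dist_same, 3), round(cos_other, 3))
```

The two documents with identical term proportions get a cosine similarity of essentially 1 while their Euclidean distance is well above 0, which is the length effect the hint refers to.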
🔧 Debug
advanced
Identifying the Error in Document-Term Matrix Code
What error will the following code raise?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['cat dog', 'dog mouse']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X[0, 1])
A. No error, prints the count of the second vocabulary term in the first document
B. TypeError: 'csr_matrix' object is not subscriptable
C. AttributeError: 'CountVectorizer' object has no attribute 'fit_transform'
D. IndexError: index out of range
💡 Hint
csr_matrix supports indexing with X[i, j].
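For intuition about why an `X[i, j]` lookup is well defined on a sparse matrix, here is a toy version of the compressed sparse row (CSR) layout. The class `TinyCSR` is hypothetical and far simpler than SciPy's `csr_matrix`; it supports only single-element reads and exists purely to illustrate the idea.

```python
class TinyCSR:
    """Toy CSR matrix supporting only X[i, j] reads (illustrative, not SciPy)."""

    def __init__(self, dense):
        self.shape = (len(dense), len(dense[0]))
        self.data = []      # non-zero values, row by row
        self.indices = []   # column index of each stored value
        self.indptr = [0]   # indptr[i]:indptr[i + 1] slices out row i
        for row in dense:
            for col, value in enumerate(row):
                if value != 0:
                    self.data.append(value)
                    self.indices.append(col)
            self.indptr.append(len(self.data))

    def __getitem__(self, key):
        i, j = key
        n_rows, n_cols = self.shape
        if not (0 <= i < n_rows and 0 <= j < n_cols):
            raise IndexError("index out of range")
        # Scan only the stored entries of row i.
        for pos in range(self.indptr[i], self.indptr[i + 1]):
            if self.indices[pos] == j:
                return self.data[pos]
        return 0  # entries that are not stored are implicit zeros


X = TinyCSR([[3, 0, 1], [0, 2, 0]])
print(X[0, 2])  # 1
print(X[1, 0])  # 0
```

Only the non-zero counts are stored, yet every valid `(row, column)` pair still returns a value, with unstored positions reading as zero.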
Model Choice
expert
Best Model to Use with Document-Term Matrix for Text Classification
Given a document-term matrix representing text data, which machine learning model is generally most suitable for classifying documents into categories when the data is high-dimensional and sparse?
A. K-Nearest Neighbors (KNN)
B. Support Vector Machine (SVM) with linear kernel
C. Decision Tree
D. Naive Bayes
💡 Hint
Consider models that handle high-dimensional sparse data well and avoid overfitting.