Challenge - 5 Problems
Document-Term Matrix Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Document-Term Matrix Creation
What is the output of the following code that creates a document-term matrix from two simple documents?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['apple orange apple', 'orange banana orange']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
💡 Hint
CountVectorizer counts how many times each word appears in each document.
✗ Incorrect
The vectorizer finds three unique words and orders them alphabetically: 'apple', 'banana', 'orange'. The first document contains 'apple' twice and 'orange' once, so its vector is [2, 0, 1]. The second contains 'orange' twice and 'banana' once, so its vector is [0, 1, 2]. The printed matrix is therefore [[2 0 1] [0 1 2]], which matches option C.
🧠 Conceptual
Intermediate · 1:30 remaining
Understanding Document-Term Matrix Dimensions
If you create a document-term matrix from 5 documents containing a total of 100 unique words, what will be the shape (rows, columns) of the matrix?
💡 Hint
Rows represent documents, columns represent unique words.
✗ Incorrect
A document-term matrix has one row per document and one column per unique word across the whole corpus. So with 5 documents and 100 unique words, the matrix shape is (5, 100).
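The same rule can be checked on a toy corpus (3 documents, 4 unique words, rather than the 5 × 100 case from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# 3 documents containing 4 unique words overall:
# 'blue', 'green', 'red', 'yellow'
docs = ['red blue', 'blue green', 'green yellow red']

X = CountVectorizer().fit_transform(docs)

# Shape is (number of documents, number of unique words).
shape = X.shape
print(shape)  # (3, 4)
```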
❓ Metrics
Advanced · 2:00 remaining
Choosing the Right Metric for Document-Term Matrix Similarity
Which metric is most appropriate to measure similarity between two document vectors from a document-term matrix when the goal is to find documents with similar topics regardless of length?
💡 Hint
Consider a metric that ignores vector length and focuses on direction.
✗ Incorrect
Cosine similarity measures the angle between two vectors, ignoring their magnitudes. This makes it ideal for comparing document vectors of different lengths, since topic similarity depends on the relative distribution of words rather than raw counts.
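To see the length-invariance concretely, here is a small sketch comparing a document with a version of itself that is twice as long; cosine similarity scores them as identical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same topic, but the second document is twice as long.
docs = ['cat dog cat', 'cat dog cat cat dog cat']
X = CountVectorizer().fit_transform(docs)

# Vectors are [2, 1] and [4, 2]: same direction, different length.
sim = cosine_similarity(X[0], X[1])[0, 0]
print(sim)  # 1.0
```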
🔧 Debug
Advanced · 2:00 remaining
Identifying the Error in Document-Term Matrix Code
What error will the following code raise?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['cat dog', 'dog mouse']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X[0, 1])
💡 Hint
csr_matrix supports indexing with X[i, j].
✗ Incorrect
No error is raised. X is a csr_matrix, which supports direct indexing like X[0, 1]. Here, the vocabulary is ['cat', 'dog', 'mouse'] (alphabetical order), so X[0, 1] is the count of 'dog' in the first document, which is 1.
❓ Model Choice
Expert · 2:30 remaining
Best Model to Use with Document-Term Matrix for Text Classification
Given a document-term matrix representing text data, which machine learning model is generally most suitable for classifying documents into categories when the data is high-dimensional and sparse?
💡 Hint
Consider models that handle high-dimensional sparse data well and avoid overfitting.
✗ Incorrect
An SVM with a linear kernel works well on high-dimensional sparse data like document-term matrices: it finds a separating hyperplane efficiently and handles sparsity far better than KNN or decision trees. Naive Bayes is also a common baseline, but linear SVMs often perform better on harder text classification tasks.
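A minimal sketch of this setup, using a hypothetical four-document spam/ham corpus (the texts and labels here are made up for illustration) and scikit-learn's LinearSVC chained after CountVectorizer in a pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny labeled corpus (hypothetical data for illustration).
texts = [
    'cheap pills buy now',
    'meeting agenda attached',
    'buy cheap watches now',
    'project status meeting notes',
]
labels = ['spam', 'ham', 'spam', 'ham']

# CountVectorizer builds the sparse document-term matrix;
# LinearSVC trains a linear-kernel SVM directly on it.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(texts, labels)

# 'cheap', 'watches', 'buy' only occur in spam documents here.
pred = model.predict(['cheap watches buy'])[0]
print(pred)  # spam
```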