
Document-term matrix in NLP - Model Metrics & Evaluation

Metrics & Evaluation - Document-term matrix
Which metrics matter for a document-term matrix, and why

A Document-term matrix (DTM) itself is a way to represent text data as numbers. It shows how often each word appears in each document. The quality of a DTM is often judged by how well it helps a model learn or find patterns.
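As a minimal sketch of this representation, the snippet below builds a DTM with scikit-learn's CountVectorizer. The three short documents are a made-up toy corpus, used only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["spam offer now", "meeting notes now", "spam spam offer"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

# Recover the column order from the learned vocabulary (term -> column index).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(dtm.toarray())
```

Each row is one document and each column is one word, so the entry in row 3 under 'spam' is 2 because 'spam' appears twice in the third document.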

Metrics like sparsity (the fraction of entries that are zero) matter because a very sparse matrix usually means a large vocabulary full of rare words, which can slow down learning and add noise. Also, when using the DTM for tasks like classification, metrics such as accuracy, precision, and recall on the model built from the DTM become important.

In short, the DTM itself is a data format, so we look at metrics that tell us if it represents the text well and helps models perform better.
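Sparsity can be measured directly from the matrix. The sketch below assumes sparsity is defined as the fraction of zero entries and reuses a made-up toy corpus; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["spam offer now", "meeting notes now", "spam spam offer"]
dtm = CountVectorizer().fit_transform(docs)

n_docs, n_terms = dtm.shape
# dtm.nnz is the number of nonzero entries stored in the sparse matrix.
sparsity = 1.0 - dtm.nnz / (n_docs * n_terms)
print(f"shape={dtm.shape}, sparsity={sparsity:.2f}")
```

Real corpora typically produce far higher sparsity than this tiny example, because each document uses only a small part of the full vocabulary.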

Confusion matrix or equivalent visualization

Since a DTM is a data representation, it does not have a confusion matrix by itself. But when a classifier is trained on it, the classifier's confusion matrix looks like this:

      |                 | Predicted Positive  | Predicted Negative  |
      |-----------------|---------------------|---------------------|
      | Actual Positive | True Positive (TP)  | False Negative (FN) |
      | Actual Negative | False Positive (FP) | True Negative (TN)  |
    

This matrix helps us calculate precision and recall for models using the DTM.
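To make the calculation concrete, here is a small sketch using scikit-learn's metrics on made-up labels (1 = spam, 0 = not spam). Passing labels=[1, 0] is a choice made here so that the output follows the same TP/FN over FP/TN layout as the table above; by default scikit-learn orders the classes the other way round.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = spam (positive), 0 = not spam (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# labels=[1, 0] puts the positive class first: [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()

precision = tp / (tp + fp)  # of everything predicted spam, how much was spam
recall = tp / (tp + fn)     # of all real spam, how much was caught

print(cm)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

The hand-computed values agree with precision_score and recall_score, which treat label 1 as the positive class by default.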

Precision vs Recall tradeoff with concrete examples

Imagine using a DTM to detect spam emails:

  • High precision means most emails marked as spam really are spam. This avoids annoying users by not marking good emails as spam.
  • High recall means catching most spam emails, possibly at the cost of marking some good emails as spam.

Depending on what matters more, you might tune your model differently. The DTM quality affects how well the model can balance this tradeoff.
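One common way to tune this balance is to move the decision threshold on the model's spam score. The sketch below uses made-up scores rather than a trained model, and the helper name evaluate is hypothetical; it only illustrates how precision and recall move in opposite directions.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical data: 4 spam emails (1) and 6 good emails (0),
# with made-up spam scores standing in for a model's output.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.05, 0.02]

def evaluate(threshold):
    """Return (precision, recall) when scores >= threshold are marked spam."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

strict = evaluate(0.75)   # mark spam only when very confident
lenient = evaluate(0.25)  # mark spam more readily

print("strict  (precision, recall):", strict)
print("lenient (precision, recall):", lenient)
```

With the strict threshold, everything marked as spam really is spam but half of the spam is missed; with the lenient threshold, all spam is caught but some good emails are flagged too.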

What "good" vs "bad" metric values look like for this use case

Good DTM characteristics:

  • Lower sparsity (fewer near-empty columns from rare or irrelevant words) so models learn better.
  • Words chosen capture important meaning (not just common words).
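One way to move a DTM toward these characteristics is to prune the vocabulary when building it. The sketch below uses two CountVectorizer options, stop_words and min_df, on a made-up corpus; the specific settings are illustrative, not recommended values.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Default settings keep every word of two or more characters.
full = CountVectorizer().fit(docs)

# Drop common English stop words and words seen in fewer than 2 documents.
pruned = CountVectorizer(stop_words="english", min_df=2).fit(docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)
print(len(full_vocab), full_vocab)
print(len(pruned_vocab), pruned_vocab)
```

The pruned vocabulary is smaller and no longer contains filler words such as 'the', so the resulting matrix has fewer columns that are mostly zeros.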

Good model metrics using DTM:

  • Accuracy above roughly 80% for simple, balanced tasks (a rule of thumb, not a fixed threshold).
  • Precision and recall both above roughly 70% for spam detection, with neither far below the other.

Bad signs:

  • Very sparse DTM with many irrelevant words.
  • Model accuracy near random guessing (e.g., 50% for two classes).
  • Precision very high but recall very low, or vice versa, without reason.
Metrics pitfalls
  • Accuracy paradox: High accuracy can happen if one class dominates, but the model ignores the smaller class.
  • Data leakage: If the DTM includes words that reveal the answer directly, the model looks better but won't work in real life.
  • Overfitting: A very large DTM with many rare words can cause the model to memorize training data but fail on new data.
  • Ignoring sparsity: A DTM is mostly zeros by nature, but an unpruned vocabulary with many near-empty columns slows down training and may reduce model quality.
Self-check question

Your model built on a Document-term matrix has 98% accuracy but only 12% recall on the spam class. Is it good for production? Why or why not?

Answer: No, it is not good. The model misses most spam emails (low recall), even though overall accuracy is high. This means it mostly predicts emails as not spam, which is not useful for catching spam.
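The numbers in this question can be reproduced with a small sketch. The counts below are hypothetical, chosen only because they give exactly 98% accuracy and 12% recall; the question itself does not state the dataset size.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical counts: 5000 emails, 100 of them spam.
tp, fn, fp, tn = 12, 88, 12, 4888
y_true = [1] * (tp + fn) + [0] * (fp + tn)
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)

# Baseline: a "model" that never predicts spam at all.
baseline_accuracy = accuracy_score(y_true, [0] * len(y_true))

print(accuracy, recall, precision, baseline_accuracy)
```

Under these assumed counts, a model that never flags anything as spam reaches the same 98% accuracy, which shows why accuracy alone says little when one class dominates.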

Key Result
Document-term matrix quality affects model metrics like precision and recall, which must be balanced for good text classification.

Practice

1. What does a document-term matrix represent in natural language processing?
easy
A. The length of each document
B. The order of words in a sentence
C. The meaning of each word
D. Counts of words in each document

Solution

  1. Step 1: Understand the purpose of a document-term matrix

    A document-term matrix counts how many times each word appears in each document.
  2. Step 2: Compare options with this definition

    Only Counts of words in each document correctly describes this counting process.
  3. Final Answer:

    Counts of words in each document -> Option D
  4. Quick Check:

    Document-term matrix = word counts [OK]
Hint: Remember: matrix counts words per document [OK]
Common Mistakes:
  • Confusing word order with counts
  • Thinking it shows word meanings
  • Assuming it measures document length
2. Which Python library provides the CountVectorizer class to create a document-term matrix?
easy
A. numpy
B. pandas
C. scikit-learn
D. matplotlib

Solution

  1. Step 1: Recall the library for text feature extraction

    CountVectorizer is part of scikit-learn, a popular machine learning library.
  2. Step 2: Verify other options

    numpy is for arrays, pandas for data frames, matplotlib for plotting, so they don't provide CountVectorizer.
  3. Final Answer:

    scikit-learn -> Option C
  4. Quick Check:

    CountVectorizer = scikit-learn [OK]
Hint: CountVectorizer is from scikit-learn, not numpy [OK]
Common Mistakes:
  • Choosing numpy because it handles arrays
  • Confusing pandas with text vectorization
  • Selecting matplotlib for visualization
3. What is the output of this Python code snippet?
from sklearn.feature_extraction.text import CountVectorizer
texts = ['cat dog', 'dog dog cat']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())
medium
A. [[1 1] [1 2]]
B. [[1 1] [2 1]]
C. [[2 1] [1 2]]
D. [[1 2] [1 1]]

Solution

  1. Step 1: Identify the vocabulary and word counts

    The texts are 'cat dog' and 'dog dog cat'. Vocabulary sorted alphabetically is ['cat', 'dog']. First document has 1 'cat' and 1 'dog'. Second document has 1 'cat' and 2 'dog's.
  2. Step 2: Form the document-term matrix

    Matrix rows correspond to documents, columns to words: [[1,1],[1,2]].
  3. Final Answer:

    [[1 1] [1 2]] -> Option A
  4. Quick Check:

    Word counts match matrix [OK]
Hint: Count words per document in alphabetical order [OK]
Common Mistakes:
  • Mixing order of words in vocabulary
  • Counting wrong number of word occurrences
  • Confusing rows and columns
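To check this answer, the sketch below runs the same code and also prints the learned vocabulary, which maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['cat dog', 'dog dog cat']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# vocabulary_ maps each term to its column index in the matrix.
vocab = dict(vectorizer.vocabulary_)
print(vocab)
print(X.toarray())
```

Because 'cat' gets column 0 and 'dog' gets column 1, the second row reads 1 'cat' and 2 'dog's, regardless of the order in which the words appear in the text.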
4. Identify the error in this code that tries to create a document-term matrix:
from sklearn.feature_extraction.text import CountVectorizer
texts = ['apple orange', 'orange apple apple']
vectorizer = CountVectorizer()
X = vectorizer.transform(texts)
print(X.toarray())
medium
A. toarray() is not a method of X
B. Missing fit() before transform()
C. texts should be a single string, not a list
D. CountVectorizer() should be CountVector()

Solution

  1. Step 1: Understand CountVectorizer usage

    CountVectorizer requires calling fit() or fit_transform() before transform() to learn vocabulary.
  2. Step 2: Check the code sequence

    The code calls transform() directly without fit(), so no vocabulary has been learned and scikit-learn raises a NotFittedError.
  3. Final Answer:

    Missing fit() before transform() -> Option B
  4. Quick Check:

    fit() needed before transform() [OK]
Hint: Always fit before transform with CountVectorizer [OK]
Common Mistakes:
  • Skipping fit() step
  • Using wrong class name
  • Passing wrong data type to vectorizer
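The sketch below shows the failure and one possible fix side by side. It catches the NotFittedError that scikit-learn raises, then uses fit_transform() to learn the vocabulary and build the matrix in one step; the flag name raised is just for illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

texts = ['apple orange', 'orange apple apple']
vectorizer = CountVectorizer()

# Calling transform() before fitting: no vocabulary has been learned yet.
try:
    vectorizer.transform(texts)
    raised = False
except NotFittedError as err:
    raised = True
    print("Error:", err)

# Fix: learn the vocabulary and build the matrix in one call.
X = vectorizer.fit_transform(texts)
print(X.toarray())
```

Calling fit(texts) followed by transform(texts) would give the same matrix; fit_transform() is simply the shorter form.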
5. You have three documents: ['sun moon', 'moon moon sun', 'star sun moon']. Using CountVectorizer, what is the shape of the document-term matrix and which word has the highest total count across all documents?
hard
A. Shape (3, 3), 'moon' has highest count
B. Shape (3, 4), 'sun' has highest count
C. Shape (3, 3), 'sun' has highest count
D. Shape (3, 4), 'moon' has highest count

Solution

  1. Step 1: Identify unique words and matrix shape

    Unique words are 'sun', 'moon', 'star' -> 3 words. There are 3 documents, so shape is (3, 3).
  2. Step 2: Count total occurrences of each word

    'sun' appears 1 + 1 + 1 = 3 times.
    'moon' appears 1 + 2 + 1 = 4 times.
    'star' appears 0 + 0 + 1 = 1 time.
    Highest count is 'moon' with 4.
  3. Final Answer:

    Shape (3, 3), 'moon' has highest count -> Option A
  4. Quick Check:

    3 docs x 3 words, moon count highest [OK]
Hint: Count unique words for shape, sum counts for highest word [OK]
Common Mistakes:
  • Counting duplicate words as unique
  • Mixing up shape dimensions
  • Incorrectly summing word counts
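The answer can be verified with a short sketch that builds the matrix, checks its shape, and sums each column. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['sun moon', 'moon moon sun', 'star sun moon']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Terms in column order, recovered from the term -> index mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Summing over rows (axis=0) gives the total count of each term.
totals = np.asarray(X.sum(axis=0)).ravel()
top_term = terms[int(totals.argmax())]

print(X.shape)
print(dict(zip(terms, totals.tolist())))
print(top_term)
```

The shape comes from 3 documents by 3 unique words, and the column sums confirm that 'moon' has the highest total.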