
Document-term matrix in NLP - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is a Document-term matrix (DTM)?
A Document-term matrix is a table that shows how often each word appears in each document. Rows are documents, columns are words, and each cell holds the number of times that word appears in that document.
beginner
Why do we use a Document-term matrix in text analysis?
We use a Document-term matrix to turn text into numbers so computers can understand and analyze it, like finding patterns or training machine learning models.
beginner
What does each row and column represent in a Document-term matrix?
Each row represents a document, and each column represents a unique word (term) from all documents.
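As a rough sketch of that layout, the snippet below builds a small count matrix in plain Python. The sample documents and the helper name `build_dtm` are made up for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix) for a list of strings (hypothetical helper)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # One column per unique term, sorted for a stable column order.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document; a missing term counts as 0.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

docs = ["the cat sat", "the dog sat on the mat"]  # illustrative documents
vocab, dtm = build_dtm(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each printed row lines up with the vocabulary list, so the last column of the second row is the count of "the" in the second document.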
intermediate
How can the values in a Document-term matrix be weighted besides simple counts?
Values can be weighted using methods like TF-IDF, which gives more weight to words that are frequent in a given document but rare across the rest of the collection.
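One way to see this weighting in practice is scikit-learn's `TfidfVectorizer`. The sketch below assumes scikit-learn is installed and uses a made-up three-document corpus. Note that scikit-learn applies a smoothed IDF and L2-normalizes each row by default, so the numbers differ from a textbook TF-IDF formula.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cat dog", "cat bird", "cat fish"]  # illustrative corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

col = vectorizer.vocabulary_  # maps each term to its column index
first_doc = X[0].toarray().ravel()

# "cat" occurs in every document, "dog" only in the first one.
cat_weight = first_doc[col["cat"]]
dog_weight = first_doc[col["dog"]]
print(f"cat: {cat_weight:.3f}, dog: {dog_weight:.3f}")
```

Both words occur once in the first document, yet "dog" should receive the larger weight because it is rarer across the corpus.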
intermediate
What is a common problem with Document-term matrices and how is it handled?
DTMs can be very large and sparse (mostly zeros). We handle this by removing rare words, using dimensionality reduction, or applying sparse matrix storage.
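The three remedies can be sketched with scikit-learn, assuming it is installed. The documents are invented for illustration, `min_df=2` is an arbitrary threshold, and `TruncatedSVD` is just one possible choice of dimensionality reduction.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple elderberry",
]  # illustrative documents

# CountVectorizer already returns a SciPy sparse matrix (sparse storage).
full = CountVectorizer().fit_transform(docs)
rows, cols = full.shape
sparsity = 1 - full.nnz / (rows * cols)  # fraction of cells that are zero
print(full.shape, f"sparsity={sparsity:.2f}")

# Remove rare words: keep only terms found in at least 2 documents.
pruned = CountVectorizer(min_df=2).fit_transform(docs)
print(pruned.shape)

# Dimensionality reduction: project the counts onto 2 components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(full)
print(reduced.shape)
```

On a toy corpus the savings are small, but the same calls apply to matrices with many thousands of columns.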
What does a cell value in a Document-term matrix usually represent?
A. The length of the document
B. The number of times a word appears in a document
C. The number of documents containing the word
D. The total number of words in all documents
In a Document-term matrix, what do the rows represent?
A. Words
B. Paragraphs
C. Sentences
D. Documents
Which technique can improve the usefulness of a Document-term matrix by weighting words?
A. TF-IDF
B. Clustering
C. Normalization
D. Tokenization
What is a common issue with Document-term matrices?
A. They are sparse with many zeros
B. They contain too many images
C. They are always too small
D. They cannot be used for machine learning
How can we reduce the size of a Document-term matrix?
A. By adding more documents
B. By increasing the number of words
C. By removing rare words and using dimensionality reduction
D. By converting text to uppercase
Explain what a Document-term matrix is and why it is useful in text analysis.
Think about how text is turned into numbers for computers.
Describe common challenges with Document-term matrices and how to address them.
Consider what happens when many words appear rarely.

      Practice

      1. What does a document-term matrix represent in natural language processing?
      easy
      A. The length of each document
      B. The order of words in a sentence
      C. The meaning of each word
      D. Counts of words in each document

      Solution

      1. Step 1: Understand the purpose of a document-term matrix

        A document-term matrix counts how many times each word appears in each document.
      2. Step 2: Compare options with this definition

        Only option D, "Counts of words in each document", matches this definition.
      3. Final Answer:

        Counts of words in each document -> Option D
      4. Quick Check:

        Document-term matrix = word counts [OK]
      Hint: Remember: matrix counts words per document [OK]
      Common Mistakes:
      • Confusing word order with counts
      • Thinking it shows word meanings
      • Assuming it measures document length
      2. Which Python library provides the CountVectorizer class to create a document-term matrix?
      easy
      A. numpy
      B. pandas
      C. scikit-learn
      D. matplotlib

      Solution

      1. Step 1: Recall the library for text feature extraction

        CountVectorizer is part of scikit-learn, a popular machine learning library.
      2. Step 2: Verify other options

        numpy is for arrays, pandas for data frames, matplotlib for plotting, so they don't provide CountVectorizer.
      3. Final Answer:

        scikit-learn -> Option C
      4. Quick Check:

        CountVectorizer = scikit-learn [OK]
      Hint: CountVectorizer is from scikit-learn, not numpy [OK]
      Common Mistakes:
      • Choosing numpy because it handles arrays
      • Confusing pandas with text vectorization
      • Selecting matplotlib for visualization
      3. What is the output of this Python code snippet?
      from sklearn.feature_extraction.text import CountVectorizer
      texts = ['cat dog', 'dog dog cat']
      vectorizer = CountVectorizer()
      X = vectorizer.fit_transform(texts)
      print(X.toarray())
      medium
      A. [[1 1] [1 2]]
      B. [[1 1] [2 1]]
      C. [[2 1] [1 2]]
      D. [[1 2] [1 1]]

      Solution

      1. Step 1: Identify the vocabulary and word counts

        The texts are 'cat dog' and 'dog dog cat'. The vocabulary, sorted alphabetically, is ['cat', 'dog']. The first document contains 'cat' once and 'dog' once. The second document contains 'cat' once and 'dog' twice.
      2. Step 2: Form the document-term matrix

        Matrix rows correspond to documents, columns to words: [[1,1],[1,2]].
      3. Final Answer:

        [[1 1] [1 2]] -> Option A
      4. Quick Check:

        Word counts match matrix [OK]
      Hint: Count words per document in alphabetical order [OK]
      Common Mistakes:
      • Mixing order of words in vocabulary
      • Counting wrong number of word occurrences
      • Confusing rows and columns
      4. Identify the error in this code that tries to create a document-term matrix:
      from sklearn.feature_extraction.text import CountVectorizer
      texts = ['apple orange', 'orange apple apple']
      vectorizer = CountVectorizer()
      X = vectorizer.transform(texts)
      print(X.toarray())
      medium
      A. toarray() is not a method of X
      B. Missing fit() before transform()
      C. texts should be a single string, not a list
      D. CountVectorizer() should be CountVector()

      Solution

      1. Step 1: Understand CountVectorizer usage

        CountVectorizer requires calling fit() or fit_transform() before transform() to learn vocabulary.
      2. Step 2: Check the code sequence

        The code calls transform() directly without fit(), so no vocabulary has been learned and scikit-learn raises a NotFittedError.
      3. Final Answer:

        Missing fit() before transform() -> Option B
      4. Quick Check:

        fit() needed before transform() [OK]
      Hint: Always fit before transform with CountVectorizer [OK]
      Common Mistakes:
      • Skipping fit() step
      • Using wrong class name
      • Passing wrong data type to vectorizer
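A minimal sketch of the failure and the fix, assuming scikit-learn is installed. It reuses the texts from the question, catches the error raised by the unfitted vectorizer, and then calls `fit_transform()` instead.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

texts = ['apple orange', 'orange apple apple']

vectorizer = CountVectorizer()
try:
    # No vocabulary has been learned yet, so this call fails.
    vectorizer.transform(texts)
    failed = False
except NotFittedError as exc:
    failed = True
    print("transform() before fit():", exc)

# fit_transform() learns the vocabulary and builds the matrix in one step.
X = vectorizer.fit_transform(texts)

# List the terms in column order using the learned term-to-index mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(X.toarray())
```

Calling `fit(texts)` followed by `transform(texts)` would give the same matrix; `fit_transform()` is simply the shorter form.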
      5. You have three documents: ['sun moon', 'moon moon sun', 'star sun moon']. Using CountVectorizer, what is the shape of the document-term matrix and which word has the highest total count across all documents?
      hard
      A. Shape (3, 3), 'moon' has highest count
      B. Shape (3, 4), 'sun' has highest count
      C. Shape (3, 3), 'sun' has highest count
      D. Shape (3, 4), 'moon' has highest count

      Solution

      1. Step 1: Identify unique words and matrix shape

        Unique words are 'sun', 'moon', 'star' -> 3 words. There are 3 documents, so shape is (3, 3).
      2. Step 2: Count total occurrences of each word

        'sun' appears 1 + 1 + 1 = 3 times, 'moon' appears 1 + 2 + 1 = 4 times, and 'star' appears 0 + 0 + 1 = 1 time. The highest count is 'moon' with 4.
      3. Final Answer:

        Shape (3, 3), 'moon' has highest count -> Option A
      4. Quick Check:

        3 docs x 3 words, moon count highest [OK]
      Hint: Count unique words for shape, sum counts for highest word [OK]
      Common Mistakes:
      • Counting duplicate words as unique
      • Mixing up shape dimensions
      • Incorrectly summing word counts
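The answer can be checked with a short script, assuming scikit-learn and NumPy are installed. The variable names are illustrative; the documents are the ones from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['sun moon', 'moon moon sun', 'star sun moon']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(X.shape)  # (number of documents, number of unique terms)

# Terms in column order, taken from the learned term-to-index mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Sum each column to get the total count of every term across documents.
totals = np.asarray(X.sum(axis=0)).ravel()
for term, total in zip(terms, totals):
    print(term, int(total))

top_term = terms[int(totals.argmax())]
print("most frequent:", top_term)
```

Summing along `axis=0` collapses the rows (documents), leaving one total per column (term).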