
Document-term matrix in NLP - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank
easy

Complete the code to create a document-term matrix using CountVectorizer.

NLP
from sklearn.feature_extraction.text import CountVectorizer

docs = ['I love AI', 'AI loves me']
vectorizer = CountVectorizer()
dtm = vectorizer.[1](docs)
print(dtm.toarray())
A. fit_transform
B. transform
C. fit
D. toarray
Common Mistakes
Using transform before fitting the vectorizer.
Calling fit without transforming the data.
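Both mistakes come down to the difference between fit, transform, and fit_transform. The following is a minimal sketch, assuming scikit-learn is installed, that compares the one-step and two-step forms on the same documents. The variable names one_step, two_step, fitted, and same are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['I love AI', 'AI loves me']

# One step: learn the vocabulary and build the matrix together.
one_step = CountVectorizer().fit_transform(docs)

# Two steps: fit() only learns the vocabulary and returns the vectorizer
# itself, so transform() must still be called to get the matrix.
vectorizer = CountVectorizer()
fitted = vectorizer.fit(docs)
two_step = vectorizer.transform(docs)

# Both routes should produce identical counts.
same = bool((one_step.toarray() == two_step.toarray()).all())
print(fitted is vectorizer, same)
print(one_step.shape)
```

With the default settings, tokens must have at least two characters, so the single-letter word 'I' does not get a column.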
2. Fill in the blank
medium

Complete the code to get the feature names (words) from the vectorizer.

NLP
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Data science is fun']
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)
words = vectorizer.[1]()
print(words)
A. features
B. get_feature_names
C. vocabulary_
D. get_feature_names_out
Common Mistakes
Using get_feature_names, which was deprecated in scikit-learn 1.0 and removed in 1.2.
Calling vocabulary_ like a method: it is a dict attribute, so vocabulary_() raises a TypeError.
3. Fill in the blank
hard

Fill in the blank so that the code correctly creates a document-term matrix from the list of documents.

NLP
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Machine learning', 'Learning machines']
vectorizer = CountVectorizer()
dtm = vectorizer.[1](docs)
print(dtm.toarray())
A. transform
B. fit_transform
C. fit
D. toarray
Common Mistakes
Using transform without fitting first.
Using fit without transforming.
4. Fill in the blank
hard

Fill both blanks to create a document-term matrix and get the feature names.

NLP
from sklearn.feature_extraction.text import CountVectorizer

docs = ['AI is amazing', 'Amazing AI']
vectorizer = CountVectorizer()
dtm = vectorizer.[1](docs)
features = vectorizer.[2]()
print(features)
A. fit_transform
B. transform
C. get_feature_names_out
D. get_feature_names
Common Mistakes
Using transform instead of fit_transform.
Using the deprecated get_feature_names method.
5. Fill in the blank
hard

Fill all three blanks to create a document-term matrix, get feature names, and print the matrix as an array.

NLP
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Deep learning', 'Learning deep']
vectorizer = CountVectorizer()
dtm = vectorizer.[1](docs)
features = vectorizer.[2]()
print(dtm.[3]())
A. fit_transform
B. get_feature_names_out
C. toarray
D. transform
Common Mistakes
Using transform instead of fit_transform.
Using get_feature_names instead of get_feature_names_out.
Forgetting to convert the matrix to an array before printing.

Practice

1. What does a document-term matrix represent in natural language processing?
easy
A. The length of each document
B. The order of words in a sentence
C. The meaning of each word
D. Counts of words in each document

Solution

  1. Step 1: Understand the purpose of a document-term matrix

    A document-term matrix counts how many times each word appears in each document.
  2. Step 2: Compare options with this definition

    Only option D, "Counts of words in each document", matches this definition.
  3. Final Answer:

    Counts of words in each document -> Option D
  4. Quick Check:

    Document-term matrix = word counts [OK]
Hint: Remember: matrix counts words per document [OK]
Common Mistakes:
  • Confusing word order with counts
  • Thinking it shows word meanings
  • Assuming it measures document length
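To see what the matrix holds without any library, the sketch below builds one by hand with collections.Counter. The toy corpus and the whitespace tokenizer are assumptions made for illustration; CountVectorizer applies its own tokenization rules.

```python
from collections import Counter

docs = ['red blue red', 'blue green']  # hypothetical toy corpus

# Tokenize naively on whitespace; real vectorizers apply more rules.
tokenized = [doc.lower().split() for doc in docs]

# Columns: the sorted set of all terms seen in any document.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Rows: one count vector per document, aligned to the vocabulary order.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['blue', 'green', 'red']
print(matrix)      # [[1, 0, 2], [1, 1, 0]]
```

Each row is a document and each column is a term, so the matrix records counts only; word order and meaning are not stored.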
2. Which Python library provides the CountVectorizer class to create a document-term matrix?
easy
A. numpy
B. pandas
C. scikit-learn
D. matplotlib

Solution

  1. Step 1: Recall the library for text feature extraction

    CountVectorizer is part of scikit-learn, a popular machine learning library.
  2. Step 2: Verify other options

    numpy is for arrays, pandas for data frames, matplotlib for plotting, so they don't provide CountVectorizer.
  3. Final Answer:

    scikit-learn -> Option C
  4. Quick Check:

    CountVectorizer = scikit-learn [OK]
Hint: CountVectorizer is from scikit-learn, not numpy [OK]
Common Mistakes:
  • Choosing numpy because it handles arrays
  • Confusing pandas with text vectorization
  • Selecting matplotlib for visualization
3. What is the output of this Python code snippet?
from sklearn.feature_extraction.text import CountVectorizer
texts = ['cat dog', 'dog dog cat']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())
medium
A. [[1 1] [1 2]]
B. [[1 1] [2 1]]
C. [[2 1] [1 2]]
D. [[1 2] [1 1]]

Solution

  1. Step 1: Identify the vocabulary and word counts

    The texts are 'cat dog' and 'dog dog cat'. Vocabulary sorted alphabetically is ['cat', 'dog']. First document has 1 'cat' and 1 'dog'. Second document has 1 'cat' and 2 'dog's.
  2. Step 2: Form the document-term matrix

    Matrix rows correspond to documents, columns to words: [[1,1],[1,2]].
  3. Final Answer:

    [[1 1] [1 2]] -> Option A
  4. Quick Check:

    Word counts match matrix [OK]
Hint: Count words per document in alphabetical order [OK]
Common Mistakes:
  • Mixing order of words in vocabulary
  • Counting wrong number of word occurrences
  • Confusing rows and columns
4. Identify the error in this code that tries to create a document-term matrix:
from sklearn.feature_extraction.text import CountVectorizer
texts = ['apple orange', 'orange apple apple']
vectorizer = CountVectorizer()
X = vectorizer.transform(texts)
print(X.toarray())
medium
A. toarray() is not a method of X
B. Missing fit() before transform()
C. texts should be a single string, not a list
D. CountVectorizer() should be CountVector()

Solution

  1. Step 1: Understand CountVectorizer usage

    CountVectorizer requires calling fit() or fit_transform() before transform() to learn vocabulary.
  2. Step 2: Check the code sequence

    The code calls transform() directly without fit(), so no vocabulary has been learned and scikit-learn raises a NotFittedError.
  3. Final Answer:

    Missing fit() before transform() -> Option B
  4. Quick Check:

    fit() needed before transform() [OK]
Hint: Always fit before transform with CountVectorizer [OK]
Common Mistakes:
  • Skipping fit() step
  • Using wrong class name
  • Passing wrong data type to vectorizer
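The failure and its fix can be reproduced in a few lines. This is a minimal sketch, assuming scikit-learn is installed; the raised flag is only there to record whether the exception occurred.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

texts = ['apple orange', 'orange apple apple']
vectorizer = CountVectorizer()

try:
    vectorizer.transform(texts)  # no vocabulary has been learned yet
    raised = False
except NotFittedError:
    raised = True

# The fix: learn the vocabulary and transform in one call.
X = vectorizer.fit_transform(texts)
print(raised)
print(X.toarray().tolist())
```

Calling fit(texts) followed by transform(texts) would work as well; fit_transform simply combines the two steps.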
5. You have three documents: ['sun moon', 'moon moon sun', 'star sun moon']. Using CountVectorizer, what is the shape of the document-term matrix and which word has the highest total count across all documents?
hard
A. Shape (3, 3), 'moon' has highest count
B. Shape (3, 4), 'sun' has highest count
C. Shape (3, 3), 'sun' has highest count
D. Shape (3, 4), 'moon' has highest count

Solution

  1. Step 1: Identify unique words and matrix shape

    Unique words are 'sun', 'moon', 'star' -> 3 words. There are 3 documents, so shape is (3, 3).
  2. Step 2: Count total occurrences of each word

    'sun' appears 1 + 1 + 1 = 3 times, 'moon' appears 1 + 2 + 1 = 4 times, and 'star' appears 0 + 0 + 1 = 1 time. The highest count is 'moon' with 4.
  3. Final Answer:

    Shape (3, 3), 'moon' has highest count -> Option A
  4. Quick Check:

    3 docs x 3 words, moon count highest [OK]
Hint: Count unique words for shape, sum counts for highest word [OK]
Common Mistakes:
  • Counting duplicate words as unique
  • Mixing up shape dimensions
  • Incorrectly summing word counts