A document-term matrix helps us turn text into numbers so computers can understand and learn from it.
Document-term matrix in NLP
Introduction
Syntax
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(documents)

    # dtm is a matrix where rows are documents and columns are words
    # dtm[i, j] shows how many times word j appears in document i
CountVectorizer converts text to a matrix of word counts.
The fit_transform method learns the vocabulary and creates the matrix in one step.
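As a minimal sketch of what "one step" means, the example below compares fit_transform with calling fit and then transform separately. The two sample sentences and the variable names are illustrative assumptions, not part of the original lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sample data (an assumption for this sketch)
documents = ["red apple", "green apple pie"]

# One step: learn the vocabulary and build the matrix together
one_step = CountVectorizer().fit_transform(documents)

# Two steps: learn the vocabulary first, then build the matrix
two_step_vectorizer = CountVectorizer()
two_step_vectorizer.fit(documents)
two_step = two_step_vectorizer.transform(documents)

# Both routes should produce the same counts
same_counts = bool((one_step.toarray() == two_step.toarray()).all())
print(same_counts)
```

The two-step form is mainly useful when the vocabulary is learned from one set of documents and then applied to another with transform.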
Examples
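    from sklearn.feature_extraction.text import CountVectorizer

    documents = ["I love cats", "Cats love fish"]
    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(documents)

    # Columns are cats, fish, love; the default tokenizer drops one-letter words such as "I"
    print(dtm.toarray())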
    # Remove common English stop words before counting
    vectorizer = CountVectorizer(stop_words='english')
    dtm = vectorizer.fit_transform(documents)
    print(vectorizer.get_feature_names_out())
Sample Model
This program turns three sentences into a matrix showing how often each word appears in each sentence.
    from sklearn.feature_extraction.text import CountVectorizer

    # Sample documents
    documents = [
        "Machine learning is fun",
        "Learning machines be fun",
        "Fun with machine learning"
    ]

    # Create the vectorizer
    vectorizer = CountVectorizer()

    # Fit and transform the documents into a document-term matrix
    dtm = vectorizer.fit_transform(documents)

    # Show the feature names (words)
    print("Words:", vectorizer.get_feature_names_out())

    # Show the document-term matrix as an array
    print("Document-Term Matrix:\n", dtm.toarray())
Important Notes
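The document-term matrix is usually very sparse: each document uses only a small part of the vocabulary, so most entries are zero. For this reason fit_transform returns a SciPy sparse matrix, and toarray() is needed to print it as a dense array.
You can use other vectorizers, such as TfidfVectorizer, to weight words by TF-IDF instead of raw counts.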
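The two notes above can be sketched as follows. The three sample sentences and all variable names are assumptions chosen for illustration. The sketch counts how many cells of the matrix are nonzero, then builds a TF-IDF matrix with scikit-learn's default settings and compares the weight of a word that appears in one document with a word that appears in two.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative sample data (an assumption for this sketch)
documents = [
    "cats chase mice",
    "dogs chase cats",
    "birds eat seeds",
]

# Raw counts: the result is a SciPy sparse matrix that stores only nonzero cells
counts = CountVectorizer().fit_transform(documents)
total_cells = counts.shape[0] * counts.shape[1]
nonzero_cells = counts.nnz
print(f"{nonzero_cells} of {total_cells} cells are nonzero")

# TF-IDF weights: words found in fewer documents get a larger weight
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(documents)
weights = tfidf.toarray()
column_of = tfidf_vectorizer.vocabulary_

# In the first document, "mice" is rarer across the corpus than "cats"
mice_weight = weights[0, column_of["mice"]]
cats_weight = weights[0, column_of["cats"]]
print(mice_weight > cats_weight)
```

With larger vocabularies the share of nonzero cells typically drops much further, which is why the sparse format matters.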
Summary
A document-term matrix changes text into numbers by counting words.
It helps computers understand and compare documents.
CountVectorizer from scikit-learn is a simple way to create this matrix.
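As a hedged illustration of comparing documents, the sketch below computes cosine similarity between the rows of a document-term matrix. The sample sentences are assumptions, and cosine similarity is one common choice of comparison, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative sample data (an assumption for this sketch)
documents = [
    "cats chase mice",
    "mice chase cats",
    "stocks fell sharply",
]

dtm = CountVectorizer().fit_transform(documents)

# similarity[i, j] compares the word counts of document i and document j
similarity = cosine_similarity(dtm)
print(similarity.round(2))
```

The first two documents contain the same words in a different order, so their rows are identical and they compare as maximally similar. The matrix records counts only, not word order.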
Practice
1. What does a document-term matrix represent in natural language processing?
easy
Solution
Step 1: Understand the purpose of a document-term matrix
A document-term matrix counts how many times each word appears in each document.
Step 2: Compare options with this definition
Only "Counts of words in each document" correctly describes this counting process.
Final Answer:
Counts of words in each document -> Option D
Quick Check:
Document-term matrix = word counts [OK]
Hint: Remember: matrix counts words per document [OK]
Common Mistakes:
- Confusing word order with counts
- Thinking it shows word meanings
- Assuming it measures document length
2. Which Python library provides the CountVectorizer class to create a document-term matrix?
easy
Solution
Step 1: Recall the library for text feature extraction
CountVectorizer is part of scikit-learn, a popular machine learning library.
Step 2: Verify other options
numpy is for arrays, pandas is for data frames, and matplotlib is for plotting, so none of them provides CountVectorizer.
Final Answer:
scikit-learn -> Option C
Quick Check:
CountVectorizer = scikit-learn [OK]
Hint: CountVectorizer is from scikit-learn, not numpy [OK]
Common Mistakes:
- Choosing numpy because it handles arrays
- Confusing pandas with text vectorization
- Selecting matplotlib for visualization
3. What is the output of this Python code snippet?
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ['cat dog', 'dog dog cat']
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    print(X.toarray())
medium
Solution
Step 1: Identify the vocabulary and word counts
The texts are 'cat dog' and 'dog dog cat'. The vocabulary, sorted alphabetically, is ['cat', 'dog']. The first document has 1 'cat' and 1 'dog'. The second document has 1 'cat' and 2 'dog's.
Step 2: Form the document-term matrix
Matrix rows correspond to documents and columns to words: [[1,1],[1,2]].
Final Answer:
[[1 1] [1 2]] -> Option A
Quick Check:
Word counts match matrix [OK]
Hint: Count words per document in alphabetical order [OK]
Common Mistakes:
- Mixing order of words in vocabulary
- Counting wrong number of word occurrences
- Confusing rows and columns
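For question 3, one way to avoid mixing up the column order is to ask the vectorizer which column each word landed in. The sketch below uses the same texts as the question. The names column_of and dog_counts are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['cat dog', 'dog dog cat']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# vocabulary_ maps each word to its column index in the matrix
column_of = vectorizer.vocabulary_
print(column_of)

# Read one column: how often 'dog' appears in each document
dog_counts = X.toarray()[:, column_of['dog']]
print(dog_counts)
```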
4. Identify the error in this code that tries to create a document-term matrix:
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ['apple orange', 'orange apple apple']
    vectorizer = CountVectorizer()
    X = vectorizer.transform(texts)
    print(X.toarray())
medium
Solution
Step 1: Understand CountVectorizer usage
CountVectorizer must learn a vocabulary with fit() or fit_transform() before transform() can be called.
Step 2: Check the code sequence
The code calls transform() directly on a vectorizer that was never fitted, which raises a NotFittedError.
Final Answer:
Missing fit() before transform() -> Option B
Quick Check:
fit() needed before transform() [OK]
Hint: Always fit before transform with CountVectorizer [OK]
Common Mistakes:
- Skipping fit() step
- Using wrong class name
- Passing wrong data type to vectorizer
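For question 4, a minimal sketch of the failure and one possible fix follows. It uses the same texts as the question. The raised flag is an illustrative helper, and replacing transform() with fit_transform() is only one of the valid fixes (calling fit() first works too).

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

texts = ['apple orange', 'orange apple apple']

# Calling transform() on an unfitted vectorizer fails
try:
    CountVectorizer().transform(texts)
    raised = False
except NotFittedError:
    raised = True
print("transform() before fit() raised an error:", raised)

# Fix: learn the vocabulary and build the matrix together
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())
```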
5. You have three documents: ['sun moon', 'moon moon sun', 'star sun moon']. Using CountVectorizer, what is the shape of the document-term matrix and which word has the highest total count across all documents?
hard
Solution
Step 1: Identify unique words and matrix shape
Unique words are 'sun', 'moon', 'star' -> 3 words. There are 3 documents, so the shape is (3, 3).
Step 2: Count total occurrences of each word
'sun': appears 1 + 1 + 1 = 3 times
'moon': appears 1 + 2 + 1 = 4 times
'star': appears 0 + 0 + 1 = 1 time
Highest count is 'moon' with 4.
Final Answer:
Shape (3, 3), 'moon' has highest count -> Option A
Quick Check:
3 docs x 3 words, moon count highest [OK]
Hint: Count unique words for shape, sum counts for highest word [OK]
Common Mistakes:
- Counting duplicate words as unique
- Mixing up shape dimensions
- Incorrectly summing word counts
