NLP / ML · ~20 mins

Document-term matrix in NLP - ML Experiment: Train & Evaluate

Experiment - Document-term matrix
Problem: You want to convert a small set of text documents into a document-term matrix to prepare for text analysis.
Current Metrics: The current code creates a document-term matrix that includes every word, even very common ones like 'the' and 'is', which add little analytical value.
Issue: The matrix is too large and noisy because it includes stop words and very rare words, making it harder to analyze and slowing down further processing.
Your Task
Create a cleaner document-term matrix by removing common stop words and very rare words, reducing noise and matrix size.
Use Python and scikit-learn's CountVectorizer.
Keep the vocabulary size manageable (e.g., max 10 words).
Do not use any external datasets.
Solution
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
texts = [
    'The cat sat on the mat.',
    'Dogs and cats are great pets.',
    'I love my dog.',
    'Cats are playful and cute.',
    'The dog chased the cat.'
]

# Original vectorizer without stop words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(f'Original matrix shape: {X.shape}')
print(f'Original vocabulary: {vectorizer.get_feature_names_out()}')

# Vectorizer with stop words removal and min_df to remove rare words
clean_vectorizer = CountVectorizer(stop_words='english', min_df=2, max_features=10)
X_clean = clean_vectorizer.fit_transform(texts)
print(f'Cleaned matrix shape: {X_clean.shape}')
print(f'Cleaned vocabulary: {clean_vectorizer.get_feature_names_out()}')
Added stop_words='english' to remove common English words such as 'the', 'and', and 'is'.
Added min_df=2 to ignore words that appear in fewer than 2 documents.
Added max_features=10 to cap the vocabulary at the 10 most frequent remaining words.
Printed the matrix shapes and vocabularies before and after cleaning so the two can be compared directly.
Results Interpretation

Before cleaning, the document-term matrix had 17 columns (words), including common stop words and rare words.

After cleaning, the matrix reduced to 3 columns, removing stop words and rare words, making it smaller and more focused.

Removing stop words and rare words helps create a cleaner, smaller document-term matrix that is easier to analyze and speeds up further text processing.
Bonus Experiment
Try creating a TF-IDF matrix instead of a simple count matrix to weigh words by importance.
💡 Hint
Use sklearn's TfidfVectorizer with similar parameters to see how word importance changes.