NLP / ML · ~20 mins

Document-term matrix in NLP - ML Experiment: Train & Evaluate

Experiment - Document-term matrix
Problem: You want to convert a small set of text documents into a document-term matrix to prepare for text analysis.
Current Metrics: The current code creates a document-term matrix that includes every word, even very common ones like 'the' and 'is', which add little analytical value.
Issue: The matrix is too large and noisy because it includes stop words and very rare words, making it harder to analyze and slowing down further processing.
Your Task
Create a cleaner document-term matrix by removing common stop words and very rare words, reducing noise and matrix size.
Use Python and scikit-learn's CountVectorizer.
Keep the vocabulary size manageable (e.g., max 10 words).
Do not use any external datasets.
Solution
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
texts = [
    'The cat sat on the mat.',
    'Dogs and cats are great pets.',
    'I love my dog.',
    'Cats are playful and cute.',
    'The dog chased the cat.'
]

# Original vectorizer without stop words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(f'Original matrix shape: {X.shape}')
print(f'Original vocabulary: {vectorizer.get_feature_names_out()}')

# Vectorizer with stop words removal and min_df to remove rare words
clean_vectorizer = CountVectorizer(stop_words='english', min_df=2, max_features=10)
X_clean = clean_vectorizer.fit_transform(texts)
print(f'Cleaned matrix shape: {X_clean.shape}')
print(f'Cleaned vocabulary: {clean_vectorizer.get_feature_names_out()}')
Added stop_words='english' to remove common English words such as 'the', 'and', and 'is'.
Added min_df=2 to ignore words that appear in fewer than 2 documents.
Added max_features=10 to cap the vocabulary at the 10 most frequent remaining words.
Printed the matrix shapes and vocabularies before and after cleaning so the two can be compared directly.
Results Interpretation

Before cleaning, the document-term matrix had 17 columns (words), including common stop words and rare words.

After cleaning, the matrix reduced to 3 columns, removing stop words and rare words, making it smaller and more focused.

Removing stop words and rare words helps create a cleaner, smaller document-term matrix that is easier to analyze and speeds up further text processing.
Bonus Experiment
Try creating a TF-IDF matrix instead of a simple count matrix to weigh words by importance.
💡 Hint
Use sklearn's TfidfVectorizer with similar parameters to see how word importance changes.