You want to find 3 topics from a set of documents but also want to ignore very common words like 'the' and 'and'. Which combination of scikit-learn tools is best?

Hard · Conceptual · Q8 of 15
NLP - Topic Modeling
A. Use CountVectorizer with stop_words='english' and then fit LDA with n_components=3
B. Use TfidfVectorizer with n_components=3 directly for LDA
C. Use CountVectorizer without stop words and set n_components=1 in LDA
D. Use StandardScaler on raw text and then fit LDA with n_components=3
Step-by-Step Solution
Solution:
  1. Remove common stop words before LDA

    CountVectorizer drops English stop words such as 'the' and 'and' when it is created with stop_words='english'.
  2. Set the number of topics in LDA

    Set n_components=3 on LatentDirichletAllocation to find 3 topics. LDA models raw word counts, so CountVectorizer output is the appropriate input. TF-IDF weights from TfidfVectorizer are not counts, and TfidfVectorizer has no n_components parameter.
  3. Final Answer:

    Option A: Use CountVectorizer with stop_words='english' and then fit LDA with n_components=3
  4. Quick Check:

    Stop-word removal (stop_words='english') plus n_components=3 matches Option A [OK]
Quick Trick: Remove stop words with CountVectorizer before LDA [OK]
Common Mistakes:
  • Using TfidfVectorizer directly for LDA
  • Not removing stop words
  • Setting n_components lower than the number of topics you want (for example 1 instead of 3)
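
Two of the wrong options fail before any topic model is fitted, which a short check can make concrete. The sketch below is illustrative: the variable names and the two sample strings are assumptions, and the exact error messages depend on the installed scikit-learn version, so they are printed rather than quoted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Option B: n_components belongs to LDA, not to the vectorizer.
try:
    TfidfVectorizer(n_components=3)
    tfidf_accepts_n_components = True
except TypeError as err:
    tfidf_accepts_n_components = False
    print("TfidfVectorizer:", err)

# Option D: StandardScaler expects numeric features, not raw text.
try:
    StandardScaler().fit([["the cat and the dog"], ["the market and the team"]])
    scaler_accepts_text = True
except ValueError as err:
    scaler_accepts_text = False
    print("StandardScaler:", err)
```

Text has to be turned into a numeric document-term matrix first, which is the job CountVectorizer does in Option A.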
