from sklearn.feature_extraction.text import CountVectorizer
texts = ['I love AI', 'AI is fun']
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
print(vectorizer.vocabulary_['AI'])

Practice

1. What is the main purpose of an NLP pipeline in machine learning?

easy

A. To translate text into different languages automatically

B. To store large amounts of text data

C. To process text step-by-step for making predictions

D. To create images from text

Solution

Step 1: Understand the role of an NLP pipeline
An NLP pipeline breaks down text processing into steps like cleaning, vectorizing, and modeling.
Step 2: Identify the goal of these steps
The goal is to prepare text data so a model can make predictions, such as classifying or understanding text.
Final Answer:
To process text step-by-step for making predictions -> Option C
Quick Check:
NLP pipeline = step-by-step text processing for predictions [OK]

Hint: Remember: pipeline means step-by-step processing [OK]

Common Mistakes:

Thinking pipeline stores data only
Confusing pipeline with translation tools
Assuming pipeline creates images
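
To make the "cleaning" step mentioned in the solution concrete, here is a minimal sketch using only the Python standard library. The helper name clean and the sample strings are hypothetical, chosen for illustration; real pipelines often clean text differently (or leave it to the vectorizer).

```python
import re


def clean(text):
    """Hypothetical cleaning step: lowercase, drop punctuation, collapse spaces."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


raw = ["I LOVE this!!!", "  Not   good... "]
cleaned = [clean(t) for t in raw]
print(cleaned)  # ['i love this', 'not good']
```

The cleaned strings would then go to the vectorizing step, and the resulting numbers to the model.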

2. Which of the following is the correct way to import a text vectorizer from scikit-learn for an NLP pipeline?

easy

A. import CountVectorizer from sklearn.text

B. from sklearn.feature_extraction.text import CountVectorizer

C. from sklearn.vectorizer import TextCount

D. import text_vectorizer from sklearn.feature

Solution

Step 1: Recall the correct module for text vectorizers
Scikit-learn provides CountVectorizer in the feature_extraction.text module.
Step 2: Check the import syntax
The correct syntax is: from sklearn.feature_extraction.text import CountVectorizer.
Final Answer:
from sklearn.feature_extraction.text import CountVectorizer -> Option B
Quick Check:
Correct import = from sklearn.feature_extraction.text import CountVectorizer [OK]

Hint: Remember: CountVectorizer is in feature_extraction.text [OK]

Common Mistakes:

Using wrong module names
Incorrect import syntax
Confusing class names

3. Given the following code snippet, what will be the output of print(X.toarray())?

from sklearn.feature_extraction.text import CountVectorizer
texts = ['cat and dog', 'dog and mouse']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())

medium

A. [[1 1 1 0] [1 0 1 1]]

B. [[1 0 1 1] [1 1 0 1]]

C. [[1 1 0 1] [1 0 1 1]]

D. [[0 1 1 1] [1 1 1 0]]

Solution

Step 1: Identify the vocabulary from the texts
The texts are 'cat and dog' and 'dog and mouse'. The unique words are: 'and', 'cat', 'dog', 'mouse'. CountVectorizer sorts them alphabetically: ['and', 'cat', 'dog', 'mouse'].
Step 2: Map each text to counts of these words
First text: 'cat and dog' -> counts: and=1, cat=1, dog=1, mouse=0 -> [1 1 1 0]. Second text: 'dog and mouse' -> counts: and=1, cat=0, dog=1, mouse=1 -> [1 0 1 1].
Final Answer:
[[1 1 1 0] [1 0 1 1]] -> Option A
Quick Check:
Vocabulary order and counts match [[1 1 1 0] [1 0 1 1]] [OK]

Hint: Remember: CountVectorizer sorts words alphabetically [OK]

Common Mistakes:

Mixing word order in output
Confusing counts of words
Assuming different vocabulary order

4. You wrote this code but get an error: AttributeError: 'CountVectorizer' object has no attribute 'transform_text'. What is the likely fix?

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.transform_text(['hello world'])

medium

A. Replace transform_text with transform

B. Import CountVectorizer from a different module

C. Call fit before transform_text

D. Use fit_transform_text instead

Solution

Step 1: Identify the incorrect method name
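The error says 'CountVectorizer' has no attribute 'transform_text'. No such method exists; the correct method is 'transform'.
Step 2: Correct the method call
Replace transform_text with transform to fix the AttributeError. Note that transform only works on a fitted vectorizer, so call fit (or use fit_transform) on your text first; otherwise scikit-learn raises a NotFittedError.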
Final Answer:
Replace transform_text with transform -> Option A
Quick Check:
Correct method name is transform [OK]

Hint: Check method names carefully in docs [OK]

Common Mistakes:

Using non-existent method names
Not reading error messages
Trying to call fit_transform_text which doesn't exist
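
A short sketch of the fix, under the assumption that the vectorizer has not been fitted yet, as in the question's snippet. It shows that transform_text is not an attribute of CountVectorizer, that transform on an unfitted vectorizer raises NotFittedError, and that fitting first makes the call succeed.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# transform_text does not exist, which is why the AttributeError appears.
has_method = hasattr(vectorizer, "transform_text")
print(has_method)  # False

# transform exists, but it needs a fitted vocabulary.
try:
    vectorizer.transform(['hello world'])
    raised = False
except NotFittedError:
    raised = True
print(raised)  # True

# Fit first, then transform.
vectorizer.fit(['hello world'])
X = vectorizer.transform(['hello world'])
print(X.toarray())  # [[1 1]]
```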

5. You want to build a simple NLP pipeline that converts text to numbers and then trains a logistic regression model to classify text. Which sequence of steps is correct?

hard

A. Predict on new text -> Vectorize text -> Train logistic regression

B. Train logistic regression -> Vectorize text -> Predict on new text

C. Vectorize text -> Predict on new text -> Train logistic regression

D. Vectorize text -> Train logistic regression -> Predict on new text

Solution

Step 1: Understand the pipeline order
First, text must be converted to numbers using vectorization before training a model.
Step 2: Follow logical flow
After vectorizing, train the logistic regression model, then use it to predict on new vectorized text.
Final Answer:
Vectorize text -> Train logistic regression -> Predict on new text -> Option D
Quick Check:
Correct pipeline order = Vectorize text -> Train logistic regression -> Predict on new text [OK]

Hint: Always vectorize before training or predicting [OK]

Common Mistakes:

Trying to train before vectorizing
Predicting before training
Skipping vectorization step
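
The order in Option D can be sketched with scikit-learn's make_pipeline, which chains the vectorizer and the classifier so that vectorization always runs before training and before prediction. The tiny dataset and its labels are made up for illustration; a model trained on four sentences is not meaningful beyond showing the flow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
train_texts = ['great movie', 'loved it', 'terrible film', 'hated it']
train_labels = [1, 1, 0, 0]

# Step 1 (vectorize) and step 2 (train) are chained in one object.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Step 3: predict. The pipeline vectorizes the new text with the
# vocabulary learned during fit, then passes it to the classifier.
prediction = model.predict(['loved the movie'])
print(prediction)
```

Using a pipeline object rather than separate calls makes it harder to forget vectorizing new text, which is one of the common mistakes listed above.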

First NLP pipeline - Practice Problems & Coding Challenges
