
First NLP pipeline - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this tokenization code?
Given the following Python code using NLTK, what is the output of the tokens variable?
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
text = "Hello world! This is a test."
tokens = word_tokenize(text)
print(tokens)
A. ['Hello world!', 'This is a test.']
B. ['Hello', 'world', '!', 'This', 'is', 'a', 'test', '.']
C. ['Hello', 'world!', 'This', 'is', 'a', 'test.']
D. ['Hello', 'world', 'This', 'is', 'a', 'test']
💡 Hint
Think about how word_tokenize splits punctuation as separate tokens.
Model Choice
intermediate
Which model is best for sentiment analysis in an NLP pipeline?
You want to build a simple NLP pipeline to classify movie reviews as positive or negative. Which model is most suitable?
A. A convolutional neural network for image classification
B. A K-Means clustering model
C. A linear regression model
D. A pretrained BERT model fine-tuned on sentiment data
💡 Hint
Sentiment analysis is a text classification task; choose a model designed for text understanding.
Hyperparameter
advanced
Which hyperparameter affects the number of words considered in a Bag-of-Words model?
In a Bag-of-Words NLP pipeline using CountVectorizer, which hyperparameter controls the maximum number of words (features) to keep?
A. stop_words
B. min_df
C. max_features
D. ngram_range
💡 Hint
This parameter caps the vocabulary size, keeping only the most frequent terms across the corpus.
Metrics
advanced
Which metric is best to evaluate an imbalanced text classification model?
You trained an NLP model to detect spam emails, but spam emails are only 5% of your data. Which metric is best to evaluate your model?
A. Precision and Recall
B. Accuracy
C. Mean Squared Error
D. R-squared
💡 Hint
Accuracy can be misleading when classes are imbalanced.
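To make the hint concrete, here is a small sketch in plain Python. The predictions are invented for illustration and do not come from any real model; the only detail taken from the question is the 5% spam rate.

```python
# 100 emails, 5 of them spam (label 1), matching the 5% rate in the question.
y_true = [1] * 5 + [0] * 95

# Hypothetical predictions (made up for illustration): the model catches
# 2 of the 5 spam emails and wrongly flags 1 legitimate email.
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate mail flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate mail passed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.96 precision=0.67 recall=0.40
```

The model misses three of the five spam emails, yet accuracy still reads 0.96 because the 95 legitimate emails dominate the count.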
🔧 Debug
expert
Why does this NLP pipeline code raise a KeyError?
Consider this code snippet for text preprocessing:
from sklearn.feature_extraction.text import CountVectorizer
texts = ['I love AI', 'AI is fun']
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
print(vectorizer.vocabulary_['AI'])
Why does it raise a KeyError for 'AI'?
A. 'AI' is not in the vocabulary because CountVectorizer lowercases all tokens by default
B. 'AI' is removed because it is considered a stop word in English
C. The vocabulary_ attribute is not a dictionary, causing the error
D. The fit_transform method was not called before accessing vocabulary_
💡 Hint
Check how CountVectorizer processes text tokens before building vocabulary.

Practice

1. What is the main purpose of an NLP pipeline in machine learning?
easy
A. To translate text into different languages automatically
B. To store large amounts of text data
C. To process text step-by-step for making predictions
D. To create images from text

Solution

  1. Step 1: Understand the role of an NLP pipeline

    An NLP pipeline breaks down text processing into steps like cleaning, vectorizing, and modeling.
  2. Step 2: Identify the goal of these steps

    The goal is to prepare text data so a model can make predictions, such as classifying or understanding text.
  3. Final Answer:

    To process text step-by-step for making predictions -> Option C
  4. Quick Check:

    NLP pipeline = step-by-step text processing for predictions [OK]
Hint: A pipeline means step-by-step processing
Common Mistakes:
  • Thinking pipeline stores data only
  • Confusing pipeline with translation tools
  • Assuming pipeline creates images
2. Which of the following is the correct way to import a text vectorizer from scikit-learn for an NLP pipeline?
easy
A. import CountVectorizer from sklearn.text
B. from sklearn.feature_extraction.text import CountVectorizer
C. from sklearn.vectorizer import TextCount
D. import text_vectorizer from sklearn.feature

Solution

  1. Step 1: Recall the correct module for text vectorizers

    Scikit-learn provides CountVectorizer in the feature_extraction.text module.
  2. Step 2: Check the import syntax

    The correct syntax is: from sklearn.feature_extraction.text import CountVectorizer.
  3. Final Answer:

    from sklearn.feature_extraction.text import CountVectorizer -> Option B
  4. Quick Check:

    Correct import = from sklearn.feature_extraction.text import CountVectorizer [OK]
Hint: CountVectorizer lives in sklearn.feature_extraction.text
Common Mistakes:
  • Using wrong module names
  • Incorrect import syntax
  • Confusing class names
3. Given the following code snippet, what will be the output of print(X.toarray())?
from sklearn.feature_extraction.text import CountVectorizer
texts = ['cat and dog', 'dog and mouse']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.toarray())
medium
A. [[1 1 1 0] [1 0 1 1]]
B. [[1 0 1 1] [1 1 0 1]]
C. [[1 1 0 1] [1 0 1 1]]
D. [[0 1 1 1] [1 1 1 0]]

Solution

  1. Step 1: Identify the vocabulary from the texts

    The texts are 'cat and dog' and 'dog and mouse'. The unique words are: 'and', 'cat', 'dog', 'mouse'. CountVectorizer sorts them alphabetically: ['and', 'cat', 'dog', 'mouse'].
  2. Step 2: Map each text to counts of these words

    First text: 'cat and dog' -> counts: and=1, cat=1, dog=1, mouse=0 -> [1 1 1 0]. Second text: 'dog and mouse' -> counts: and=1, cat=0, dog=1, mouse=1 -> [1 0 1 1].
  3. Final Answer:

    [[1 1 1 0] [1 0 1 1]] -> Option A
  4. Quick Check:

    Vocabulary order and counts match [[1 1 1 0] [1 0 1 1]] [OK]
Hint: CountVectorizer orders its vocabulary columns alphabetically
Common Mistakes:
  • Mixing word order in output
  • Confusing counts of words
  • Assuming different vocabulary order
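The two solution steps can be reproduced by hand. The sketch below is a simplified stand-in for CountVectorizer's default behavior (lowercasing, tokens of two or more word characters, sorted vocabulary), written in plain Python; it is not scikit-learn's actual implementation, and the helper name tokenize is made up for this example.

```python
import re

texts = ["cat and dog", "dog and mouse"]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (an approximation of CountVectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: vocabulary = unique tokens in sorted order, mapped to column indices.
unique_tokens = sorted({token for doc in texts for token in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(unique_tokens)}

# Step 2: one row per document, one column per vocabulary word.
matrix = []
for doc in texts:
    row = [0] * len(vocabulary)
    for token in tokenize(doc):
        row[vocabulary[token]] += 1
    matrix.append(row)

print(vocabulary)  # {'and': 0, 'cat': 1, 'dog': 2, 'mouse': 3}
print(matrix)      # [[1, 1, 1, 0], [1, 0, 1, 1]]
```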
4. You wrote this code but get an error: AttributeError: 'CountVectorizer' object has no attribute 'transform_text'. What is the likely fix?
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.transform_text(['hello world'])
medium
A. Replace transform_text with transform
B. Import CountVectorizer from a different module
C. Call fit before transform_text
D. Use fit_transform_text instead

Solution

  1. Step 1: Identify the incorrect method name

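    The error says the 'CountVectorizer' object has no attribute 'transform_text'. The method that converts text to counts is 'transform'.
  2. Step 2: Correct the method call

    Replace transform_text with transform. Note that transform only works on a fitted vectorizer, so call fit on the training texts first (or use fit_transform); otherwise scikit-learn raises a NotFittedError.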
  3. Final Answer:

    Replace transform_text with transform -> Option A
  4. Quick Check:

    Correct method name is transform [OK]
Hint: Check method names carefully in the docs
Common Mistakes:
  • Using non-existent method names
  • Not reading error messages
  • Trying to call fit_transform_text which doesn't exist
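A minimal sketch of the corrected call, assuming the vectorizer is first fitted on some text, since transform needs a learned vocabulary. The second input sentence is made up to show how unseen words are handled.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# fit learns the vocabulary from the given texts.
vectorizer.fit(["hello world"])

# transform maps texts to counts using that vocabulary;
# words not seen during fit (here "there") are ignored.
X = vectorizer.transform(["hello world", "hello there"])

print(sorted(vectorizer.vocabulary_))  # ['hello', 'world']
print(X.toarray())
```

Calling fit_transform on the training texts combines both steps in one call.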
5. You want to build a simple NLP pipeline that converts text to numbers and then trains a logistic regression model to classify text. Which sequence of steps is correct?
hard
A. Predict on new text -> Vectorize text -> Train logistic regression
B. Train logistic regression -> Vectorize text -> Predict on new text
C. Vectorize text -> Predict on new text -> Train logistic regression
D. Vectorize text -> Train logistic regression -> Predict on new text

Solution

  1. Step 1: Understand the pipeline order

    First, text must be converted to numbers using vectorization before training a model.
  2. Step 2: Follow logical flow

    After vectorizing, train the logistic regression model, then use it to predict on new vectorized text.
  3. Final Answer:

    Vectorize text -> Train logistic regression -> Predict on new text -> Option D
  4. Quick Check:

    Correct pipeline order = Vectorize text -> Train logistic regression -> Predict on new text [OK]
Hint: Always vectorize before training or predicting
Common Mistakes:
  • Trying to train before vectorizing
  • Predicting before training
  • Skipping vectorization step
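The order in Option D can be sketched with scikit-learn as follows. The four training reviews and their labels are invented for illustration, and a dataset this small is only meant to show the sequence of calls, not to produce a reliable classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set (1 = positive, 0 = negative), for illustration only.
train_texts = ["great movie", "loved it", "terrible film", "hated it"]
train_labels = [1, 1, 0, 0]

# 1. Vectorize text: learn the vocabulary and convert training text to counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# 2. Train logistic regression on the numeric features.
model = LogisticRegression()
model.fit(X_train, train_labels)

# 3. Predict on new text: reuse the fitted vectorizer (transform, not
#    fit_transform) so the columns match those seen during training.
X_new = vectorizer.transform(["great film, loved it"])
prediction = model.predict(X_new)
print(prediction)
```

Reusing the same fitted vectorizer in step 3 is the key design point: refitting on new text would build a different vocabulary, and the model's learned weights would no longer line up with the columns.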