Text preprocessing pipelines in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
What is the output of this text tokenization code?
Consider the following Python code, which tokenizes a sentence into words by calling split() with no arguments. What list does it print?
sentence = "Hello, world! Welcome to AI." 
tokens = sentence.split()
print(tokens)
A. ['Hello,', 'world!', 'Welcome', 'to', 'AI.']
B. ['Hello', 'world', 'Welcome', 'to', 'AI']
C. ['Hello', ',', 'world', '!', 'Welcome', 'to', 'AI', '.']
D. ['Hello world Welcome to AI']
💡 Hint
Called with no arguments, split() splits on runs of whitespace and leaves punctuation attached to the words.
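To make the hint concrete, here is a minimal sketch comparing whitespace splitting with two regex-based alternatives. The example sentence and the variable names are illustrative assumptions, not part of the challenge.

```python
import re

text = "Good morning, team! Ready to start?"

# Whitespace splitting: punctuation stays attached to the neighboring word.
whitespace_tokens = text.split()

# Regex tokenization: runs of word characters, or single punctuation marks.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Keeping only runs of word characters drops punctuation entirely.
word_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)
print(punct_tokens)
print(word_tokens)
```

Which variant is appropriate depends on the task: some models benefit from punctuation as separate tokens, while bag-of-words features often discard it.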
🧠 Conceptual
intermediate
Which step is NOT typically part of a text preprocessing pipeline?
In a typical text preprocessing pipeline for NLP tasks, which of the following steps is usually NOT included?
A. Applying stemming or lemmatization
B. Removing stopwords
C. Training a neural network model
D. Lowercasing all text
💡 Hint
Preprocessing prepares data before modeling.
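As a rough illustration of what such a pipeline can look like, the sketch below chains lowercasing, punctuation removal, whitespace tokenization, and stopword removal. The `preprocess` function and the tiny `STOPWORDS` set are hypothetical and chosen only for this example; stemming or lemmatization is left out because it usually relies on an external library.

```python
import string

# Hypothetical stopword set for illustration; real pipelines use larger lists.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of"}

def preprocess(text, stopwords=STOPWORDS):
    """Return cleaned tokens: lowercase, no punctuation, no stopwords."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in stopwords]

print(preprocess("The cat is going to the market, and quickly!"))
```

The function only transforms text; nothing here fits parameters to data, which is the line that separates preprocessing from modeling.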
Metrics
advanced
What is the vocabulary size after preprocessing?
Given the following sentences, what is the size of the vocabulary (the number of unique words) after converting all text to lowercase, removing punctuation, and tokenizing on whitespace?
1. "AI is fun!"
2. "Fun with AI and machine learning."
3. "Learning AI is exciting."
import string
sentences = ["AI is fun!", "Fun with AI and machine learning.", "Learning AI is exciting."]
vocab = set()
for sent in sentences:
    sent = sent.lower()
    sent = sent.translate(str.maketrans('', '', string.punctuation))
    tokens = sent.split()
    vocab.update(tokens)
print(len(vocab))
A. 10
B. 8
C. 9
D. 7
💡 Hint
Count unique words after cleaning and splitting.
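A related sketch, on a different set of sentences, counts token frequencies with `collections.Counter` and assigns each vocabulary word an integer id. The sentences, the `tokenize` helper, and the alphabetical id ordering are assumptions made for this illustration.

```python
import string
from collections import Counter

docs = ["Cats chase mice.", "Mice fear cats!", "Dogs chase cats."]

def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Frequency of each token across all documents.
counts = Counter(token for doc in docs for token in tokenize(doc))

# Map each unique word to an integer id, in alphabetical order.
word_to_id = {word: idx for idx, word in enumerate(sorted(counts))}

print(len(counts))             # vocabulary size
print(counts.most_common(1))   # most frequent token with its count
print(word_to_id)
```

Keeping the counts, rather than only a set, makes it possible to later drop rare words or cap the vocabulary size.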
🔧 Debug
advanced
Does this text normalization code raise an error?
Examine the code below, which attempts to normalize text by removing digits and converting to lowercase. Does it raise an error, and if so, why?
import re
text = "AI version 2.0 is here!"
normalized = re.sub('[0-9]+', '', text).lower()
print(normalized)
A. Raises AttributeError because lower() is called on None
B. Raises TypeError because re.sub returns bytes, not string
C. Raises SyntaxError due to incorrect regex pattern
D. No error; the code runs and outputs 'ai version . is here!'
💡 Hint
Check the return type of re.sub and method chaining.
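The sketch below, using a different example string, illustrates two points about `re.sub`: it returns a `str` when given a `str`, so string methods can be chained onto the result, and removing only digit runs can leave stray punctuation behind. The second pattern, which treats a dotted number as one unit, is one possible alternative and an assumption of this example.

```python
import re

text = "Release 3.14 ships today!"

# re.sub returns a str for str input, so .lower() can be chained directly.
digits_removed = re.sub(r"[0-9]+", "", text)
print(type(digits_removed).__name__)
print(digits_removed.lower())

# Alternative: remove dotted numbers as a unit, then collapse extra spaces.
numbers_removed = re.sub(r"\d+(?:\.\d+)*", "", text)
normalized = re.sub(r"\s+", " ", numbers_removed).strip().lower()
print(normalized)
```

Whether numbers should be deleted, replaced with a placeholder token, or kept depends on the downstream task.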
Model Choice
expert
Which model is best suited for text classification after preprocessing?
After text preprocessing (tokenization, stopword removal, and vectorization), which of the models below is the most suitable for classifying short text documents?
A. Recurrent Neural Network (RNN) or LSTM tailored for sequences
B. Convolutional Neural Network (CNN) designed for images
C. Linear Regression model
D. K-Means clustering algorithm
💡 Hint
Consider models that handle sequences and context well.
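Sequence models typically consume fixed-length lists of integer ids rather than raw tokens. The sketch below shows one way the vectorization step might produce such input; the reserved `PAD` and `UNK` ids and the `build_vocab` and `encode` helpers are hypothetical names for this illustration, not a specific library's API.

```python
PAD, UNK = 0, 1  # hypothetical reserved ids: padding and unknown token

def build_vocab(tokenized_docs):
    """Assign ids starting at 2 in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 2
    return vocab

def encode(tokens, vocab, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))

train = [["great", "movie"], ["boring", "plot", "overall"]]
vocab = build_vocab(train)

print(encode(["great", "plot"], vocab, 4))
print(encode(["terrible", "movie", "boring", "plot", "overall"], vocab, 4))
```

Unlike a bag-of-words vector, this encoding preserves token order, which is the information a sequence model is designed to use.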