Challenge - 5 Problems
Text Preprocessing Master
❓ Predict Output
Intermediate (time limit 1:30)
What is the output of this text tokenization code?
Consider the following Python code that tokenizes a sentence into words using a simple split method. What is the output list?
NLP
sentence = "Hello, world! Welcome to AI."
tokens = sentence.split()
print(tokens)
💡 Hint
With no arguments, split() splits on runs of whitespace and does not remove punctuation.
✓ Explanation
Called with no arguments, split() splits the string on whitespace, so punctuation stays attached to the words. The output is ['Hello,', 'world!', 'Welcome', 'to', 'AI.'].
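To make the behaviour concrete, here is a minimal sketch that contrasts the whitespace split from the challenge with a regex-based tokenizer. The `re.findall` alternative is an assumption added for comparison and is not part of the original challenge.

```python
import re

sentence = "Hello, world! Welcome to AI."

# Whitespace split: punctuation stays attached to the neighbouring word.
whitespace_tokens = sentence.split()

# Regex alternative (an assumption, not part of the challenge):
# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped.
word_tokens = re.findall(r"\w+", sentence)

print(whitespace_tokens)  # ['Hello,', 'world!', 'Welcome', 'to', 'AI.']
print(word_tokens)        # ['Hello', 'world', 'Welcome', 'to', 'AI']
```

The regex version is only one of several possible choices; it would also split contractions such as "don't" into two tokens, which may or may not be what a given task needs.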
🧠 Conceptual
Intermediate (time limit 1:30)
Which step is NOT typically part of a text preprocessing pipeline?
In a typical text preprocessing pipeline for NLP tasks, which of the following steps is usually NOT included?
💡 Hint
Preprocessing prepares data before modeling.
✓ Explanation
Training a model is a separate step that follows preprocessing. Preprocessing itself only cleans and prepares the text.
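As a rough illustration of what does belong in preprocessing, the sketch below chains lowercasing, punctuation removal, tokenization, and stopword removal. The function name `preprocess` and the tiny `STOPWORDS` set are hypothetical, chosen only for this example; real pipelines usually rely on a library-provided stopword list.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"is", "to", "the", "a", "and"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("Hello, world! Welcome to AI."))
# ['hello', 'world', 'welcome', 'ai']
```

Note that nothing here fits or trains a model; the function only turns raw text into a cleaned token list that a later step could vectorize.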
❓ Metrics
Advanced (time limit 2:00)
What is the vocabulary size after preprocessing?
Given the following list of sentences, after converting all text to lowercase, removing punctuation, and tokenizing, what is the size of the vocabulary (unique words)?
Sentences:
1. "AI is fun!"
2. "Fun with AI and machine learning."
3. "Learning AI is exciting."
NLP
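import string

sentences = ["AI is fun!", "Fun with AI and machine learning.", "Learning AI is exciting."]
vocab = set()
for sent in sentences:
    # Normalize case so "Fun" and "fun" count as the same word.
    sent = sent.lower()
    # Delete every ASCII punctuation character.
    sent = sent.translate(str.maketrans('', '', string.punctuation))
    # Tokenize on whitespace and add the tokens to the vocabulary.
    tokens = sent.split()
    vocab.update(tokens)
print(len(vocab))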
💡 Hint
Count unique words after cleaning and splitting.
✓ Explanation
After cleaning and tokenizing, the unique words are 'ai', 'is', 'fun', 'with', 'and', 'machine', 'learning', and 'exciting', for a total of 8.
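The count depends on each cleaning step. As a small sketch, the helper below (the name `build_vocab` and its `lowercase` flag are hypothetical) rebuilds the vocabulary with and without lowercasing, to show how skipping that step inflates the result.

```python
import string

sentences = [
    "AI is fun!",
    "Fun with AI and machine learning.",
    "Learning AI is exciting.",
]

def build_vocab(sentences, lowercase=True):
    """Return the set of unique whitespace tokens after removing ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    vocab = set()
    for sent in sentences:
        if lowercase:
            sent = sent.lower()
        vocab.update(sent.translate(table).split())
    return vocab

print(sorted(build_vocab(sentences)))
# ['ai', 'and', 'exciting', 'fun', 'is', 'learning', 'machine', 'with']
print(len(build_vocab(sentences)))                   # 8
print(len(build_vocab(sentences, lowercase=False)))  # 10
```

Without lowercasing, 'fun'/'Fun' and 'learning'/'Learning' are counted as separate entries, which accounts for the two extra words.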
🔧 Debug
Advanced (time limit 1:30)
Why does this text normalization code raise an error?
Examine the code below that attempts to normalize text by removing digits and converting to lowercase. Why does it raise an error?
NLP
import re

text = "AI version 2.0 is here!"
# Remove every run of digits, then lowercase the result.
normalized = re.sub(r'[0-9]+', '', text).lower()
print(normalized)
💡 Hint
Check the return type of re.sub and method chaining.
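✓ Explanation
re.sub returns a string, so calling lower() on it is valid. The code runs without error and prints "ai version . is here!". Only the digits are removed, so the period from "2.0" is left behind.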
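The following sketch runs the challenge's pattern next to one possible variant that also removes dotted version numbers. The second pattern and the whitespace cleanup are assumptions added for illustration, not part of the original code.

```python
import re

text = "AI version 2.0 is here!"

# The challenge's pattern removes digit runs only, so the "." of "2.0" survives.
digits_only = re.sub(r"[0-9]+", "", text).lower()

# Assumed variant: also consume dotted groups such as "2.0" or "1.2.3",
# then collapse the leftover double space.
no_versions = re.sub(r"\d+(?:\.\d+)*", "", text).lower()
no_versions = " ".join(no_versions.split())

print(digits_only)  # ai version . is here!
print(no_versions)  # ai version is here!
```

Whether version numbers should be dropped at all depends on the task; for some corpora they carry useful signal and are better replaced with a placeholder token.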
❓ Model Choice
Expert (time limit 2:00)
Which model is best suited for text classification after preprocessing?
After completing text preprocessing (tokenization, stopword removal, and vectorization), which model below is generally best for classifying short text documents?
💡 Hint
Consider models that handle sequences and context well.
✓ Explanation
RNNs and LSTMs are designed to process sequential data such as text, so they can use word order and context, which makes them well suited to text classification.
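To see why word order can matter, consider this small sketch. It is not a model; it only shows that a bag-of-words representation cannot distinguish two sentences with the same words in a different order, whereas a sequence representation can. The example sentences are made up for illustration.

```python
from collections import Counter

a = "dog bites man".split()
b = "man bites dog".split()

# Bag-of-words view: only token counts survive, so order is lost.
same_bag = Counter(a) == Counter(b)

# Sequence view: token order is preserved, so the two inputs differ.
same_sequence = a == b

print(same_bag)       # True
print(same_sequence)  # False
```

A sequence model such as an RNN or LSTM consumes the tokens in order, so it can in principle tell these two inputs apart, while a classifier fed only word counts sees them as identical.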