Challenge - 5 Problems
Document Pipeline Master
🧠 Conceptual
intermediate
Key step order in a document processing pipeline
Which of the following sequences correctly represents the typical order of steps in a document processing pipeline?
💡 Hint
Think about how raw text is first broken down before cleaning and then converted to numbers.
The typical pipeline starts by splitting text into tokens (words), then removing common words (stopwords), followed by reducing words to their base form (lemmatization), and finally converting text into numerical features.
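The four steps can be sketched end to end without any external libraries. This is a minimal illustration, not a production pipeline: the `STOPWORDS` set, the `LEMMAS` lookup table, and all four function names are hypothetical stand-ins for what a real NLP library would provide.

```python
from collections import Counter

# Toy stand-ins (assumptions): a real pipeline would use a full stopword
# list and a proper lemmatizer from an NLP library.
STOPWORDS = {"the", "a", "an", "is", "are", "over", "on"}
LEMMAS = {"cats": "cat", "dogs": "dog", "sleeping": "sleep", "jumps": "jump"}

def tokenize(text):
    """Step 1: split raw text into lowercase word tokens."""
    return text.lower().split()

def remove_stopwords(tokens):
    """Step 2: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]

def lemmatize(tokens):
    """Step 3: map each token to its base form via the toy lookup table."""
    return [LEMMAS.get(t, t) for t in tokens]

def vectorize(tokens, vocabulary):
    """Step 4: turn tokens into a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

text = "The cats are sleeping on the dogs"
tokens = lemmatize(remove_stopwords(tokenize(text)))
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)

print(tokens)      # ['cat', 'sleep', 'dog']
print(vocabulary)  # ['cat', 'dog', 'sleep']
print(vector)      # [1, 1, 1]
```

Each function consumes the output of the previous one, which is why the order matters: lemmatizing before tokenizing, for example, would have nothing to operate on.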
❓ Predict Output
intermediate
Output of tokenizing and removing stopwords
What is the output of this Python code snippet?
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog"
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)
💡 Hint
Stopwords like 'the' and 'over' are removed.
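The code tokenizes the sentence, then removes common English stopwords such as 'the' and 'over'. The comparison uses w.lower(), so the capitalized 'The' is removed as well. The remaining words are printed as a list: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'].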
❓ Hyperparameter
advanced
Choosing n-gram range for feature extraction
In a document processing pipeline using TF-IDF vectorization, which n-gram range setting is best to capture both single words and pairs of words?
💡 Hint
You want to include both single words and two-word phrases.
Setting ngram_range=(1,2) includes unigrams (single words) and bigrams (pairs of words), which helps capture more context.
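To make the effect of the range concrete, here is a dependency-free sketch that only enumerates the word n-grams a (1, 2) range would cover. The helper `word_ngrams` is a hypothetical name for illustration, and it performs no TF-IDF weighting; a real vectorizer would go on to score each of these terms.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with n between min_n and max_n inclusive."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = ["machine", "learning", "is", "powerful"]

print(word_ngrams(tokens, (1, 1)))
# ['machine', 'learning', 'is', 'powerful']

print(word_ngrams(tokens, (1, 2)))
# ['machine', 'learning', 'is', 'powerful',
#  'machine learning', 'learning is', 'is powerful']
```

With (1, 1) the phrase "machine learning" never appears as a feature of its own; with (1, 2) it does, at the cost of a larger vocabulary.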
❓ Metrics
advanced
Evaluating document classification with imbalanced classes
Which metric is most appropriate to evaluate a document classification model when classes are imbalanced?
💡 Hint
Consider a metric that balances precision and recall.
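The F1 score is the harmonic mean of precision and recall, so it stays low unless both are reasonably high. That makes it suitable for imbalanced datasets, where accuracy can be misleading: a model that almost always predicts the majority class still scores high accuracy.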
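A small worked example shows the gap between the two metrics. The data below is invented purely for illustration (5 positive documents out of 100, with a model that finds only one of them), and `classification_metrics` is a hypothetical helper written from the standard definitions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative data: 5 positives, 95 negatives; the model catches 1 positive.
y_true = [1] * 5 + [0] * 95
y_pred = [1] + [0] * 99

metrics = classification_metrics(y_true, y_pred)
print(round(metrics["accuracy"], 3))   # 0.96
print(round(metrics["precision"], 3))  # 1.0
print(round(metrics["recall"], 3))     # 0.2
print(round(metrics["f1"], 3))         # 0.333
```

Accuracy reports 96% even though the model misses four of the five positive documents; the F1 score of about 0.33 reflects that failure.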
🔧 Debug
expert
Identifying error in document vectorization code
What error does this code raise when run, and why?
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Data science is fun", "Machine learning is powerful"]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
print(X.toarray())
print(vectorizer.get_feature_names_out())
💡 Hint
Check the latest method name for getting feature names in sklearn.
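The code as shown calls get_feature_names_out(), the current method, so it runs without error on scikit-learn 1.0 and later. The AttributeError this question targets comes from the old get_feature_names() method, which was deprecated in scikit-learn 1.0 and removed in 1.2. Conversely, releases older than 1.0 lack get_feature_names_out() and raise AttributeError on the code above.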
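When code has to run against both old and new releases, one option is to check which accessor exists before calling it. The sketch below assumes that approach; `feature_names` is a hypothetical helper, and the two classes are stand-ins that mimic the old and new method names so the example runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return feature names using whichever accessor the object provides."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Fallback for older releases that only expose the legacy method.
    return list(vectorizer.get_feature_names())

# Hypothetical stand-ins, not real scikit-learn classes.
class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ["data", "fun", "science"]

class OldStyleVectorizer:
    def get_feature_names(self):
        return ["data", "fun", "science"]

print(feature_names(NewStyleVectorizer()))  # ['data', 'fun', 'science']
print(feature_names(OldStyleVectorizer()))  # ['data', 'fun', 'science']

# Calling the missing method directly is what triggers the error.
try:
    NewStyleVectorizer().get_feature_names()
except AttributeError as exc:
    print(type(exc).__name__)  # AttributeError
```

Pinning a scikit-learn version in the project's requirements is usually the simpler fix; the helper is only worthwhile when several versions must be supported at once.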