Document processing pipeline in NLP - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Key step order in a document processing pipeline
Which of the following sequences correctly represents the typical order of steps in a document processing pipeline?
A. Lemmatization → Tokenization → Feature Extraction → Stopword Removal
B. Tokenization → Stopword Removal → Lemmatization → Feature Extraction
C. Stopword Removal → Feature Extraction → Tokenization → Lemmatization
D. Feature Extraction → Tokenization → Lemmatization → Stopword Removal
💡 Hint
Think about how raw text is first broken down before cleaning and then converted to numbers.
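To make the hint concrete, here is a minimal sketch of a pipeline that breaks text down, cleans it, and only then converts it to numbers. The stopword set, the lemma lookup table, and every function name below are hypothetical stand-ins chosen for illustration; a real pipeline would typically rely on a library such as NLTK or spaCy for these steps.

```python
from collections import Counter

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "over", "on"}
LEMMAS = {"jumps": "jump", "dogs": "dog", "foxes": "fox"}


def tokenize(text):
    """Split raw text into lowercase tokens on whitespace."""
    return text.lower().split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the toy stopword set."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to a base form using the toy lookup table."""
    return [LEMMAS.get(t, t) for t in tokens]


def extract_features(tokens):
    """Convert the cleaned tokens into bag-of-words counts."""
    return Counter(tokens)


def process(text):
    """Run the stages in sequence: break down, clean, normalize, count."""
    tokens = tokenize(text)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize(tokens)
    return extract_features(tokens)


features = process("The fox jumps over the lazy dogs")
print(features)
```

Each stage consumes the output of the previous one, which is why counting features before the text has been split into tokens would not work in this sketch.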
Predict Output
intermediate
Output of tokenizing and removing stopwords
What is the output of this Python code snippet?
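# One-time setup (assumed): nltk.download('punkt') and nltk.download('stopwords');
# newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
import nltk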
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog"
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)
A. ['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
B. ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
C. ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
D. ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
💡 Hint
Stopwords like 'the' and 'over' are removed.
Hyperparameter
advanced
Choosing n-gram range for feature extraction
In a document processing pipeline using TF-IDF vectorization, which n-gram range setting is best to capture both single words and pairs of words?
A. ngram_range=(1,1)
B. ngram_range=(2,2)
C. ngram_range=(1,2)
D. ngram_range=(2,3)
💡 Hint
You want to include both single words and two-word phrases.
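To see what an n-gram range means in practice, the sketch below enumerates word n-grams for an inclusive `(min_n, max_n)` pair, which is how scikit-learn interprets `ngram_range`. The `ngrams` helper is hypothetical and written in plain Python for illustration; it is not the vectorizer's internal implementation.

```python
def ngrams(tokens, ngram_range):
    """Return every n-gram with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = ["machine", "learning", "rocks"]

unigrams_only = ngrams(tokens, (1, 1))
bigrams_only = ngrams(tokens, (2, 2))
both = ngrams(tokens, (1, 2))

print(unigrams_only)
print(bigrams_only)
print(both)
```

Comparing the three lists shows how widening the range adds longer phrases on top of the single words instead of replacing them.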
Metrics
advanced
Evaluating document classification with imbalanced classes
Which metric is most appropriate to evaluate a document classification model when classes are imbalanced?
A. F1 Score
B. Precision
C. Accuracy
D. Mean Squared Error
💡 Hint
Consider a metric that balances precision and recall.
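A small worked example can show why the choice of metric matters on skewed data. The labels below are made up for illustration (95 negative documents, 5 positive), and `precision_recall_f1` is a hypothetical helper written from the standard definitions, with F1 as the harmonic mean of precision and recall.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy setup the always-negative model scores high on accuracy while finding none of the positive documents, and the positive-class precision, recall, and F1 all reflect that failure.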
🔧 Debug
expert
Identifying error in document vectorization code
What happens when this code runs (assume scikit-learn 1.0 or later), and why?
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Data science is fun", "Machine learning is powerful"]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
print(X.toarray())
print(vectorizer.get_feature_names_out())
A. AttributeError: 'TfidfVectorizer' object has no attribute 'get_feature_names' because the method was renamed
B. TypeError: stop_words parameter must be a list, not a string
C. ValueError: Input documents must be non-empty strings
D. No error, prints the TF-IDF matrix and feature names
💡 Hint
Check the latest method name for getting feature names in sklearn.