Challenge - 5 Problems
Sentiment Analysis Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Model Choice
Difficulty: intermediate
Choosing the best model for sentiment analysis
You want to build a sentiment analysis model using scikit-learn on a dataset of movie reviews labeled positive or negative. Which model is most suitable for this binary text classification task?
💡 Hint
Think about models designed for classification, not clustering or regression.
✅ Explanation
LinearSVC is a linear classifier suitable for binary classification tasks like sentiment analysis. KMeans and DBSCAN are clustering algorithms, not classifiers. LinearRegression is for continuous output, not categories.
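To make this concrete, here is a minimal sketch (with made-up toy reviews, not a real dataset) showing LinearSVC used as a binary sentiment classifier behind a CountVectorizer:

```python
# Hypothetical toy example: LinearSVC for binary sentiment classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great movie, loved it", "terrible plot, boring",
         "wonderful acting", "awful and dull"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Pipeline: turn text into count features, then fit a linear SVM classifier
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["loved the wonderful acting"]))
```

The same pipeline shape works with any of scikit-learn's linear classifiers; clustering estimators like KMeans would ignore the labels entirely, which is why they are unsuitable here.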
❓ Predict Output
Difficulty: intermediate
Output of text vectorization step
What is the shape of the feature matrix X after applying CountVectorizer to 1000 text reviews, if the vocabulary size is 5000?
Python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['sample text data'] * 1000
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)
print(X.shape)
# Note: this toy corpus contains only 3 distinct words, so the snippet as
# written prints (1000, 3). With a real review corpus whose vocabulary has
# at least 5000 words, the shape would be (1000, 5000) as stated above.
💡 Hint
Rows represent samples, columns represent features.
✅ Explanation
CountVectorizer transforms text into a matrix where each row is a sample and each column is a word feature. With 1000 samples and 5000 features, shape is (1000, 5000).
❓ Hyperparameter
Difficulty: advanced
Choosing the right hyperparameter for LogisticRegression
You train a LogisticRegression model for sentiment analysis. Which hyperparameter controls the strength of regularization to prevent overfitting?
💡 Hint
Look for the parameter that adjusts how much the model avoids complexity.
✅ Explanation
The 'C' parameter in LogisticRegression is the inverse of regularization strength. Smaller values specify stronger regularization, helping to reduce overfitting.
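A quick sketch (on assumed synthetic data, not sentiment text) makes the inverse relationship visible: shrinking C shrinks the learned coefficients.

```python
# Sketch: smaller C = stronger L2 regularization = smaller coefficients
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # labels depend on 2 features

strong = LogisticRegression(C=0.01).fit(X, y)   # strong regularization
weak = LogisticRegression(C=100.0).fit(X, y)    # weak regularization

# Total coefficient magnitude is smaller under the stronger penalty
print(np.abs(strong.coef_).sum(), np.abs(weak.coef_).sum())
```

In practice C is usually tuned with cross-validation (e.g. GridSearchCV over a log-spaced range) rather than set by hand.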
❓ Metrics
Difficulty: advanced
Interpreting classification report metrics
After training a sentiment classifier, you get these metrics for the positive class: precision=0.8, recall=0.5, f1-score=0.62. What does the low recall indicate?
💡 Hint
Recall measures how many actual positives are found.
✅ Explanation
Recall is the fraction of true positives found out of all actual positives. Low recall means many positive samples are missed (false negatives).
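The metrics in the question can be reproduced with a small made-up label set: 8 actual positives of which the model finds only 4 (recall = 4/8 = 0.5), with 1 false positive (precision = 4/5 = 0.8).

```python
# Sketch with assumed toy labels matching the question's metrics
from sklearn.metrics import precision_score, recall_score

y_true = [1] * 8 + [0] * 8   # 8 actual positives, 8 actual negatives
y_pred = [1, 1, 1, 1, 0, 0, 0, 0,   # 4 true positives, 4 false negatives
          1, 0, 0, 0, 0, 0, 0, 0]   # 1 false positive, 7 true negatives

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 4/5 = 0.8
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 4/8 = 0.5
```

The four false negatives are exactly the "missed" positive reviews that the low recall reports.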
🔧 Debug
Difficulty: expert
Debugging unexpected accuracy drop after vectorizer change
You trained a sentiment model with CountVectorizer and got 85% accuracy. After switching to TfidfVectorizer with default settings, accuracy dropped to 70%. What is the most likely cause?
💡 Hint
Think about how TF-IDF changes word importance compared to raw counts.
✅ Explanation
TfidfVectorizer scales down frequent words, which can reduce the weight of common sentiment words that were important in CountVectorizer, causing accuracy drop.