You want to build a spam detection system that classifies emails as spam or not spam. Which model is best suited for this binary text classification task?
Think about models that work well with text features and binary classification.
SVM with TF-IDF features is a strong, simple model for text classification. K-Means is unsupervised and not for classification. CNNs for images won't work well on raw text. PCA is for reducing features, not classification.
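The recommended combination can be sketched as a scikit-learn pipeline. The toy emails and labels below are made up for illustration; a real system would train on a large labeled corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (1 = spam, 0 = not spam); purely illustrative.
texts = [
    "Win free money now",
    "Meeting at noon tomorrow",
    "Claim your free prize",
    "Lunch with the team",
]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a linear SVM -- the pairing the explanation recommends.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["free prize money"]))
```

A linear kernel is the usual choice for TF-IDF vectors, since the feature space is already high-dimensional and sparse.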
Given this Python code that preprocesses email text, what is the output?
import re

text = "Hello!!! This is spam??? Visit http://spam.com now."
cleaned = re.sub(r'http\S+', '', text)
cleaned = re.sub(r'[^a-zA-Z ]', '', cleaned).lower().split()
print(cleaned)
Look at how URLs and punctuation are removed, and text is lowercased and split.
The first regex removes the URL, the second strips all non-letter characters, then the text is lowercased and split into words. The output is ['hello', 'this', 'is', 'spam', 'visit', 'now'].
In a spam detection pipeline, you use a TF-IDF vectorizer. Which max_features value is best to balance performance and speed on a large email dataset?
Too few features may miss important words; too many may slow training.
10 features is too small to capture text variety. 1,000,000 is too large and slow. None uses all features, which can be very large. 10,000 is a good balance.
Your spam detection model has these results on test data: 90% accuracy, 70% precision, 95% recall. What does this mean?
Recall is about catching spam; precision is about avoiding false alarms.
High recall (95%) means most spam is caught. Lower precision (70%) means some non-spam is wrongly flagged. Accuracy is high but less informative if classes are imbalanced.
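The trade-off can be reproduced with a small made-up confusion matrix (not the exact 90/70/95 figures from the question): catching most spam while wrongly flagging some legitimate mail yields high recall but lower precision.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test set (1 = spam): 10 spam and 10 legitimate emails.
y_true = [1] * 10 + [0] * 10
# The classifier catches 9 of 10 spam but also flags 4 legitimate emails.
y_pred = [1] * 9 + [0] + [1] * 4 + [0] * 6

print(recall_score(y_true, y_pred))     # 9/10 = 0.9  -> most spam caught
print(precision_score(y_true, y_pred))  # 9/13 ~ 0.69 -> some false alarms
print(accuracy_score(y_true, y_pred))   # 15/20 = 0.75
```

With heavily imbalanced classes (far more ham than spam), accuracy alone can look high even when the classifier misses most spam, which is why precision and recall are reported separately.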
Why does this spam detection training code raise a ValueError?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["Free money now", "Hello friend", "Win a prize"]
labels = [1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(X, labels)
Check if the number of labels matches the number of input samples.
The labels list has 2 items but texts has 3. The model expects one label per text sample.
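The fix is to supply one label per text. The third label below (1 for "Win a prize") is an assumption for illustration, since the original snippet omits it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["Free money now", "Hello friend", "Win a prize"]
labels = [1, 0, 1]  # one label per sample; the third value is assumed

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # X.shape[0] == len(labels) == 3

model = LogisticRegression()
model.fit(X, labels)  # no ValueError now that the counts match

print(model.predict(vectorizer.transform(["Free prize"])))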