Support Vector Machines (SVMs) are widely used for text classification. How must text data be prepared before an SVM can be trained on it?
Think about how computers understand text data for machine learning.
An SVM cannot work with raw text strings. The text must first be converted into numeric feature vectors, typically with methods such as TF-IDF or word embeddings, so the model can process it.
Given the following Python code using sklearn's SVM for text classification, what is the printed output?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = ['I love apples', 'I hate bananas']
labels = [1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = SVC(kernel='linear')
model.fit(X, labels)

new_text = ['I love bananas']
X_new = vectorizer.transform(new_text)
prediction = model.predict(X_new)
print(prediction[0])
Consider how SVM predicts based on learned features and similarity.
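The intended answer is 1, on the reasoning that the model learned to associate 'love' with label 1. Note, however, that 'bananas' is tied just as strongly to label 0, and both words carry equal TF-IDF weight in this two-sample training set. The new text is therefore equally similar to both training examples and lies essentially on the decision boundary, so the printed label comes from how the tie is resolved rather than from 'love' outweighing 'bananas'.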
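To see how close this prediction is, the following sketch (not part of the original quiz) inspects the linear-kernel similarity of the new text to each training text and the model's decision score. The variable names similarities and score are introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = ["I love apples", "I hate bananas"]
labels = [1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = SVC(kernel="linear").fit(X, labels)

X_new = vectorizer.transform(["I love bananas"])

# Dot product (linear kernel) between the new text and each training text.
similarities = (X_new @ X.T).toarray().ravel()
print(similarities)

# Decision score: values near zero mean the sample sits on the boundary.
score = model.decision_function(X_new)[0]
print(score)
```

The two similarities come out equal because the new text shares exactly one equally weighted word with each training sentence. A decision score this close to zero is a sign that the prediction should not be read as a confident one.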
Which kernel is generally best suited for SVM when classifying text data represented by TF-IDF vectors?
Think about the nature of TF-IDF vectors and their dimensionality.
Text data transformed by TF-IDF is usually high-dimensional and sparse, and such data is often linearly separable or close to it. The linear kernel is therefore both efficient and effective in this case.
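A minimal sketch of a linear SVM on TF-IDF features is shown below. It uses LinearSVC, scikit-learn's dedicated linear implementation, in place of SVC(kernel='linear'); the toy reviews and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 1 = positive review, 0 = negative review.
texts = [
    "great movie, loved it",
    "wonderful film with a great cast",
    "terrible movie, hated it",
    "awful film with a boring plot",
]
labels = [1, 1, 0, 0]

# Vectorizer and classifier chained so raw strings can be passed directly.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

# Even four short sentences already produce more features than samples.
n_features = clf.named_steps["tfidfvectorizer"].transform(texts).shape[1]
print(n_features)

prediction = clf.predict(["a great film"])[0]
print(prediction)
```

With more features than samples, a separating hyperplane is easy to find, which is why a nonlinear kernel rarely adds much for this kind of input.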
You trained an SVM classifier on imbalanced text data. Which metric is most reliable to evaluate the model's performance?
Consider what happens when classes are imbalanced.
Accuracy can be misleading on imbalanced data. F1-score combines precision and recall, giving a better sense of model performance on minority classes.
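The gap between the two metrics can be illustrated with made-up labels: a hypothetical classifier that always predicts the majority class on a 95/5 split. The zero_division argument is set only to suppress the warning for the undefined precision.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced ground truth: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A degenerate model that never predicts the minority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(acc)  # high, despite missing every positive
print(f1)   # zero, because no positive is ever found
```

Accuracy rewards the model for the 95 easy negatives, while F1 on the positive class exposes that it is useless for the minority class.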
Examine the code below. Why does it raise an error during training?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

texts = ['good movie', 'bad movie', 'great film']
labels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = SVC(kernel='linear')
model.fit(X, labels)
Check the size of inputs and labels carefully.
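The labels list has only 2 elements, but there are 3 text samples. Because every sample needs exactly one label, this mismatch causes sklearn to raise a ValueError about inconsistent numbers of samples during model training.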
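One way to catch this kind of mismatch is to compare the number of rows in the feature matrix with the number of labels before fitting. The sketch below assumes three texts and two labels, and captures the exception so the message can be inspected; error_message is a name introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

texts = ["good movie", "bad movie", "great film"]
labels = [1, 0]  # deliberately one label short

X = CountVectorizer().fit_transform(texts)

# A quick sanity check: rows in X should equal the number of labels.
n_samples, n_labels = X.shape[0], len(labels)
print(n_samples, n_labels)

try:
    SVC(kernel="linear").fit(X, labels)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Supplying one label per text, for example a third label for 'great film', lets the fit proceed.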