
SVM for text classification in NLP

Introduction
A support vector machine (SVM) sorts text into categories by finding the boundary that separates the classes with the widest possible margin. Typical uses include:
Sorting emails into spam or not spam
Classifying movie reviews as positive or negative
Organizing news articles by topic
Filtering customer feedback into categories
Detecting the language of short text messages
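To make the idea of a separating boundary concrete, here is a minimal sketch of how a linear classifier decides once it has been trained. The word weights and the bias are hypothetical values chosen for illustration; a real SVM learns them from labeled examples.

```python
# Hypothetical weights for a tiny vocabulary; a trained SVM would learn these.
weights = {"love": 1.2, "great": 0.9, "terrible": -1.4, "worst": -1.1}
bias = 0.0


def score(text):
    """Return w . x + b, where x counts how often each known word appears."""
    total = bias
    for word in text.lower().split():
        total += weights.get(word, 0.0)  # unknown words contribute nothing
    return total


def classify(text):
    """Texts on the positive side of the boundary get 1, the rest get 0."""
    return 1 if score(text) >= 0 else 0


for sentence in ["I love this great film", "Worst plot and terrible acting"]:
    print(sentence, "->", classify(sentence))
```

The boundary is the set of points where the score is exactly zero. Training an SVM amounts to choosing the weights so that the training texts sit as far from that boundary as possible on their correct side.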
Syntax
NLP
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Convert text to numbers
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Create SVM model
model = SVC(kernel='linear')

# Train model
model.fit(X_train, train_labels)

# Predict new data
predictions = model.predict(X_test)
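TfidfVectorizer converts each text into a vector of word weights, which is the numeric input the SVM requires. The test texts go through transform rather than fit_transform so that they reuse the vocabulary learned from the training texts.
A 'linear' kernel is a common choice for text because TF-IDF produces a large number of sparse features, and classes in such high-dimensional data can often be separated by a flat boundary.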
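To see what "changing text into numbers" involves, the following is a simplified pure-Python sketch of TF-IDF weighting. It follows the smoothed inverse document frequency and L2 normalization that scikit-learn documents as its defaults, but it is an illustration, not the library's implementation. The helper names `tokenize` and `tfidf` are made up for this example.

```python
import math
import re
from collections import Counter

docs = ["I love this movie", "This film was terrible", "Worst movie ever"]


def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf(documents):
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Document frequency: the number of documents each word appears in
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: rarer words get larger values
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[w] * idf[w] for w in vocab]
        # Scale each document vector to unit length (L2 normalization)
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


vocab, rows = tfidf(docs)
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, "->", [round(v, 2) for v in row])
```

Each document becomes one row with one column per vocabulary word. A word that appears in only one document, such as "love", gets a larger weight than a word shared by several documents, such as "movie".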
Examples
This creates an SVM with a linear kernel, which separates classes with a flat boundary (a straight line when there are only two features).
NLP
model = SVC(kernel='linear')
This removes common English words such as 'the' and 'and' so that the features focus on more informative words.
NLP
vectorizer = TfidfVectorizer(stop_words='english')
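A quick way to check the effect of stop-word removal is to compare the vocabularies that two vectorizers learn from the same texts. This is a small sketch; the two example sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["The plot and the acting were great", "The ending was the worst"]

# Same texts, with and without the built-in English stop-word list
plain = TfidfVectorizer().fit(texts)
filtered = TfidfVectorizer(stop_words="english").fit(texts)

# vocabulary_ maps each learned word to its column index
print("All words:        ", sorted(plain.vocabulary_))
print("Stop words removed:", sorted(filtered.vocabulary_))
```

The filtered vocabulary is smaller because words such as "the" and "and" no longer get a column. Whether that helps depends on the task: for some problems, such as language detection, the common words carry most of the signal.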
This asks the trained model to predict a category for each new text.
NLP
predictions = model.predict(X_test)
Sample Model
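This program trains an SVM to tell positive and negative movie reviews apart using a few example sentences. It then predicts labels for new sentences and reports the accuracy. With only five training sentences and two test sentences, the accuracy figure is illustrative only.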
NLP
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Sample text data
train_texts = [
    'I love this movie',
    'This film was terrible',
    'Amazing story and great acting',
    'Worst movie ever',
    'I enjoyed the film a lot'
]
train_labels = [1, 0, 1, 0, 1]  # 1=positive, 0=negative

test_texts = [
    'I hate this movie',
    'What a fantastic film'
]

def main():
    # Convert text to numbers
    vectorizer = TfidfVectorizer(stop_words='english')
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    # Create and train SVM model
    model = SVC(kernel='linear')
    model.fit(X_train, train_labels)

    # Predict test data
    predictions = model.predict(X_test)

    # Show predictions
    print('Predictions:', predictions.tolist())

    # For demonstration, assume true labels for test
    true_labels = [0, 1]
    accuracy = accuracy_score(true_labels, predictions)
    print(f'Accuracy: {accuracy:.2f}')

if __name__ == '__main__':
    main()
Important Notes
SVM works well with many features, which is common in text data after vectorization.
Choosing the right text vectorizer (like TF-IDF) helps the SVM focus on important words.
A linear kernel is usually enough for text classification and trains faster than non-linear kernels.
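As a variation on the notes above, the sketch below swaps SVC(kernel='linear') for scikit-learn's LinearSVC and wraps the vectorizer and model in a pipeline. This is an alternative, not the method used in the sample program: LinearSVC uses a different solver and a slightly different loss by default, so its predictions can differ, but it generally scales better to large text collections. The `new_texts` sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "I love this movie",
    "This film was terrible",
    "Amazing story and great acting",
    "Worst movie ever",
    "I enjoyed the film a lot",
]
train_labels = [1, 0, 1, 0, 1]  # 1=positive, 0=negative

# The pipeline vectorizes and classifies in one object, so raw strings
# can be passed to fit() and predict() directly.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
classifier.fit(train_texts, train_labels)

new_texts = ["Great acting and a great story", "Terrible, the worst film"]
predictions = classifier.predict(new_texts)
print("Predictions:", predictions.tolist())
```

Bundling the steps this way also guards against a common mistake: calling fit_transform on new texts, which would build a different vocabulary from the one the model was trained on.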
Summary
SVM finds the boundary that separates text categories with the widest margin.
Text must be changed into numbers before using SVM.
TF-IDF vectorizer and linear kernel are common choices for text classification.