Text classification helps organize documents by putting them into groups based on their content. This makes it easier to find, sort, and understand large amounts of text.
How Text Classification Categorizes Documents in NLP
Introduction

Common uses of text classification include:
Sorting emails into spam or inbox folders automatically.
Tagging news articles by topic like sports, politics, or entertainment.
Filtering customer reviews as positive or negative to understand feedback.
Organizing support tickets by issue type to speed up responses.
Detecting language or sentiment in social media posts.
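The spam-filtering case above can be sketched with scikit-learn; the toy messages and labels below are made up for illustration, and a real filter would need far more training data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A few hypothetical labeled messages
messages = ['win a free prize now', 'meeting at noon tomorrow',
            'claim your free reward', 'lunch with the team today']
labels = ['spam', 'inbox', 'spam', 'inbox']

# Convert each message to word counts, then train a Naive Bayes classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen message
new_message = ['free prize waiting for you']
print(model.predict(vectorizer.transform(new_message)))
```

Because "free" and "prize" appear only in the spam examples, the new message is routed to the spam folder.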
Syntax
model.fit(X_train, y_train)
predictions = model.predict(X_test)
fit trains the model using labeled text data.
predict assigns categories to new, unseen text.
Examples
This example trains a simple model to classify text as positive or negative, then predicts the label for a new sentence.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['I love cats', 'I hate rain']
labels = ['positive', 'negative']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

new_text = ['I love rain']
X_new = vectorizer.transform(new_text)
prediction = model.predict(X_new)
print(prediction)
This example uses a pipeline to combine text vectorization and classification in one step.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['Sports are fun', 'Politics is complex']
labels = ['sports', 'politics']

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(['I like sports']))
Sample Model
This program trains a text classifier to categorize news articles into baseball or space topics. It shows how well the model works by printing accuracy and some predictions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load a small subset of news articles
categories = ['rec.sport.baseball', 'sci.space']
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               remove=('headers', 'footers', 'quotes'))

# Convert text to numbers
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

# Train a simple classifier
model = MultinomialNB()
model.fit(X_train, data_train.target)

# Predict categories for test data
predictions = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(data_test.target, predictions)
print(f"Accuracy: {accuracy:.2f}")
print(f"First 5 predictions: {predictions[:5]}")
Important Notes
Text classification needs labeled examples to learn from.
A good text representation (such as TF-IDF) captures which words matter most in each document, helping the model distinguish categories.
Accuracy measures how often the model predicts the correct category.
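To see what a TF-IDF representation looks like, the sketch below fits scikit-learn's TfidfVectorizer on two short made-up documents and prints each word's weight in each document; words shared by both documents get lower weights than words unique to one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['cats chase mice', 'dogs chase cats']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Each document becomes a vector of word weights; 'chase' and 'cats'
# appear in both documents, so they score lower than 'mice' or 'dogs'
for word, col in sorted(vectorizer.vocabulary_.items()):
    print(f"{word}: doc0={X[0, col]:.3f}, doc1={X[1, col]:.3f}")
```

This is why TF-IDF often works better than raw word counts: it downweights common words and emphasizes the distinctive ones.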
Summary
Text classification groups documents by their content automatically.
It helps organize and find information quickly.
Simple models can learn from examples and predict new text categories.