We use an NLP pipeline to turn text into useful information step-by-step. It helps computers understand human language.
First NLP pipeline
Introduction
You want to find the main topics in customer reviews.
You need to check if emails are spam or not.
You want to translate sentences from one language to another.
You want to find names of people or places in news articles.
You want to summarize long documents into short points.
Syntax
NLP
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
The pipeline is a list of steps, each a (name, tool) pair. Every step except the last transforms the data; the last step is the model that makes predictions.
Text data flows through each step in order.
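To make the flow through the steps concrete, here is a minimal sketch that fits the pipeline and then repeats the same two steps by hand. The toy texts and labels are invented for this illustration and are not part of the original lesson.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, made up for this sketch
texts = ['good movie', 'bad movie', 'good story', 'bad acting']
labels = [1, 0, 1, 0]  # 1=positive, 0=negative

# The pipeline runs the vectorizer first, then the classifier
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
pipeline.fit(texts, labels)

# The same two steps done by hand, in the same order
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)   # text -> word counts
classifier = MultinomialNB()
classifier.fit(counts, labels)             # word counts -> model

new_texts = ['good acting']
pipeline_pred = pipeline.predict(new_texts)
manual_pred = classifier.predict(vectorizer.transform(new_texts))

# Step names come from the first item of each (name, tool) pair
step_names = list(pipeline.named_steps)
print(step_names)
print(pipeline_pred, manual_pred)
```

Both approaches should give the same prediction. The pipeline simply saves you from calling each step yourself and from accidentally fitting the vectorizer on test data.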
Examples
This pipeline removes common English words (stop words such as "the" and "is") before classifying.
NLP
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])
This pipeline uses single words and pairs of words to understand text better.
NLP
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])
Sample Model
This program creates a simple NLP pipeline that turns text into numbers and then classifies whether the text is positive or negative. It trains on some examples, tests on the rest, and then prints the predictions and the accuracy.
NLP
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample text data and labels
texts = [
    'I love this movie',
    'This film was terrible',
    'Amazing acting and story',
    'I did not like the film',
    'Best movie ever',
    'Worst movie I have seen'
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

# Create the NLP pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on test data
predictions = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

print(f'Predictions: {predictions}')
print(f'Accuracy: {accuracy:.2f}')
Important Notes
Always split your data into training and test sets so you can check how well the model works on text it has not seen.
CountVectorizer turns each text into a row of word counts that the model can work with.
MultinomialNB is a simple, fast Naive Bayes classifier that works well with word counts.
Summary
An NLP pipeline processes text step-by-step to make predictions.
Use vectorizers to convert text into numbers.
Train and test your pipeline to see how well it works.