Text preprocessing helps computers work with text by breaking it into pieces (tokens) and simplifying words to their base forms.
Text preprocessing (tokenization, stemming, lemmatization) in ML Python
Introduction
When you want to prepare text data for a chatbot to understand user messages.
When analyzing customer reviews to find common opinions or feelings.
When building a search engine that finds documents based on keywords.
When cleaning up text data before training a machine learning model.
When summarizing large articles by focusing on important words.
Syntax
ML Python
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Tokenization
words = word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
Tokenization splits text into words or pieces.
Stemming cuts words to their root form, which may not be a real word.
Lemmatization maps words to their dictionary base form, which is always a real word.
Examples
This splits the sentence into individual words and punctuation.
ML Python
text = "Cats are running faster than dogs."
words = word_tokenize(text)
print(words)
This reduces words to their stem forms, like 'running' to 'run'.
ML Python
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in words]
print(stemmed)
This converts words to their dictionary form (lemma). By default, WordNetLemmatizer treats every word as a noun, so 'running' stays 'running'; passing the part of speech, such as pos='v', turns 'running' into 'run'.
ML Python
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in words]
print(lemmas)
Sample Model
This program shows how to split text into words, then simplify those words by stemming and lemmatization.
ML Python
import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

text = "The cats are running faster than the dogs."

# Tokenize
words = word_tokenize(text)
print("Tokens:", words)

# Stem
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed:", stemmed_words)

# Lemmatize
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized:", lemmatized_words)
Output
Important Notes
Stemming may produce words that are not real, while lemmatization produces real dictionary words.
Lemmatization is most accurate when given each word's part of speech; without it, WordNetLemmatizer defaults to treating every word as a noun, which still helps in many cases.
Tokenization handles punctuation and splits text into manageable pieces.
Summary
Tokenization breaks text into words or tokens.
Stemming cuts words to their root but may not keep real words.
Lemmatization changes words to their dictionary form for better understanding.