Text preprocessing helps computers work with text by breaking it into pieces (tokens) and simplifying words to their base forms.
Text preprocessing (tokenization, stemming, lemmatization) in ML Python
Introduction
When you want to prepare text data for a chatbot to understand user messages.
When analyzing customer reviews to find common opinions or feelings.
When building a search engine that finds documents based on keywords.
When cleaning up text data before training a machine learning model.
When summarizing large articles by focusing on important words.
Syntax
ML Python
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Tokenization
words = word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
Tokenization splits text into words or pieces.
Stemming cuts words to their root form, which may not be a real word.
Lemmatization maps words to their dictionary base form, which is always a real word.
Examples
This splits the sentence into individual words and punctuation.
ML Python
text = "Cats are running faster than dogs."
words = word_tokenize(text)
print(words)
This reduces words to their stem forms, like 'running' to 'run'.
ML Python
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in words]
print(stemmed)
This converts words to their dictionary form (lemma). By default, WordNetLemmatizer treats every word as a noun, so 'running' stays 'running'; passing the part of speech, such as pos='v', turns 'running' into 'run'.
ML Python
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in words]
print(lemmas)
Sample Model
This program shows how to split text into words, then simplify those words by stemming and lemmatization.
ML Python
import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

text = "The cats are running faster than the dogs."

# Tokenize
words = word_tokenize(text)
print("Tokens:", words)

# Stem
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed:", stemmed_words)

# Lemmatize
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized:", lemmatized_words)
Output
Important Notes
Stemming may produce words that are not real, while lemmatization produces real dictionary words.
Lemmatization is most accurate when given each word's part of speech; without it, WordNetLemmatizer defaults to treating every word as a noun, which still helps in many cases.
Tokenization handles punctuation and splits text into manageable pieces.
Summary
Tokenization breaks text into words or tokens.
Stemming cuts words to their root but may not keep real words.
Lemmatization changes words to their dictionary form for better understanding.