
Tokenization (word and sentence) in NLP

Introduction

Tokenization breaks text into smaller pieces, such as words or sentences, so computers can process language step by step. Common situations where it helps:

When you want to count how many words are in a text message.
When you need to split a paragraph into sentences to analyze each one separately.
When preparing text data for a chatbot to understand user input.
When cleaning text before translating it to another language.
When building a search engine that finds documents by words.
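Before reaching for a library, the core idea behind the first use case (counting words) can be sketched in plain Python. The helper below is a rough, illustrative word counter built on a regular expression; real tokenizers such as NLTK's handle far more edge cases.

```python
import re

def count_words(text):
    # Treat runs of letters, digits, and apostrophes as words.
    # This is only a rough approximation of real word tokenization.
    return len(re.findall(r"[A-Za-z0-9']+", text))

print(count_words("How many words are in this message?"))  # prints 7
```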
Syntax
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Your text here."

words = word_tokenize(text)
sentences = sent_tokenize(text)

Use word_tokenize to split text into words.

Use sent_tokenize to split text into sentences.
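To get a feel for what a sentence tokenizer decides, a naive version can be sketched with a regular expression. This is only an illustration: NLTK's sent_tokenize (backed by the punkt model) is far more robust, correctly handling abbreviations like "Dr.", decimals, and quotes, which this simple rule gets wrong.

```python
import re

def naive_sent_split(text):
    # Split after ., !, or ? when followed by whitespace.
    # Real tokenizers also account for abbreviations and decimals.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(naive_sent_split("Hello world! How are you?"))
# ['Hello world!', 'How are you?']
```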

Examples
This splits the text into words: ['Hello', 'world', '!']
from nltk.tokenize import word_tokenize

text = "Hello world!"
words = word_tokenize(text)
print(words)
This splits the text into sentences: ['Hello world!', 'How are you?']
from nltk.tokenize import sent_tokenize

text = "Hello world! How are you?"
sentences = sent_tokenize(text)
print(sentences)
Sample Model

This program splits the text into words and sentences, then prints both lists.

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Machine learning is fun. It helps computers learn from data!"

words = word_tokenize(text)
sentences = sent_tokenize(text)

print("Words:", words)
print("Sentences:", sentences)
Output

Words: ['Machine', 'learning', 'is', 'fun', '.', 'It', 'helps', 'computers', 'learn', 'from', 'data', '!']
Sentences: ['Machine learning is fun.', 'It helps computers learn from data!']
Important Notes

Tokenization depends on language rules, so results may vary for different languages.

Some punctuation marks are treated as separate tokens.
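This effect can be seen with a small regex sketch that mimics the behavior (word_tokenize itself applies more detailed rules, such as splitting contractions like "don't" into "do" and "n't"):

```python
import re

def simple_tokens(text):
    # Words become one kind of token; each punctuation mark
    # becomes its own separate token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokens("Hello, world!"))
# ['Hello', ',', 'world', '!']
```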

Make sure to install NLTK (pip install nltk) and download the tokenizer models with nltk.download('punkt') before running. Newer NLTK releases require nltk.download('punkt_tab') instead.

Summary

Tokenization splits text into words or sentences.

It is the first step in many language tasks.

Use libraries like NLTK for easy tokenization.
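To see why a proper tokenizer beats naive string splitting, compare Python's built-in split (whitespace only) with a punctuation-aware regex sketch, which is similar in spirit to what word_tokenize does:

```python
import re

text = "Hello world! How are you?"

# str.split leaves punctuation glued to the words.
print(text.split())
# ['Hello', 'world!', 'How', 'are', 'you?']

# A tokenizer separates punctuation into its own tokens.
print(re.findall(r"\w+|[^\w\s]", text))
# ['Hello', 'world', '!', 'How', 'are', 'you', '?']
```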