Introduction
Tokenization breaks text into smaller pieces like words or sentences. This helps computers understand and work with language step by step.
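To make the idea concrete before reaching for a library, here is a minimal sketch in plain Python. The helper names naive_words and naive_sentences are made up for illustration, and the splitting rules are deliberately crude. This is not how NLTK tokenizes.

```python
import re

def naive_words(text):
    """Split on whitespace only; punctuation stays glued to the words."""
    return text.split()

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Hello world! How are you?"
print(naive_words(sample))      # ['Hello', 'world!', 'How', 'are', 'you?']
print(naive_sentences(sample))  # ['Hello world!', 'How are you?']
```

The word list keeps "world!" and "you?" as single pieces, which is one reason a dedicated tokenizer is usually preferred over a plain split.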
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Your text here."
words = word_tokenize(text)
sentences = sent_tokenize(text)
Use word_tokenize to split text into words.
Use sent_tokenize to split text into sentences.
from nltk.tokenize import word_tokenize

text = "Hello world!"
words = word_tokenize(text)
print(words)
from nltk.tokenize import sent_tokenize

text = "Hello world! How are you?"
sentences = sent_tokenize(text)
print(sentences)
The following program splits a short text into words and sentences, then prints both lists.
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Machine learning is fun. It helps computers learn from data!"

# Split the same text at two levels of granularity
words = word_tokenize(text)
sentences = sent_tokenize(text)

print("Words:", words)
print("Sentences:", sentences)
Tokenization depends on language rules, so results vary between languages. Both word_tokenize and sent_tokenize accept a language argument, which defaults to "english".
Punctuation marks are usually treated as separate tokens. For example, word_tokenize("Hello world!") returns ['Hello', 'world', '!'].
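Before running the examples, install NLTK (pip install nltk) and download the tokenizer data with nltk.download('punkt'). Recent NLTK releases look for 'punkt_tab' instead, so run nltk.download('punkt_tab') if you get a LookupError.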
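The notes on punctuation and language rules can be illustrated with a rough regex-based sketch. simple_tokenize is a hypothetical helper, not part of NLTK, and its single rule (a run of word characters, or one punctuation mark) is an assumption chosen for brevity.

```python
import re

def simple_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello world!"))      # ['Hello', 'world', '!']
print(simple_tokenize("The U.S. is big."))  # ['The', 'U', '.', 'S', '.', 'is', 'big', '.']
```

The second call shows why language-specific rules matter: a one-line pattern cannot tell the periods in an abbreviation from the end of a sentence, which is the kind of case library tokenizers are designed to handle.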
Tokenization splits text into words or sentences.
It is the first step in many language tasks.
Use libraries like NLTK for easy tokenization.
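As one illustration of tokenization as a first step, the sketch below counts word frequencies from a list of tokens. It uses a simple regex in place of NLTK so that it runs without any downloads. The function name word_frequencies and the sample sentence are invented for the example.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, extract word tokens, and count each one."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

counts = word_frequencies("The cat sat on the mat.")
print(counts.most_common(1))  # [('the', 2)]
```

Swapping the regex for word_tokenize would be a natural next step once NLTK and its data are installed.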
import nltk
tokens = nltk.word_tokenize('Hello world!')