How to Use word_tokenize from NLTK in NLP
Use word_tokenize from the NLTK library to split text into individual words or tokens. First, import it with from nltk.tokenize import word_tokenize, then call word_tokenize(text) on your string to get a list of word tokens.
Syntax
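The word_tokenize function takes a string and returns a list of tokens. It separates punctuation into its own tokens and splits contractions into their parts (for example, "Let's" becomes "Let" and "'s"). It also accepts optional language and preserve_line arguments, which default to 'english' and False.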
- text: The input string to tokenize.
- Returns: A list of strings, each a word or punctuation token.
python
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
Example
This example shows how to tokenize a simple sentence into words using word_tokenize. It splits words and punctuation correctly.
python
from nltk.tokenize import word_tokenize

text = "Hello, world! Let's learn NLP with NLTK."
tokens = word_tokenize(text)
print(tokens)
Output
['Hello', ',', 'world', '!', 'Let', "'s", 'learn', 'NLP', 'with', 'NLTK', '.']
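A common next step is to normalize the token list before analysis. The sketch below starts from the output shown above and drops punctuation-only tokens, then lowercases the rest. The helper name keep_word_tokens is hypothetical and not part of NLTK, and the "contains at least one letter or digit" rule is only one possible filtering choice.

```python
# Token list copied from the example output above.
tokens = ['Hello', ',', 'world', '!', 'Let', "'s", 'learn', 'NLP', 'with', 'NLTK', '.']


def keep_word_tokens(tokens):
    """Hypothetical helper: keep tokens with at least one letter or digit, lowercased."""
    return [tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok)]


words = keep_word_tokens(tokens)
print(words)
# ['hello', 'world', 'let', "'s", 'learn', 'nlp', 'with', 'nltk']
```

Note that the contraction piece "'s" survives this filter because it contains a letter; whether to keep it depends on the task.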
Common Pitfalls
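Common mistakes include importing word_tokenize incorrectly or forgetting to download the required NLTK data. Run nltk.download('punkt') once to get the tokenizer models; without them, word_tokenize raises a LookupError. Recent NLTK releases (3.9 and later) look for the 'punkt_tab' resource instead, so if the error persists, download the resource named in the LookupError message.
Also, word_tokenize expects a string. Passing another type, such as None or a number, typically raises a TypeError.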
python
import nltk

# Wrong: forgetting to download punkt
# tokens = word_tokenize("Hello world")  # This may raise LookupError

# Correct way:
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "Hello world"
tokens = word_tokenize(text)
print(tokens)
Output
['Hello', 'world']
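Both pitfalls can be guarded against with two small wrappers. This is a minimal sketch, not an NLTK feature: ensure_tokenizer_data and tokenize_checked are hypothetical names, and the resource path "tokenizers/punkt" is assumed from the download call above (your NLTK version may ask for a different resource). The find, download, and tokenizer parameters exist only so the logic can be exercised without NLTK installed; by default they fall back to nltk.data.find, nltk.download, and word_tokenize.

```python
def ensure_tokenizer_data(find=None, download=None,
                          resource="tokenizers/punkt", package="punkt"):
    """Download tokenizer data only if it is missing.

    Returns True if a download was triggered, False if the data was found.
    The resource and package names are assumptions taken from this article.
    """
    if find is None or download is None:
        import nltk  # imported lazily so this module loads without NLTK
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource)  # nltk.data.find raises LookupError when data is absent
        return False
    except LookupError:
        download(package)
        return True


def tokenize_checked(text, tokenizer=None):
    """Reject non-string input with a clear message, then tokenize."""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    if tokenizer is None:
        from nltk.tokenize import word_tokenize  # default tokenizer
        tokenizer = word_tokenize
    return tokenizer(text)
```

With NLTK installed, calling ensure_tokenizer_data() once at startup and then tokenize_checked("Hello world") would follow the "correct way" shown above while skipping the download when the data is already present.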
Quick Reference
Remember these tips when using word_tokenize:
- Import with from nltk.tokenize import word_tokenize
- Download tokenizer data once with nltk.download('punkt')
- Input must be a string
- Output is a list of word and punctuation tokens
Key Takeaways
Import word_tokenize from nltk.tokenize to split text into words.
Run nltk.download('punkt') once before using word_tokenize for the first time.
word_tokenize returns a list of words and punctuation tokens.
Input to word_tokenize must be a string to avoid errors.
Use word_tokenize to prepare text for NLP tasks like analysis or modeling.
