How to Use sent_tokenize from NLTK for Sentence Tokenization
Use sent_tokenize from the NLTK library to split a text into a list of sentences. First, import it with from nltk.tokenize import sent_tokenize, then call sent_tokenize(text), where text is your string. This breaks paragraphs into sentences for easier processing in NLP.
Syntax
The sent_tokenize function takes a string and returns a list of sentences. It uses the pre-trained Punkt model to detect sentence boundaries. An optional language argument (default 'english') selects which language's model is loaded.
- text: The input string containing one or more sentences.
- Returns: A list of sentence strings.
```python
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)
```
Example
This example shows how to split a paragraph into sentences using sent_tokenize. It prints each sentence separately.
```python
from nltk.tokenize import sent_tokenize

text = "Hello there! How are you doing today? This is an example of sentence tokenization."
sentences = sent_tokenize(text)

for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
```
Output
Sentence 1: Hello there!
Sentence 2: How are you doing today?
Sentence 3: This is an example of sentence tokenization.
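Once the text is a list, each sentence can be handled on its own. The following is a minimal sketch, not part of NLTK: it starts from the three sentences printed above (hard-coded here so it runs without the tokenizer data) and counts whitespace-separated words per sentence. The helper name words_per_sentence is hypothetical.

```python
# Sentences as returned by sent_tokenize in the example above.
sentences = [
    "Hello there!",
    "How are you doing today?",
    "This is an example of sentence tokenization.",
]


def words_per_sentence(sentences):
    """Return (sentence, word_count) pairs using plain whitespace splitting."""
    return [(sentence, len(sentence.split())) for sentence in sentences]


for sentence, count in words_per_sentence(sentences):
    print(f"{count} words: {sentence}")
```

Whitespace splitting keeps punctuation attached to words, which is enough for rough length statistics. A proper word tokenizer would be the next step for anything finer.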
Common Pitfalls
Common mistakes include:
- Not importing sent_tokenize correctly.
- Passing non-string inputs like lists or numbers.
- Forgetting to download the required NLTK data package punkt, which sent_tokenize depends on.
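Run nltk.download('punkt') once per environment before the first call to sent_tokenize. Recent NLTK releases look for the punkt_tab resource instead; if the LookupError names punkt_tab, run nltk.download('punkt_tab').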
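Calling nltk.download on every run typically means a round trip to the download server even when the data is already installed. One possible approach, sketched below, checks for the data first and downloads only when it is missing. The helper name ensure_nltk_resource is hypothetical; nltk.data.find and nltk.download are NLTK calls. The default resource path tokenizers/punkt is an assumption that matches the punkt package named above, and both defaults may need changing if your NLTK version asks for a different resource.

```python
def ensure_nltk_resource(resource_path="tokenizers/punkt", package="punkt"):
    """Download `package` only if `resource_path` is not already installed.

    Returns True if a download was triggered, False if the data was found.
    The default arguments are assumptions; adjust them to the resource
    named in the LookupError your NLTK version raises.
    """
    import nltk  # imported lazily so defining the helper does not load NLTK

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return False
    except LookupError:
        nltk.download(package, quiet=True)
        return True


# Typical use: call ensure_nltk_resource() once at program start,
# before the first call to sent_tokenize.
```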
```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # Run once

# Wrong: passing a list instead of a string
# sentences = sent_tokenize(["Hello there!", "How are you?"])

# Correct usage:
text = "Hello there! How are you?"
sentences = sent_tokenize(text)
```
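When the input really is a list of strings, as in the commented-out call above, one option is to tokenize each element and concatenate the results. This is a sketch under assumptions: tokenize_texts and its tokenizer parameter are hypothetical names, not NLTK API. The default falls back to sent_tokenize, and any element that is not a string is rejected with a TypeError.

```python
def tokenize_texts(texts, tokenizer=None):
    """Split one string or an iterable of strings into a flat list of sentences."""
    if tokenizer is None:
        # Default to NLTK; this requires the tokenizer data to be installed.
        from nltk.tokenize import sent_tokenize as tokenizer

    if isinstance(texts, str):
        texts = [texts]  # treat a single string as a one-element batch

    sentences = []
    for item in texts:
        if not isinstance(item, str):
            raise TypeError(f"expected str, got {type(item).__name__}")
        sentences.extend(tokenizer(item))
    return sentences
```

With NLTK and its data installed, tokenize_texts(["Hello there!", "How are you?"]) passes each element through sent_tokenize. The tokenizer parameter exists mainly so a different splitter can be swapped in, for example in unit tests.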
Quick Reference
| Function | Description | Input | Output |
|---|---|---|---|
| sent_tokenize(text) | Splits text into sentences | String (text) | List of sentence strings |
Key Takeaways
Use sent_tokenize to split text into sentences easily in NLP tasks.
Always import sent_tokenize from nltk.tokenize before using it.
Download the 'punkt' package with nltk.download('punkt') once to enable sentence tokenization.
Pass only strings to sent_tokenize; other types such as lists or numbers typically raise a TypeError.
The output is a list of sentences, useful for further text processing.
