NLP/ML · ~20 mins

Tokenization (word and sentence) in NLP - ML Experiment: Train & Evaluate

Experiment - Tokenization (word and sentence)
Problem: You want to split text into words and sentences correctly to prepare data for NLP tasks.
Current Metrics: The current tokenizer splits words and sentences but sometimes merges punctuation with words or misses sentence boundaries.
Issue: Inconsistent tokenization causes errors in downstream tasks such as sentiment analysis and translation.
Your Task
Improve tokenization so that words and sentences are split accurately, with punctuation handled properly.
Use only Python and the NLTK library for tokenization.
Do not use any external paid APIs or services.
Solution
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download required NLTK data files
# (newer NLTK releases ship the Punkt models as 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')

text = "Hello there! How are you doing today? I hope everything's fine. Let's test tokenization."

# Sentence tokenization
sentences = sent_tokenize(text)

# Word tokenization for each sentence
words_per_sentence = [word_tokenize(sentence) for sentence in sentences]

print('Sentences:', sentences)
print('Words per sentence:', words_per_sentence)
Added NLTK's sent_tokenize to split text into sentences accurately.
Used NLTK's word_tokenize to split each sentence into words, handling punctuation correctly.
Downloaded 'punkt' tokenizer models required by NLTK.
Results Interpretation

Before: Text was split incorrectly, merging punctuation with words or missing sentence ends.
After: The text is split into four clear sentences, and punctuation marks become separate word tokens, improving accuracy.

Using specialized tokenization tools like NLTK's sent_tokenize and word_tokenize improves text splitting accuracy, which is crucial for reliable NLP processing.
Bonus Experiment
Try tokenizing the same text using the SpaCy library and compare results with NLTK.
💡 Hint
Use SpaCy's English model: iterate over doc.sents for sentences and over the doc itself for word tokens.
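A minimal sketch of the spaCy side of the comparison. It uses a blank English pipeline with the rule-based sentencizer so no model download is needed; the full en_core_web_sm model (loaded via spacy.load) generally gives better sentence boundaries but must be installed separately:

```python
import spacy

# Blank English pipeline: real spaCy tokenizer, no trained model required.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence-boundary detector

text = "Hello there! How are you doing today? I hope everything's fine. Let's test tokenization."
doc = nlp(text)

sentences = [sent.text for sent in doc.sents]
words = [token.text for token in doc]

print("Sentences:", sentences)
print("Words:", words)
```

Like NLTK's word_tokenize, spaCy's tokenizer splits contractions ("Let's" becomes 'Let' and "'s") and emits punctuation as separate tokens, so the two libraries should agree closely on this text.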