
Tokenization in spaCy in NLP - ML Experiment: Train & Evaluate

Experiment - Tokenization in spaCy
Problem: You want to split sentences into words or tokens using spaCy, but the current tokenization splits contractions incorrectly and emits punctuation as separate tokens.
Current Metrics: Example input: "I'm learning spaCy!" Current tokens: ['I', "'", 'm', 'learning', 'spaCy', '!']
Issue: The tokenizer splits contractions like "I'm" into three tokens ('I', "'", 'm') and treats punctuation as separate tokens, which may not be desired for your application.
Your Task
Adjust spaCy's tokenizer so that contractions like "I'm" are treated as single tokens and punctuation is handled according to your needs.
Use spaCy's built-in tokenizer customization features.
Do not write a tokenizer from scratch.
Keep the solution runnable with spaCy version 3.x or later.
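Before rebuilding the whole tokenizer, it helps to know the lightest-weight customization spaCy offers: `Tokenizer.add_special_case`, which registers an exception for an exact surface string. A minimal sketch (using a blank English pipeline so no trained model needs to be downloaded):

```python
import spacy

# A blank English pipeline is enough for tokenization experiments
nlp = spacy.blank("en")

# Register an exception: the exact string "I'm" becomes one token,
# overriding the default rule that splits it into "I" + "'m"
nlp.tokenizer.add_special_case("I'm", [{"ORTH": "I'm"}])

print([t.text for t in nlp("I'm learning spaCy!")])
# → ["I'm", 'learning', 'spaCy', '!']
```

This covers only the strings you register; the full solution below shows how to go further and rebuild the tokenizer's prefix, suffix, and infix rules.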
Solution
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

# Load the English model
nlp = spacy.load('en_core_web_sm')

# Customize tokenizer to keep contractions as single tokens
# Define special cases for contractions (note: passing `rules` to a new
# Tokenizer replaces ALL of spaCy's default exceptions, so list every form you need)
special_cases = {"I'm": [{"ORTH": "I'm"}], "don't": [{"ORTH": "don't"}], "can't": [{"ORTH": "can't"}]}

# Create a new tokenizer with the special cases
prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex([r'(?<=[0-9])[+\-*/](?=[0-9-])'])  # Simplify infix rules to avoid splitting contractions

def custom_tokenizer(nlp):
    tokenizer = Tokenizer(
        nlp.vocab,
        rules=special_cases,
        prefix_search=prefix_re.search,
        suffix_search=suffix_re.search,
        infix_finditer=infix_re.finditer,
        token_match=None,
        url_match=None
    )
    return tokenizer

nlp.tokenizer = custom_tokenizer(nlp)

# Test the tokenizer
text = "I'm learning spaCy! Don't you like it?"
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
Added special cases for common contractions to keep them as single tokens.
Customized the tokenizer by creating a new Tokenizer instance with these special cases.
Simplified infix regex to avoid splitting contractions at apostrophes.
Replaced the default tokenizer with the customized one in the spaCy pipeline.
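One caveat the test sentence exposes: special-case rules match the exact surface string, so "don't" does not cover the capitalized "Don't". A short sketch registering both capitalizations (again assuming a blank English pipeline for brevity):

```python
import spacy

nlp = spacy.blank("en")

# Special cases are exact-string matches, so each capitalization
# needs its own entry
for form in ["I'm", "don't", "Don't", "can't", "Can't"]:
    nlp.tokenizer.add_special_case(form, [{"ORTH": form}])

print([t.text for t in nlp("Don't you like it? I'm sure.")])
# → ["Don't", 'you', 'like', 'it', '?', "I'm", 'sure', '.']
```

In the full solution above, "Don't" survives as one token only because the apostrophe infix rule was removed; registering explicit special cases makes the behavior independent of the infix patterns.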
Results Interpretation

Before: ['I', "'", 'm', 'learning', 'spaCy', '!']
After: ["I'm", 'learning', 'spaCy', '!']

Customizing spaCy's tokenizer with special cases allows you to control how text is split into tokens, which is important for handling contractions and punctuation correctly.
Bonus Experiment
Try customizing the tokenizer to merge multi-word expressions like 'New York' into single tokens.
💡 Hint
Use spaCy's PhraseMatcher to find the expression and Doc.retokenize() to merge the matched span; tokenizer special cases can't contain whitespace, so they won't work for multi-word expressions.
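A sketch of that approach, matching "New York" after tokenization and merging it into a single token (blank English pipeline assumed):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Match the exact token sequence "New York"
matcher = PhraseMatcher(nlp.vocab)
matcher.add("MWE", [nlp.make_doc("New York")])

doc = nlp("We moved to New York last year.")
with doc.retokenize() as retok:
    for _, start, end in matcher(doc):
        # Merge the matched span into a single token
        retok.merge(doc[start:end])

print([t.text for t in doc])
# → ['We', 'moved', 'to', 'New York', 'last', 'year', '.']
```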