
Tokenization in spaCy in NLP - ML Experiment: Train & Evaluate

Experiment - Tokenization in spaCy
Problem: You want to split sentences into words or tokens using spaCy, but the current tokenization splits contractions incorrectly and emits punctuation as separate tokens.
Current Metrics: Example input: "I'm learning spaCy!" Current tokens: ['I', "'", 'm', 'learning', 'spaCy', '!']
Issue: The tokenizer splits contractions like "I'm" into three tokens ('I', "'", 'm') and treats punctuation as separate tokens, which may not be desired for your application.
Your Task
Adjust spaCy's tokenizer so that contractions like "I'm" are treated as single tokens and punctuation is handled according to your needs.
Use spaCy's built-in tokenizer customization features.
Do not write a tokenizer from scratch.
Keep the solution runnable with spaCy version 3.x or later.
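Before rebuilding the whole tokenizer, it helps to know the lightest-weight customization spaCy offers: `Tokenizer.add_special_case`, which registers an exception for an exact surface string. A minimal sketch (using a blank English pipeline so no trained model needs to be downloaded):

```python
import spacy

# A blank English pipeline is enough for tokenization experiments
nlp = spacy.blank("en")

# Register an exception: the exact string "I'm" becomes one token,
# overriding the default rule that splits it into "I" + "'m"
nlp.tokenizer.add_special_case("I'm", [{"ORTH": "I'm"}])

print([t.text for t in nlp("I'm learning spaCy!")])
# → ["I'm", 'learning', 'spaCy', '!']
```

This covers only the strings you register; the full solution below shows how to go further and rebuild the tokenizer's prefix, suffix, and infix rules.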
Solution
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

# Load the English model
nlp = spacy.load('en_core_web_sm')

# Customize tokenizer to keep contractions as single tokens
# Define special cases for contractions (note: passing `rules` to a new
# Tokenizer replaces ALL of spaCy's default exceptions, so list every form you need)
special_cases = {"I'm": [{"ORTH": "I'm"}], "don't": [{"ORTH": "don't"}], "can't": [{"ORTH": "can't"}]}

# Create a new tokenizer with the special cases
prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex([r'(?<=[0-9])[+\-*/](?=[0-9-])'])  # Simplify infix rules to avoid splitting contractions

def custom_tokenizer(nlp):
    tokenizer = Tokenizer(
        nlp.vocab,
        rules=special_cases,
        prefix_search=prefix_re.search,
        suffix_search=suffix_re.search,
        infix_finditer=infix_re.finditer,
        token_match=None,
        url_match=None
    )
    return tokenizer

nlp.tokenizer = custom_tokenizer(nlp)

# Test the tokenizer
text = "I'm learning spaCy! Don't you like it?"
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
Added special cases for common contractions to keep them as single tokens.
Customized the tokenizer by creating a new Tokenizer instance with these special cases.
Simplified infix regex to avoid splitting contractions at apostrophes.
Replaced the default tokenizer with the customized one in the spaCy pipeline.
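One caveat the test sentence exposes: special-case rules match the exact surface string, so "don't" does not cover the capitalized "Don't". A short sketch registering both capitalizations (again assuming a blank English pipeline for brevity):

```python
import spacy

nlp = spacy.blank("en")

# Special cases are exact-string matches, so each capitalization
# needs its own entry
for form in ["I'm", "don't", "Don't", "can't", "Can't"]:
    nlp.tokenizer.add_special_case(form, [{"ORTH": form}])

print([t.text for t in nlp("Don't you like it? I'm sure.")])
# → ["Don't", 'you', 'like', 'it', '?', "I'm", 'sure', '.']
```

In the full solution above, "Don't" survives as one token only because the apostrophe infix rule was removed; registering explicit special cases makes the behavior independent of the infix patterns.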
Results Interpretation

Before: ['I', "'", 'm', 'learning', 'spaCy', '!']
After: ["I'm", 'learning', 'spaCy', '!']

Customizing spaCy's tokenizer with special cases allows you to control how text is split into tokens, which is important for handling contractions and punctuation correctly.
Bonus Experiment
Try customizing the tokenizer to merge multi-word expressions like 'New York' into single tokens.
💡 Hint
Use spaCy's PhraseMatcher to find the expression and Doc.retokenize() to merge the matched span; tokenizer special cases can't contain whitespace, so they won't work for multi-word expressions.
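A sketch of that approach, matching "New York" after tokenization and merging it into a single token (blank English pipeline assumed):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Match the exact token sequence "New York"
matcher = PhraseMatcher(nlp.vocab)
matcher.add("MWE", [nlp.make_doc("New York")])

doc = nlp("We moved to New York last year.")
with doc.retokenize() as retok:
    for _, start, end in matcher(doc):
        # Merge the matched span into a single token
        retok.merge(doc[start:end])

print([t.text for t in doc])
# → ['We', 'moved', 'to', 'New York', 'last', 'year', '.']
```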