
Tokenization in spaCy

Introduction
Tokenization breaks text into smaller pieces called tokens, like words or punctuation, so computers can understand and work with language.
When you want to split a sentence into words to analyze its meaning.
When preparing text data for machine learning models.
When counting how many words or punctuation marks are in a text.
When you want to find specific words or phrases in a document.
When cleaning text by separating and removing unwanted parts.
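As a quick sketch of the counting and cleaning use cases above (assuming spaCy is installed; a blank English pipeline is enough for pure tokenization, with no model download):

```python
import spacy

# A blank pipeline contains only the tokenizer -- no trained model needed
nlp = spacy.blank("en")

doc = nlp("Hello, world! This is tokenization.")

# Separate word tokens from punctuation tokens
words = [t.text for t in doc if not t.is_punct]
puncts = [t.text for t in doc if t.is_punct]

print(len(words), "words and", len(puncts), "punctuation marks")
```

Here the punctuation is identified with the built-in `is_punct` flag, which works even without a trained model.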
Syntax
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Your text here.')
for token in doc:
    print(token.text)
Load a language model with spacy.load before tokenizing.
The nlp object processes text and returns a Doc with tokens.
Examples
This splits the sentence into tokens, including punctuation.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello, world!')
for token in doc:
    print(token.text)
Collect tokens into a list for easier use later.
# Reuses the nlp object loaded above
doc = nlp('I love AI.')
tokens = [token.text for token in doc]
print(tokens)  # ['I', 'love', 'AI', '.']
Access the first token directly by index.
doc = nlp('SpaCy is great for NLP.')
print(doc[0].text)
Sample Model
This program loads spaCy's English model, tokenizes the given sentence, and prints each token separately.
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')

# Text to tokenize
text = "Hello, spaCy! Let's tokenize this sentence."

# Process the text
doc = nlp(text)

# Print each token on a new line
for token in doc:
    print(token.text)
Output:
Hello
,
spaCy
!
Let
's
tokenize
this
sentence
.
Important Notes
Tokens include words and punctuation; runs of extra whitespace also become tokens, while a single trailing space is stored on the preceding token's whitespace_ attribute rather than as its own token.
spaCy handles contractions like "Let's" by splitting them into "Let" and "'s".
You can access token properties like lemma_, pos_, and is_stop for more analysis.
Summary
Tokenization splits text into smaller pieces called tokens.
spaCy makes tokenization easy with its language models.
Tokens can be accessed one by one or as a list for further processing.