How to Tokenize Text Using spaCy in NLP

To tokenize text with spaCy, load a language pipeline such as en_core_web_sm, then pass your text to it to create a Doc object. Iterating over the Doc yields its tokens, each of which is a word, a punctuation mark, or another symbol.

📐

Syntax

Tokenization happens when a loaded pipeline processes a string: calling nlp(text) runs the tokenizer and returns a Doc object, which is a sequence of Token objects. Each token in the Doc is a word, a punctuation mark, or a symbol.

import spacy: imports the spaCy library.
nlp = spacy.load('en_core_web_sm'): loads the small English pipeline, which must be installed first (for example with python -m spacy download en_core_web_sm).
doc = nlp(text): processes the text and returns a Doc object.
for token in doc:: iterates over tokens in the Doc.

python

import spacy

nlp = spacy.load('en_core_web_sm')
text = "Your text here."
doc = nlp(text)

for token in doc:
    print(token.text)
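
The syntax above assumes the en_core_web_sm package is already installed. As a hedged sketch, the snippet below catches the OSError that spacy.load raises when the package is missing and falls back to spacy.blank("en"), a tokenizer-only English pipeline. The fallback is an assumption made for illustration, not part of the original steps; it tokenizes text but provides no tagging or parsing.

```python
import spacy

try:
    # Preferred: the trained small English pipeline
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model package not installed: fall back to a blank English
    # pipeline, which contains only the rule-based tokenizer
    nlp = spacy.blank("en")

doc = nlp("Your text here.")
print([token.text for token in doc])
```

Either branch produces the same tokens here, because tokenization comes from the English language rules rather than from the trained components.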

💻

Example

This example tokenizes a short sentence and prints each token on its own line. Punctuation marks become separate tokens, and the contraction Let's is split into Let and 's.

python

import spacy

nlp = spacy.load('en_core_web_sm')
text = "Hello, world! Let's tokenize this sentence."
doc = nlp(text)

for token in doc:
    print(token.text)

Output

Hello
,
world
!
Let
's
tokenize
this
sentence
.
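
Beyond token.text, each token carries attributes that are often useful right after tokenizing. The sketch below reuses the example sentence and collects each token's text, its character offset in the original string (token.idx), and whether it is punctuation (token.is_punct). It uses spacy.blank("en") so that no model download is needed, which is an assumption made for this illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here
nlp = spacy.blank("en")
doc = nlp("Hello, world! Let's tokenize this sentence.")

# One row per token: text, start offset in the input, punctuation flag
rows = [(token.text, token.idx, token.is_punct) for token in doc]

for text, start, is_punct in rows:
    print(f"{text!r:12} start={start:<3} punct={is_punct}")
```

The offsets make it possible to map every token back to its position in the source text, for example when highlighting matches.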

⚠️

Common Pitfalls

Common mistakes when tokenizing with spaCy include:

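Processing text before loading a pipeline: the spacy module itself is not callable, so you need an nlp object from spacy.load() first.
Splitting text manually (for example with str.split()) instead of using spaCy's tokenizer, which misses language-specific rules for punctuation and contractions.
Assuming tokens are always words; punctuation marks are tokens too, and runs of extra whitespace (such as a double space or a newline) can produce whitespace tokens.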

python

import spacy

# Wrong: calling the module instead of a loaded pipeline
# doc = spacy(text)  # TypeError: 'module' object is not callable

# Right way:
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is correct tokenization.")

for token in doc:
    print(token.text)

Output

This
is
correct
tokenization
.
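
The manual-splitting and "tokens are always words" pitfalls can be illustrated with a short comparison. This is a minimal sketch, again assuming a tokenizer-only pipeline from spacy.blank("en"); the sample sentences are made up for the illustration.

```python
import spacy

nlp = spacy.blank("en")  # assumption: tokenizer-only English pipeline

text = "Don't panic, it's fine."

# Naive whitespace splitting leaves punctuation glued to the words
naive = text.split()
print(naive)         # ["Don't", 'panic,', "it's", 'fine.']

# spaCy separates punctuation and splits contractions
spacy_tokens = [token.text for token in nlp(text)]
print(spacy_tokens)  # ['Do', "n't", 'panic', ',', 'it', "'s", 'fine', '.']

# A single space is stored as trailing whitespace on the previous token,
# but an extra space becomes a token of its own
spaced = nlp("Hello  world")  # note the two spaces
ws = [(token.text, token.is_space) for token in spaced]
print(ws)            # [('Hello', False), (' ', True), ('world', False)]

# Tokenization is non-destructive: the original text can be rebuilt
rebuilt = "".join(token.text_with_ws for token in spaced)
```

Filtering on token.is_space or token.is_punct is a simple way to drop such tokens when only words are wanted.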

📊

Quick Reference

Step	Code
Import spaCy	import spacy
Load model	nlp = spacy.load('en_core_web_sm')
Process text	doc = nlp(text)
Iterate tokens	for token in doc: print(token.text)
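
The steps in the table can be wrapped in a small reusable function. The sketch below is one possible arrangement, not a prescribed pattern: tokenize and its keep_punct parameter are hypothetical names introduced here, and spacy.blank("en") stands in for a trained model. It also shows nlp.pipe, which processes several texts in one call.

```python
import spacy

# Load the pipeline once and reuse it (assumption: tokenizer-only pipeline)
nlp = spacy.blank("en")

def tokenize(text, keep_punct=True):
    """Hypothetical helper: return the token strings for one text."""
    doc = nlp(text)
    return [token.text for token in doc if keep_punct or not token.is_punct]

print(tokenize("This is correct tokenization."))
print(tokenize("This is correct tokenization.", keep_punct=False))

# nlp.pipe yields one Doc per input text
texts = ["Load once.", "Reuse often!"]
batch = [[token.text for token in doc] for doc in nlp.pipe(texts)]
print(batch)
```

Creating the nlp object once at module level avoids reloading the pipeline on every call.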

✅

Key Takeaways

Always load a spaCy language model before tokenizing text.

Tokenization splits text into words, punctuation, and symbols as tokens.

Iterate over the Doc object to access each token's text.

Avoid manual splitting; spaCy's tokenizer applies language-specific rules for punctuation and contractions.

Tokens include punctuation and special characters, not just words.