
How to Tokenize Text Using spaCy in NLP

To tokenize text using spaCy, first load a language model such as en_core_web_sm, then pass your text to the loaded model to create a Doc object. You can access the tokens by iterating over this Doc object, where each token represents a word, punctuation mark, or symbol.

Syntax

The basic pattern has four steps: import spaCy, load a language model, call the loaded model on your text to get a Doc object, and iterate over the Doc to read its tokens.

  • import spacy: imports the spaCy library.
  • nlp = spacy.load('en_core_web_sm'): loads the small English language model. The model must be installed first, for example with python -m spacy download en_core_web_sm.
  • doc = nlp(text): processes the text and returns a Doc object.
  • for token in doc: iterates over the tokens in the Doc.
python
import spacy

nlp = spacy.load('en_core_web_sm')
text = "Your text here."
doc = nlp(text)

for token in doc:
    print(token.text)
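
If en_core_web_sm is not installed, spacy.load raises an OSError. The following is a minimal sketch of one way to handle that, assuming a tokenizer-only pipeline is an acceptable fallback; the helper name load_pipeline is hypothetical and not part of spaCy.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Return the named pipeline, or a tokenizer-only English pipeline as a fallback.

    load_pipeline is a hypothetical helper for this sketch, not a spaCy function.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        # spacy.blank("en") has no tagger or parser, but it still tokenizes.
        return spacy.blank("en")


nlp = load_pipeline()
tokens = [token.text for token in nlp("Hello, world!")]
print(tokens)  # ['Hello', ',', 'world', '!']
```

The fallback is enough for tokenization only; attributes that depend on trained components, such as part-of-speech tags, need the full model.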

Example

This example shows how to tokenize a simple sentence using spaCy. It prints each token separately, including words and punctuation.

python
import spacy

nlp = spacy.load('en_core_web_sm')
text = "Hello, world! Let's tokenize this sentence."
doc = nlp(text)

for token in doc:
    print(token.text)
Output
Hello
,
world
!
Let
's
tokenize
this
sentence
.
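
Because punctuation marks are tokens too, it is often useful to separate them from the words. The sketch below shows one possible way using the token attributes is_punct and idx. It assumes a tokenizer-only pipeline from spacy.blank("en") so that it runs without a model download; spacy.load('en_core_web_sm') could be used in its place.

```python
import spacy

# Tokenizer-only English pipeline (an assumption of this sketch).
nlp = spacy.blank("en")
doc = nlp("Hello, world! Let's tokenize this sentence.")

# token.is_punct flags punctuation tokens such as "," and "!".
words = [token.text for token in doc if not token.is_punct]
punctuation = [token.text for token in doc if token.is_punct]

print(words)
print(punctuation)

# token.idx is the character offset of the token in the original text.
for token in doc:
    print(token.idx, token.text)
```

Note that 's is kept in the word list: it is the clitic split off from "Let's", not a punctuation token.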

Common Pitfalls

Common mistakes when tokenizing with spaCy include:

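  • Not loading a language model with spacy.load() before processing text, which causes errors.
  • Splitting text manually (for example with str.split()) instead of using spaCy's tokenizer, which misses language-specific rules such as contractions and punctuation attached to words.
  • Assuming tokens are always words; punctuation marks are tokens too, and extra whitespace (such as a double space or a newline) can also produce whitespace tokens.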
python
import spacy

# Wrong: calling the spacy module itself instead of a loaded model
# doc = spacy(text)  # TypeError: 'module' object is not callable

# Right way:
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is correct tokenization.")

for token in doc:
    print(token.text)
Output
This
is
correct
tokenization
.
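
The other two pitfalls can be illustrated with a short comparison. This is a sketch under the assumption that a tokenizer-only pipeline from spacy.blank("en") is sufficient; the sample sentence is made up for illustration and contains a deliberate double space.

```python
import spacy

# Tokenizer-only English pipeline (an assumption of this sketch).
nlp = spacy.blank("en")
text = "Don't stop,  please."  # note the double space after the comma

# Manual splitting keeps punctuation glued to words and hides the contraction.
manual = text.split()
print(manual)  # ["Don't", 'stop,', 'please.']

doc = nlp(text)
spacy_tokens = [token.text for token in doc]
print(spacy_tokens)

# Whitespace-only tokens can appear when the text has extra spaces;
# token.is_space lets you filter them out.
content_tokens = [token.text for token in doc if not token.is_space]
print(content_tokens)

# Tokenization is non-destructive: the pieces rebuild the original text.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Filtering on is_space (and is_punct, if needed) is usually safer than assuming every token is a word.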

Quick Reference

| Step | Code |
|---|---|
| Import spaCy | import spacy |
| Load model | nlp = spacy.load('en_core_web_sm') |
| Process text | doc = nlp(text) |
| Iterate tokens | for token in doc: print(token.text) |

Key Takeaways

  • Always load a spaCy language model before tokenizing text.
  • Tokenization splits text into words, punctuation, and symbols as tokens.
  • Iterate over the Doc object to access each token's text.
  • Avoid manual splitting; spaCy handles language-specific rules.
  • Tokens include punctuation and special characters, not just words.