How to Tokenize Text Using spaCy in NLP
To tokenize text using spaCy, first load a language model like en_core_web_sm, then pass your text to the model to create a Doc object. You can access tokens by iterating over this Doc object, where each token represents a word or punctuation.
Syntax
To tokenize text with spaCy, you use a language model to process the text and create a Doc object. Each token in the Doc represents a word, punctuation, or symbol.
- `import spacy`: imports the spaCy library.
- `nlp = spacy.load('en_core_web_sm')`: loads the English language model.
- `doc = nlp(text)`: processes the text and returns a `Doc` object.
- `for token in doc:`: iterates over tokens in the `Doc`.
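```python
import spacy

# Load the small English pipeline (must be installed beforehand)
nlp = spacy.load('en_core_web_sm')

text = "Your text here."
doc = nlp(text)  # Tokenizes the text and returns a Doc

for token in doc:
    print(token.text)
```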
Example
This example shows how to tokenize a simple sentence using spaCy. It prints each token separately, including words and punctuation.
```python
import spacy

nlp = spacy.load('en_core_web_sm')

text = "Hello, world! Let's tokenize this sentence."
doc = nlp(text)

# Print one token per line
for token in doc:
    print(token.text)
```
Output
Hello
,
world
!
Let
's
tokenize
this
sentence
.
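Since a token can be a word or a punctuation mark, it is often useful to tell the two apart. The sketch below is one way to do that with the token attributes `i` (index in the Doc), `idx` (character offset), `is_alpha`, and `is_punct`. It assumes that a blank English pipeline from `spacy.blank("en")` is acceptable for this purpose: it includes the rule-based English tokenizer but no trained components, so it does not require downloading en_core_web_sm.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because it still
# contains spaCy's rule-based English tokenizer and needs no model download.
nlp = spacy.blank("en")

doc = nlp("Hello, world! Let's tokenize this sentence.")

# Each Token carries its text plus metadata such as position and type flags.
for token in doc:
    print(token.i, token.idx, repr(token.text), token.is_alpha, token.is_punct)

# Filter tokens by type using the lexical flags.
words = [token.text for token in doc if token.is_alpha]
punctuation = [token.text for token in doc if token.is_punct]

print(words)        # alphabetic tokens only; "'s" is excluded
print(punctuation)  # punctuation tokens only
```

Note that the contraction suffix 's is neither alphabetic nor punctuation, so it appears in neither list.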
Common Pitfalls
Common mistakes when tokenizing with spaCy include:
- Not loading a language model before processing text, which causes errors.
- Trying to split text manually instead of using spaCy's tokenizer, which misses language rules.
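- Assuming tokens are always words; punctuation marks are separate tokens, and extra whitespace (such as a double space or a newline) can also become a token. A single space between words is not a token.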
```python
import spacy

# Wrong: Not loading model
# doc = spacy(text)  # This will cause an error (TypeError: the module is not callable)

# Right way: load a model, then call it on the text
nlp = spacy.load('en_core_web_sm')
doc = nlp("This is correct tokenization.")
for token in doc:
    print(token.text)
```
Output
This
is
correct
tokenization
.
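The other two pitfalls in the list can be illustrated with a short comparison. This is a minimal sketch, again assuming a tokenizer-only pipeline from `spacy.blank("en")` rather than en_core_web_sm. It contrasts `str.split()` with spaCy's tokenizer and shows how extra whitespace is handled.

```python
import spacy

# Assumption: tokenizer-only pipeline; no trained model is needed for this demo.
nlp = spacy.blank("en")

text = "Hello, world! Let's tokenize this sentence."

# Manual splitting leaves punctuation glued to words and keeps "Let's" whole.
manual = text.split()
spacy_tokens = [token.text for token in nlp(text)]
print(manual)
print(spacy_tokens)

# Single spaces are not tokens, but extra whitespace shows up as its own token.
spaced = nlp("Hello  world")  # two spaces between the words
print([token.text for token in spaced])
whitespace_tokens = [token for token in spaced if token.is_space]

# Tokenization is non-destructive: the Doc still reproduces the input exactly.
print(spaced.text == "Hello  world")
```

Filtering on `token.is_space` is a simple way to drop whitespace tokens when only words and punctuation are wanted.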
Quick Reference
| Step | Code |
|---|---|
| Import spaCy | `import spacy` |
| Load model | `nlp = spacy.load('en_core_web_sm')` |
| Process text | `doc = nlp(text)` |
| Iterate tokens | `for token in doc: print(token.text)` |
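The four steps in the table can be wrapped into a small helper. The sketch below is illustrative only: `load_pipeline` and `tokenize` are hypothetical names, not part of spaCy. It relies on `spacy.load` raising `OSError` when the model package is not installed, and as an assumption falls back to `spacy.blank("en")`, which still tokenizes English text but has no tagger or parser.

```python
import spacy


def load_pipeline(model_name="en_core_web_sm"):
    """Load a trained pipeline, or fall back to a blank English tokenizer.

    Hypothetical helper for illustration; not part of spaCy.
    """
    try:
        return spacy.load(model_name)
    except OSError:
        # Raised when the model package is not installed. Install it with:
        #   python -m spacy download en_core_web_sm
        return spacy.blank("en")


def tokenize(text, nlp):
    """Return the token texts for `text` using the given pipeline."""
    return [token.text for token in nlp(text)]


# Load once and reuse; loading a pipeline is much slower than running it.
nlp = load_pipeline()
print(tokenize("This is correct tokenization.", nlp))
```

Loading the pipeline once and passing it in keeps repeated calls to `tokenize` cheap.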
Key Takeaways
- Always load a spaCy language model before tokenizing text.
- Tokenization splits text into words, punctuation, and symbols as tokens.
- Iterate over the Doc object to access each token's text.
- Avoid manual splitting; spaCy handles language-specific rules.
- Tokens include punctuation and special characters, not just words.
