How to Use spaCy Matcher in NLP: Syntax and Examples
Use the Matcher class from spaCy to find patterns in text by defining token patterns and adding them to the matcher. Then apply the matcher to a processed Doc object to get matching spans.
Syntax
The Matcher in spaCy requires these steps:
- Import and create a Matcher object with the language model's vocabulary.
- Define patterns as lists of dictionaries specifying token attributes.
- Add patterns to the matcher with a unique ID.
- Call the matcher on a Doc object to get matches.
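```python
import spacy
from spacy.matcher import Matcher

# Load the English pipeline and build a matcher on its vocabulary
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# 'hello' in any casing, optionally followed by one punctuation token
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}]
matcher.add('HELLO_PATTERN', [pattern])

doc = nlp('Hello! How are you?')
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    # match_id is an integer hash; the StringStore maps it back to the pattern name
    print(nlp.vocab.strings[match_id], span.text)
```
Output
HELLO_PATTERN Hello
HELLO_PATTERN Hello!
Because the punctuation token is optional, the matcher reports the match both without and with the trailing "!".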
Example
This example shows how to find the phrase "New York" in text using spaCy's Matcher by defining a pattern for two consecutive proper nouns.
```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Two consecutive tokens tagged as proper nouns
pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
matcher.add('PROPER_NOUN_PAIR', [pattern])

doc = nlp('I visited New York last summer.')
matches = matcher(doc)

# start and end are token indexes, so slicing the Doc gives the matched span
for match_id, start, end in matches:
    span = doc[start:end]
    print(f'Match found: {span.text}')
```
Output
Match found: New York
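A pattern like this can report overlapping spans when more than two matching tokens sit side by side, for example "New York City". The sketch below shows one possible way to keep only the longest non-overlapping spans with spacy.util.filter_spans. It assumes spaCy v3. So that it runs without a trained model, it swaps in a blank English pipeline and the IS_TITLE attribute as a stand-in for POS tags; the pattern name TITLE_RUN and the sample sentence are illustrative only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: a blank pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

# One or more consecutive title-case tokens (a stand-in for proper nouns)
matcher.add('TITLE_RUN', [[{'IS_TITLE': True, 'OP': '+'}]])

doc = nlp('we visited New York City last summer')

# as_spans=True returns Span objects instead of (match_id, start, end) tuples
all_spans = matcher(doc, as_spans=True)
all_texts = [span.text for span in all_spans]

# filter_spans keeps the longest span wherever candidates overlap
longest_texts = [span.text for span in filter_spans(all_spans)]

print(all_texts)
print(longest_texts)
```

The unfiltered list contains every sub-run of the capitalized tokens, while the filtered list keeps only the full run. Whether to filter depends on the task: sometimes every sub-match is wanted.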
Common Pitfalls
Common mistakes when using spaCy Matcher include:
- Not loading the language model before creating the matcher.
- Using incorrect token attribute keys or values in patterns.
- Forgetting to add patterns to the matcher before calling it.
- Not handling multiple matches or overlapping spans properly.
Always check the spaCy documentation for valid token attributes.
```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Wrong pattern key 'LOWERCASE' instead of 'LOWER'
wrong_pattern = [{'LOWERCASE': 'hello'}]

# Correct pattern
correct_pattern = [{'LOWER': 'hello'}]

# Adding correct pattern
matcher.add('HELLO', [correct_pattern])

doc = nlp('Hello there!')
matches = matcher(doc)

for match_id, start, end in matches:
    print(doc[start:end].text)
```
Output
Hello
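One way to surface a misspelled key early is the matcher's validate option. This sketch assumes spaCy v3, where Matcher(nlp.vocab, validate=True) checks each pattern against a schema at the moment it is added. A blank English pipeline is used here only so the snippet runs without a model download, and the name HELLO_WRONG is illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough, since LOWER needs no trained model.
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab, validate=True)

wrong_pattern = [{'LOWERCASE': 'hello'}]   # misspelled attribute key
correct_pattern = [{'LOWER': 'hello'}]

rejected = False
try:
    matcher.add('HELLO_WRONG', [wrong_pattern])
except ValueError as err:
    # spaCy's pattern validation error derives from ValueError
    rejected = True
    print(f'Pattern rejected: {type(err).__name__}')

# The valid pattern is accepted and matches as usual
matcher.add('HELLO', [correct_pattern])
doc = nlp('Hello there!')
texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(texts)
```

Validation adds a small cost each time a pattern is added, so it is probably most useful while developing patterns.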
Quick Reference
Key points to remember when using spaCy Matcher:
- Matcher creation: matcher = Matcher(nlp.vocab)
- Pattern format: List of dicts with token attributes like LOWER, POS, IS_PUNCT
- Adding patterns: matcher.add('ID', [pattern])
- Using matcher: matches = matcher(doc)
- Match output: Tuples of (match_id, start, end) indexes in the Doc
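The start and end values in each tuple are token indexes, not character offsets. The sketch below illustrates the difference; it assumes a blank English pipeline (so no model download is needed) and uses an illustrative pattern name, NEW_YORK.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)
matcher.add('NEW_YORK', [[{'LOWER': 'new'}, {'LOWER': 'york'}]])

text = 'I visited New York'
doc = nlp(text)

# Take the first (and here only) match
match_id, start, end = matcher(doc)[0]
span = doc[start:end]

# The StringStore maps the integer match_id back to the pattern name
name = nlp.vocab.strings[match_id]
print(name, start, end)

# Character offsets are available on the Span, not in the match tuple
print(span.start_char, span.end_char)
print(text[span.start_char:span.end_char])
```

Slicing the Doc with the token indexes gives a Span, and the Span's start_char and end_char can then be used to index the original string.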
Key Takeaways
Create a Matcher with the language model's vocabulary before adding patterns.
Define patterns as lists of token attribute dictionaries to specify what to match.
Add patterns to the matcher with a unique ID before running it on text.
Call the matcher on a processed Doc to get match spans with start and end positions.
Check token attribute keys carefully to avoid common pattern mistakes.
