How to Use spaCy Matcher in NLP: Syntax and Examples

spaCy's Matcher class finds token sequences in text. Define patterns over token attributes, add them to the matcher, then call the matcher on a processed Doc object to get the matches as (match_id, start, end) tuples.

Syntax

The Matcher in spaCy requires these steps:

  • Import and create a Matcher object with the language model's vocabulary.
  • Define patterns as lists of dictionaries specifying token attributes.
  • Add patterns to the matcher with a unique ID.
  • Call the matcher on a Doc object to get matches.
python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}]
matcher.add('HELLO_PATTERN', [pattern])

doc = nlp('Hello! How are you?')
matches = matcher(doc)
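for match_id, start, end in matches:
    span = doc[start:end]
    # match_id is a hash; look up the pattern name in the string store.
    # The optional punctuation token yields both 'Hello' and 'Hello!'.
    print(nlp.vocab.strings[match_id], span.text)
Output
HELLO_PATTERN Hello
HELLO_PATTERN Hello!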

Example

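This example finds the phrase "New York" with a pattern for two consecutive proper nouns. The pattern is not specific to "New York": it matches any pair of adjacent tokens tagged PROPN, and the tags come from the loaded model's tagger, so results can vary between models.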

python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
matcher.add('PROPER_NOUN_PAIR', [pattern])

doc = nlp('I visited New York last summer.')
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    print(f'Match found: {span.text}')
Output
Match found: New York
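
If Span objects are more convenient than index tuples, the matcher can return them directly. The following is a minimal sketch, assuming spaCy v3 (where the `as_spans` argument is available). It uses `spacy.blank("en")` and a `LOWER`-based pattern instead of `POS` so that it runs without a downloaded model; the `NEW_YORK` key is just an illustrative name.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here because the pattern
# only uses lexical attributes set by the tokenizer (no tagger needed).
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("NEW_YORK", [[{"LOWER": "new"}, {"LOWER": "york"}]])

doc = nlp("I visited New York last summer.")

# as_spans=True returns Span objects instead of (match_id, start, end) tuples
spans = matcher(doc, as_spans=True)
for span in spans:
    # span.label_ holds the pattern key as a string
    print(span.label_, span.text, span.start, span.end)
```

Each returned span carries the pattern key in `span.label_`, so no separate lookup of the match ID is needed.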

Common Pitfalls

Common mistakes when using spaCy Matcher include:

  • Creating the matcher without a loaded pipeline: Matcher needs nlp.vocab, and attributes such as POS need the model's tagger.
  • Using incorrect token attribute keys or values in patterns.
  • Forgetting to add patterns to the matcher before calling it.
  • Ignoring multiple or overlapping matches: optional operators such as '?' can produce several matches for the same text.

Always check the spaCy documentation for valid token attributes.
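
The overlapping-match pitfall can be illustrated with a short sketch. It assumes spaCy v3, where `Matcher.add` accepts a `greedy` argument, and uses `spacy.blank("en")` so that no model download is needed. The variable names are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Hello! How are you?")
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}]

# Default behaviour: every possible match is reported, so the optional
# punctuation token yields one match with it and one without it.
default_matcher = Matcher(nlp.vocab)
default_matcher.add("HELLO", [pattern])
all_matches = sorted(doc[s:e].text for _, s, e in default_matcher(doc))
print(all_matches)

# greedy="LONGEST" keeps only the longest of the overlapping matches.
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("HELLO", [pattern], greedy="LONGEST")
longest = [doc[s:e].text for _, s, e in longest_matcher(doc)]
print(longest)
```

Whether to keep all matches or only the longest depends on the task; for span extraction, the longest match is often the one wanted.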

python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

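# Wrong: 'LOWERCASE' is not a valid token attribute ('LOWER' is).
# It is shown for comparison only and is never added to the matcher;
# spaCy v3 is expected to raise an error if it is.
wrong_pattern = [{'LOWERCASE': 'hello'}]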

# Correct pattern
correct_pattern = [{'LOWER': 'hello'}]

# Adding correct pattern
matcher.add('HELLO', [correct_pattern])

doc = nlp('Hello there!')
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)
Output
Hello
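
To catch bad attribute keys early, the matcher can validate patterns as they are added. This is a minimal sketch, assuming spaCy v3 and its `validate=True` option; it catches the broad `ValueError` rather than a more specific exception class, and the `rejected` flag is only there for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# validate=True asks the matcher to check each pattern against its schema
matcher = Matcher(nlp.vocab, validate=True)

try:
    # 'LOWERCASE' is not a supported token attribute
    matcher.add("HELLO", [[{"LOWERCASE": "hello"}]])
    rejected = False
except ValueError as err:
    rejected = True
    print(f"Pattern rejected: {type(err).__name__}")
```

Validation adds some overhead when patterns are added, so it is mostly useful while developing patterns.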

Quick Reference

Key points to remember when using spaCy Matcher:

  • Matcher creation: matcher = Matcher(nlp.vocab)
  • Pattern format: List of dicts with token attributes like LOWER, POS, IS_PUNCT
  • Adding patterns: matcher.add('ID', [pattern])
  • Using matcher: matches = matcher(doc)
  • Match output: List of (match_id, start, end) tuples, where match_id is the hash of the pattern ID and start and end are token indexes in the Doc (end is exclusive)
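
The match tuple from the reference above can be unpacked as in the following sketch. It assumes a blank English pipeline (`spacy.blank("en")`) to avoid a model download, and the sample sentence is made up for illustration. The point is that `start` and `end` count tokens, while character offsets are available on the resulting span.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO", [[{"LOWER": "hello"}]])

doc = nlp("Well, hello there!")
match_id, start, end = matcher(doc)[0]

# match_id is a hash; the string store maps it back to the pattern ID
name = nlp.vocab.strings[match_id]

# start/end are token indexes (end exclusive); the span exposes char offsets
span = doc[start:end]
print(name, start, end, span.start_char, span.end_char)
```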

Key Takeaways

Create a Matcher with the language model's vocabulary before adding patterns.
Define patterns as lists of token attribute dictionaries to specify what to match.
Add patterns to the matcher with a unique ID before running it on text.
Call the matcher on a processed Doc to get matches with start and end token positions.
Check token attribute keys carefully to avoid common pattern mistakes.