
How to Use spaCy Matcher in NLP: Syntax and Examples

Use the Matcher class from spaCy to find patterns in text by defining token patterns and adding them to the matcher. Then call the matcher on a processed Doc object to get the matches as (match_id, start, end) tuples, which you can slice into spans.

📐

Syntax

The Matcher in spaCy requires these steps:

Import and create a Matcher object with the language model's vocabulary.
Define patterns as lists of dictionaries specifying token attributes.
Add patterns to the matcher with a unique ID.
Call the matcher on a Doc object to get matches.

python

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}]
matcher.add('HELLO_PATTERN', [pattern])

doc = nlp('Hello! How are you?')
matches = matcher(doc)
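# The optional punctuation token yields two overlapping matches; sort them for stable output
for match_id, start, end in sorted(matches, key=lambda m: (m[1], m[2])):
    span = doc[start:end]
    # match_id is a hash; look up its readable name in the vocabulary's StringStore
    print(nlp.vocab.strings[match_id], span.text)

Output

HELLO_PATTERN Hello
HELLO_PATTERN Hello!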

💻

Example

This example finds the phrase "New York" with spaCy's Matcher by defining a pattern for two consecutive proper nouns. The pattern matches any such pair, and the result depends on the part-of-speech tags the loaded model predicts.

python

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
matcher.add('PROPER_NOUN_PAIR', [pattern])

doc = nlp('I visited New York last summer.')
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    print(f'Match found: {span.text}')

Output

Match found: New York
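
Because the POS pattern depends on the tagger, it can be useful to compare it with a pattern that matches the literal phrase. The following is a minimal sketch, assuming a blank English pipeline (spacy.blank('en')) is enough here, since LOWER is a lexical attribute and needs no trained model. The pattern name NEW_YORK is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline (tokenizer only) is sufficient for LOWER patterns
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

# Match the two tokens "new" + "york", ignoring case
pattern = [{'LOWER': 'new'}, {'LOWER': 'york'}]
matcher.add('NEW_YORK', [pattern])

doc = nlp('I visited New York last summer.')
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

The trade-off is that this version only finds "New York" (in any casing), while the POS version generalizes to other proper noun pairs.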

⚠️

Common Pitfalls

Common mistakes when using spaCy Matcher include:

Creating the matcher before loading the language model, or with a vocabulary other than the one used to process the text.
Using incorrect token attribute keys or values in patterns.
Forgetting to add patterns to the matcher before calling it.
Not handling multiple matches or overlapping spans properly.
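
To illustrate the last pitfall: a pattern with an optional token can produce overlapping matches over the same text. One possible way to keep only the longest span is spacy.util.filter_spans. This is a sketch under the assumption of a blank English pipeline; the pattern name is illustrative.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

# The optional punctuation token lets the pattern match with or without '!'
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}]
matcher.add('HELLO_PATTERN', [pattern])

doc = nlp('Hello! How are you?')
spans = [doc[start:end] for _, start, end in matcher(doc)]

# filter_spans removes overlapping spans, preferring the longest one
longest = filter_spans(spans)
print([span.text for span in longest])
```

In spaCy v3, passing greedy='LONGEST' to matcher.add is another option that should have a similar effect.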

Always check the spaCy documentation for valid token attributes.

python

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Wrong: 'LOWERCASE' is not a valid token attribute (shown for contrast, never added)
wrong_pattern = [{'LOWERCASE': 'hello'}]

# Correct pattern
correct_pattern = [{'LOWER': 'hello'}]

# Adding correct pattern
matcher.add('HELLO', [correct_pattern])

doc = nlp('Hello there!')
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)

Output

Hello
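
spaCy can also flag a misspelled key for you. As a sketch, assuming spaCy v3 and a blank English pipeline, creating the matcher with validate=True makes matcher.add raise an error for the invalid pattern. The variable error_raised is only there for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
# validate=True checks every pattern against spaCy's token pattern schema
matcher = Matcher(nlp.vocab, validate=True)

error_raised = False
try:
    # 'LOWERCASE' is not a valid token attribute, so this pattern is rejected
    matcher.add('HELLO', [[{'LOWERCASE': 'hello'}]])
except ValueError as err:
    error_raised = True
    print('Pattern rejected:', type(err).__name__)
```

Validation adds some overhead when patterns are added, so it is mostly useful while developing and debugging patterns.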

📊

Quick Reference

Key points to remember when using spaCy Matcher:

Matcher creation: matcher = Matcher(nlp.vocab)
Pattern format: List of dicts with token attributes like LOWER, POS, IS_PUNCT
Adding patterns: matcher.add('ID', [pattern])
Using matcher: matches = matcher(doc)
Match output: List of (match_id, start, end) tuples, where start and end are token indexes into the Doc
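
The points above can be sketched in a few lines. This example assumes a blank English pipeline and a made-up pattern name, GREETING. It unpacks each match tuple, resolves the hashed match_id back to its string name, and slices the Doc with the token indexes.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)
matcher.add('GREETING', [[{'LOWER': 'hello'}]])

doc = nlp('Hello there!')
results = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the StringStore maps it back to the pattern name
    label = nlp.vocab.strings[match_id]
    # start and end are token indexes, so they can slice the Doc directly
    results.append((label, start, end, doc[start:end].text))

print(results)
```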

✅

Key Takeaways

Create a Matcher with the language model's vocabulary before adding patterns.

Define patterns as lists of token attribute dictionaries to specify what to match.

Add patterns to the matcher with a unique ID before running it on text.

Call the matcher on a processed Doc to get match spans with start and end positions.

Check token attribute keys carefully to avoid common pattern mistakes.