Information extraction patterns in NLP

Introduction

Information extraction patterns help computers find useful facts in text. They make it easier to pick out names, dates, or places automatically.

Typical use cases:

You want to find all names of people mentioned in news articles.
You need to extract dates and times from emails to schedule meetings.
You want to pull out product names and prices from online reviews.
You want to identify locations mentioned in travel blogs.
You want to organize large documents by extracting key facts like company names or events.
Syntax
NLP
pattern = [{'LOWER': 'name'}, {'IS_PUNCT': True, 'OP': '?'}, {'ENT_TYPE': 'PERSON'}]

A pattern is a list of dictionaries. Each dictionary describes one token and the features that token must have.

A dictionary can check the token's text, whether it is punctuation, its entity type, and other token attributes.
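To see how the dictionaries line up with tokens, here is a minimal sketch that applies the syntax pattern with spaCy's Matcher. It assumes a blank English pipeline and marks the person entity by hand, so the result does not depend on a trained model's predictions. The sample text, the 'NAME_FIELD' key, and the manual PERSON span are illustrative assumptions.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained components (assumption for this sketch)
nlp = spacy.blank("en")
doc = nlp("Name: Alice")  # three tokens: 'Name', ':', 'Alice'

# Mark 'Alice' as a PERSON entity by hand; a trained model would normally do this
doc.ents = [Span(doc, 2, 3, label="PERSON")]

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'name'}, {'IS_PUNCT': True, 'OP': '?'}, {'ENT_TYPE': 'PERSON'}]
matcher.add('NAME_FIELD', [pattern])

# Each match is (match_id, start_token, end_token)
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

The first dictionary consumes 'Name', the optional second one consumes ':', and the third one requires a token whose entity type is PERSON.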

Examples
This pattern finds phrases like 'born 1990 in Paris'.
NLP
pattern = [{'LOWER': 'born'}, {'IS_DIGIT': True}, {'LOWER': 'in'}, {'ENT_TYPE': 'GPE'}]
This finds organization names followed by the word 'headquarters'.
NLP
pattern = [{'ENT_TYPE': 'ORG'}, {'LOWER': 'headquarters'}]
This matches one or more alphabetic tokens followed by 'street', which is useful for addresses.
NLP
pattern = [{'IS_ALPHA': True, 'OP': '+'}, {'LOWER': 'street'}]
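A pattern with 'OP': '+' can match the same text in several overlapping ways. The sketch below runs the street pattern on a made-up sentence, once with default settings and once with the `greedy='LONGEST'` option of `Matcher.add`. It assumes spaCy v3 and a blank English pipeline, since no entity recognition is needed here.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; no trained model required
doc = nlp("She lives on Baker street.")

pattern = [{'IS_ALPHA': True, 'OP': '+'}, {'LOWER': 'street'}]

# Default behaviour: every span that satisfies the pattern is returned
matcher_all = Matcher(nlp.vocab)
matcher_all.add('STREET', [pattern])
all_spans = [doc[s:e].text for _, s, e in matcher_all(doc)]

# greedy='LONGEST' keeps only the longest of the overlapping matches
matcher_longest = Matcher(nlp.vocab)
matcher_longest.add('STREET', [pattern], greedy='LONGEST')
longest_spans = [doc[s:e].text for _, s, e in matcher_longest(doc)]

print(all_spans)
print(longest_spans)
```

Because any alphabetic token qualifies, the longest match reaches back to the start of the sentence. For real addresses a tighter first token (for example one that must be capitalized) would likely be needed.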
Sample Model

This program uses a pattern to find phrases like 'born 1879 in Ulm' in text. It prints each matched phrase.

NLP
import spacy
from spacy.matcher import Matcher

# Load small English model
nlp = spacy.load('en_core_web_sm')

# Create matcher object
matcher = Matcher(nlp.vocab)

# Define pattern to find 'born' followed by a year and a place
pattern = [
    {'LOWER': 'born'},
    {'IS_DIGIT': True},
    {'LOWER': 'in'},
    {'ENT_TYPE': 'GPE'}
]

matcher.add('BORN_PATTERN', [pattern])

text = "Albert Einstein was born 1879 in Ulm. Marie Curie was born 1867 in Warsaw."

# Process text
doc = nlp(text)

# Find matches
matches = matcher(doc)

# Print matched spans
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Matched phrase: '{span.text}'")
Important Notes

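Matching on the exact token text (the 'TEXT' attribute) is case sensitive. Use 'LOWER' to compare against each token's lowercase form, which makes the match case insensitive.

Use 'OP' to specify how many times a pattern part can repeat: '?' means zero or one time, '+' one or more times, and '*' zero or more times.

Patterns that use 'ENT_TYPE' depend on the entity labels predicted by the language model, so they work best with a model that reliably recognizes entities like people or places.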
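The notes on case sensitivity and 'OP' can be checked with a short sketch. It assumes a blank English pipeline; the sample sentence and the `count_matches` helper are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Born in Ulm. He was born in 1879.")

def count_matches(pattern):
    """Return how many spans in doc match a single pattern."""
    matcher = Matcher(nlp.vocab)
    matcher.add('DEMO', [pattern])
    return len(matcher(doc))

exact = count_matches([{'TEXT': 'born'}])    # case sensitive: skips 'Born'
folded = count_matches([{'LOWER': 'born'}])  # compares lowercase forms

# '?' makes the 'was' token optional, so 'born' matches with or without it
optional = count_matches([{'LOWER': 'was', 'OP': '?'}, {'LOWER': 'born'}])

print(exact, folded, optional)
```

The 'TEXT' pattern finds only the lowercase occurrence, while the 'LOWER' pattern finds both.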

Summary

Information extraction patterns help find specific facts in text automatically.

They use lists of word features to describe what to look for.

Patterns can find names, dates, places, and more by matching text and entity types.

Practice

1. What is the main purpose of information extraction patterns in NLP?
easy
A. To automatically find specific facts like names or dates in text
B. To translate text from one language to another
C. To generate new sentences from given words
D. To summarize long documents into short paragraphs

Solution

  1. Step 1: Understand the role of information extraction patterns

    These patterns are designed to locate specific pieces of information such as names, dates, or places within text automatically.
  2. Step 2: Compare with other NLP tasks

    Translation, generation, and summarization are different NLP tasks and do not focus on extracting facts.
  3. Final Answer:

    To automatically find specific facts like names or dates in text -> Option A
  4. Quick Check:

    Information extraction = find facts [OK]
Hint: Patterns extract facts, not translate or summarize [OK]
Common Mistakes:
  • Confusing extraction with translation
  • Thinking patterns generate new text
  • Mixing extraction with summarization
2. Which of the following is a correct example of a simple pattern to extract dates in text?
easy
A. \b[A-Z]{2,}\b (matches uppercase words)
B. \b\d{4}-\d{2}-\d{2}\b (matches YYYY-MM-DD format)
C. \w+@\w+\.com (matches email addresses)
D. \d+\s+\w+ (matches any number followed by a word)

Solution

  1. Step 1: Identify the pattern for dates

    The pattern \b\d{4}-\d{2}-\d{2}\b matches a 4-digit year, 2-digit month, and 2-digit day separated by dashes, which is a common date format.
  2. Step 2: Check other options

    Option D (\d+\s+\w+) matches any number followed by a word, which is too general; Option C matches email addresses; Option A matches uppercase words, not dates.
  3. Final Answer:

    \b\d{4}-\d{2}-\d{2}\b (matches YYYY-MM-DD format) -> Option B
  4. Quick Check:

    Date pattern = \b\d{4}-\d{2}-\d{2}\b (matches YYYY-MM-DD format) [OK]
Hint: Look for year-month-day format in regex [OK]
Common Mistakes:
  • Choosing patterns that match emails or words instead of dates
  • Ignoring word boundaries \b in regex
  • Confusing number patterns with date formats
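As a quick check of the date pattern, the sketch below runs it with Python's `re` module on a made-up sentence. The regex only checks the shape of the string, so the sketch adds a calendar check with `date.fromisoformat`; that extra step and the `is_real_date` helper are assumptions for illustration, not part of the question.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

text = "Kickoff on 2024-03-15, review on 2024-13-40, ticket 12345-67-89."

# Strings with the YYYY-MM-DD shape; \b rejects the five-digit ticket number
candidates = DATE_RE.findall(text)

def is_real_date(s):
    """Return True if s is a valid calendar date in YYYY-MM-DD form."""
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False

valid = [c for c in candidates if is_real_date(c)]

print(candidates)
print(valid)
```

The pattern accepts '2024-13-40' because it has the right shape, which is why a separate validity check is useful when the extracted dates feed into scheduling.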
3. Given this pattern to extract person names: \b(Mr|Ms|Dr)\.\s+[A-Z][a-z]+\b, what will be the output when applied to the text: "Dr. Smith and Mr. Johnson went to the park."?
medium
A. ["Dr", "Mr"]
B. ["Smith", "Johnson"]
C. ["Dr. Smith", "Mr. Johnson"]
D. [] (empty list)

Solution

  1. Step 1: Understand the regex pattern

    The pattern matches a title (Mr, Ms, or Dr) followed by a dot, one or more whitespace characters, and a capitalized word.
  2. Step 2: Apply pattern to the text

    In the text, "Dr. Smith" and "Mr. Johnson" both match the pattern exactly.
  3. Final Answer:

    ["Dr. Smith", "Mr. Johnson"] -> Option C
  4. Quick Check:

    Pattern matches title + name = ["Dr. Smith", "Mr. Johnson"] [OK]
Hint: Match title + dot + space + capitalized name [OK]
Common Mistakes:
  • Extracting only last names without titles
  • Extracting only titles without names
  • Getting empty results due to pattern mismatch
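One practical caveat when trying this pattern in Python: `re.findall` returns only the captured group when the pattern contains one, so it would list the titles rather than the full matches. The sketch below, which assumes Python's standard `re` module, shows two ways to get the full matched text.

```python
import re

text = "Dr. Smith and Mr. Johnson went to the park."

capturing = r"\b(Mr|Ms|Dr)\.\s+[A-Z][a-z]+\b"
non_capturing = r"\b(?:Mr|Ms|Dr)\.\s+[A-Z][a-z]+\b"

# With a capturing group, findall returns the group contents only
groups_only = re.findall(capturing, text)

# finditer exposes the whole match through group(0)
full_matches = [m.group(0) for m in re.finditer(capturing, text)]

# A non-capturing group (?:...) lets findall return the whole match
same_with_findall = re.findall(non_capturing, text)

print(groups_only)
print(full_matches)
print(same_with_findall)
```

The answer in Option C corresponds to the full matches, which is what `finditer` with `group(0)` or the non-capturing variant produces.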
4. Identify the error in this pattern meant to extract email addresses: \b[\w.-]+@[\w.-]+\b
medium
A. It misses the domain extension like .com or .org
B. It uses incorrect character classes for emails
C. It does not match the '@' symbol
D. It matches only uppercase letters

Solution

  1. Step 1: Analyze the pattern components

    The pattern matches word characters, dots, or dashes before and after '@', but it ends at a word boundary without requiring a domain extension such as '.com'.
  2. Step 2: Identify missing part

    Valid emails usually end with a domain extension (e.g., '.com'), which this pattern does not enforce, so it may match incomplete emails.
  3. Final Answer:

    It misses the domain extension like .com or .org -> Option A
  4. Quick Check:

    Email pattern missing domain extension = It misses the domain extension like .com or .org [OK]
Hint: Check if pattern includes domain extensions like .com [OK]
Common Mistakes:
  • Assuming '@' is not matched
  • Thinking character classes are wrong
  • Ignoring domain extension importance
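The effect of the missing extension can be seen by comparing the original pattern with a stricter variant. The stricter pattern below, which requires a dot and at least two letters at the end, is an illustrative assumption and not a complete email validator; the sample text is made up.

```python
import re

text = "Contact admin@localhost or jane.doe@example.org today."

# Original pattern: no domain extension required
loose = re.compile(r"\b[\w.-]+@[\w.-]+\b")

# Assumed stricter variant: the domain must end in a dot plus two or more letters
strict = re.compile(r"\b[\w.-]+@[\w.-]+\.[A-Za-z]{2,}\b")

loose_hits = loose.findall(text)
strict_hits = strict.findall(text)

print(loose_hits)
print(strict_hits)
```

The original pattern accepts 'admin@localhost', while the stricter variant keeps only the address with a domain extension.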
5. You want to extract locations from text using patterns that match city names followed by state abbreviations, like "Austin TX" or "Denver CO". Which pattern best fits this task?
hard
A. \b\w+@\w+\.com\b (email addresses)
B. \b\d{5}\b (five digit numbers)
C. \b[A-Z]{2,}\b (two or more uppercase letters only)
D. \b[A-Z][a-z]+\s+[A-Z]{2}\b (capitalized city name + space + two uppercase letters)

Solution

  1. Step 1: Understand the location format

    Locations are city names starting with a capital letter followed by a two-letter uppercase state abbreviation.
  2. Step 2: Match pattern to format

    Pattern \b[A-Z][a-z]+\s+[A-Z]{2}\b matches a capitalized word, a space, then exactly two uppercase letters, fitting the example.
  3. Final Answer:

    \b[A-Z][a-z]+\s+[A-Z]{2}\b (capitalized city name + space + two uppercase letters) -> Option D
  4. Quick Check:

    City + state abbreviation pattern = \b[A-Z][a-z]+\s+[A-Z]{2}\b (capitalized city name + space + two uppercase letters) [OK]
Hint: City capitalized + space + 2 uppercase letters [OK]
Common Mistakes:
  • Choosing patterns for zip codes or emails
  • Matching only uppercase words without city name
  • Ignoring space between city and state
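The city and state pattern can be tried with Python's `re` module as sketched below. The sample sentences are made up, and the second one is included to show a limitation: the pattern allows only a single capitalized word for the city.

```python
import re

CITY_STATE = re.compile(r"\b[A-Z][a-z]+\s+[A-Z]{2}\b")

# Single-word city names are captured together with the state abbreviation
simple = CITY_STATE.findall("Flights from Austin TX to Denver CO leave daily.")

# Multi-word city names are cut down to their last word
multi_word = CITY_STATE.findall("She moved to San Antonio TX last year.")

print(simple)
print(multi_word)
```

For cities such as San Antonio, the pattern would need to allow more than one capitalized word before the state abbreviation.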