from spacy.matcher import Matcher
import spacy

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
pattern = [{'ENT_TYPE': '[1]'}]
matcher.add('DATE_PATTERN', [pattern])
doc = nlp('We met on January 10th, 2023.')
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Google was founded in September 1998 by Larry Page and Sergey Brin.')
entities = [1]
for ent in doc.ents:
    entities[ent.[2]] = ent.[3]
print(entities)

Practice


1. What is the main purpose of information extraction patterns in NLP?

easy

A. To automatically find specific facts like names or dates in text

B. To translate text from one language to another

C. To generate new sentences from given words

D. To summarize long documents into short paragraphs

Solution

Step 1: Understand the role of information extraction patterns
These patterns are designed to locate specific pieces of information such as names, dates, or places within text automatically.
Step 2: Compare with other NLP tasks
Translation, generation, and summarization are different NLP tasks and do not focus on extracting facts.
Final Answer:
To automatically find specific facts like names or dates in text -> Option A
Quick Check:
Information extraction = find facts [OK]

Hint: Patterns extract facts, not translate or summarize [OK]

Common Mistakes:

Confusing extraction with translation
Thinking patterns generate new text
Mixing extraction with summarization

2. Which of the following is a correct example of a simple pattern to extract dates in text?

easy

A. \b[A-Z]{2,}\b (matches uppercase words)

B. \b\d{4}-\d{2}-\d{2}\b (matches YYYY-MM-DD format)

C. \w+@\w+\.com (matches email addresses)

D. \d+\s+\w+ (matches any number followed by a word)

Solution

Step 1: Identify the pattern for dates
The pattern \b\d{4}-\d{2}-\d{2}\b matches a 4-digit year, 2-digit month, and 2-digit day separated by dashes, which is a common date format.
Step 2: Check other options
Option D matches any number followed by a word, which is too general for dates; option C matches email addresses; option A matches uppercase words, not dates.
Final Answer:
\b\d{4}-\d{2}-\d{2}\b (matches YYYY-MM-DD format) -> Option B
Quick Check:
Date pattern = \b\d{4}-\d{2}-\d{2}\b (matches YYYY-MM-DD format) [OK]

Hint: Look for year-month-day format in regex [OK]

Common Mistakes:

Choosing patterns that match emails or words instead of dates
Ignoring word boundaries \b in regex
Confusing number patterns with date formats
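The date pattern from option B can be tried out with Python's built-in re module. This is a minimal sketch; the sample sentence is made up for illustration. It also shows that the pattern only checks the shape of the string, not whether the date is a real calendar date.

```python
import re

# YYYY-MM-DD pattern from option B
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Hypothetical sample text, used only for illustration
text = "Invoice 20231 was issued on 2023-01-10 and paid on 2023-02-28."

dates = DATE_PATTERN.findall(text)
print(dates)  # ['2023-01-10', '2023-02-28']

# The pattern validates the format only, so an impossible date still matches
shape_only = DATE_PATTERN.findall("Due on 2023-99-99.")
print(shape_only)  # ['2023-99-99']
```

If real calendar dates are required, a follow-up check (for example with datetime.date.fromisoformat) could be applied to each match.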

3. Given this pattern to extract person names: \b(Mr|Ms|Dr)\.\s+[A-Z][a-z]+\b, which full matches does it produce when applied to the text: "Dr. Smith and Mr. Johnson went to the park."?

medium

A. ["Dr", "Mr"]

B. ["Smith", "Johnson"]

C. ["Dr. Smith", "Mr. Johnson"]

D. [] (empty list)

Solution

Step 1: Understand the regex pattern
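The pattern matches a title (Mr, Ms, or Dr) followed by a literal dot, one or more whitespace characters, and a capitalized word.
Step 2: Apply pattern to the text
In the text, "Dr. Smith" and "Mr. Johnson" are the two full matches. Note that Python's re.findall would return only the captured titles, ["Dr", "Mr"], because the pattern contains a capturing group; use re.finditer with group(0), or a non-capturing group (?:Mr|Ms|Dr), to get the full matches.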
Final Answer:
["Dr. Smith", "Mr. Johnson"] -> Option C
Quick Check:
Pattern matches title + name = ["Dr. Smith", "Mr. Johnson"] [OK]

Hint: Match title + dot + space + capitalized name [OK]

Common Mistakes:

Extracting only last names without titles
Extracting only titles without names
Getting empty results due to pattern mismatch
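How the title-plus-name pattern behaves depends on which re function is called. The sketch below, using only Python's standard library, compares re.findall with the original capturing group, re.finditer, and a non-capturing variant of the same pattern. The variable names are chosen here for illustration.

```python
import re

text = "Dr. Smith and Mr. Johnson went to the park."
pattern = r"\b(Mr|Ms|Dr)\.\s+[A-Z][a-z]+\b"

# With one capturing group, findall returns only what the group captured
titles_only = re.findall(pattern, text)
print(titles_only)  # ['Dr', 'Mr']

# finditer yields match objects; group(0) is the full matched text
full_matches = [m.group(0) for m in re.finditer(pattern, text)]
print(full_matches)  # ['Dr. Smith', 'Mr. Johnson']

# A non-capturing group (?:...) makes findall return the full match
non_capturing = re.findall(r"\b(?:Mr|Ms|Dr)\.\s+[A-Z][a-z]+\b", text)
print(non_capturing)  # ['Dr. Smith', 'Mr. Johnson']
```

The non-capturing form is usually the simplest choice when only the whole match is needed.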

4. Identify the error in this pattern meant to extract email addresses: \b[\w.-]+@[\w.-]+\b

medium

A. It misses the domain extension like .com or .org

B. It uses incorrect character classes for emails

C. It does not match the '@' symbol

D. It matches only uppercase letters

Solution

Step 1: Analyze the pattern components
The character class [\w.-]+ matches word characters, dots, or hyphens on both sides of '@', so a dotted domain can match, but nothing requires the domain to end in an extension such as '.com'.
Step 2: Identify missing part
Valid email addresses usually end with a domain extension (e.g., '.com'). Because this pattern does not enforce one, it also matches incomplete addresses that have no extension at all.
Final Answer:
It misses the domain extension like .com or .org -> Option A
Quick Check:
Email pattern missing domain extension = It misses the domain extension like .com or .org [OK]

Hint: Check if pattern includes domain extensions like .com [OK]

Common Mistakes:

Assuming '@' is not matched
Thinking character classes are wrong
Ignoring domain extension importance
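The effect of the missing extension can be seen by comparing the original pattern with a tightened one. The stricter pattern below is one possible fix assumed here for illustration, not the only one and not a full email validator; the sample text is also made up.

```python
import re

# Original pattern from the question: no domain extension required
LOOSE = re.compile(r"\b[\w.-]+@[\w.-]+\b")

# Assumed tightening: require a dot followed by at least two letters at the end
STRICT = re.compile(r"\b[\w.-]+@[\w.-]+\.[A-Za-z]{2,}\b")

# Hypothetical sample text
text = "Contact admin@localhost or jane.doe@example.com for help."

loose_matches = LOOSE.findall(text)
print(loose_matches)  # ['admin@localhost', 'jane.doe@example.com']

strict_matches = STRICT.findall(text)
print(strict_matches)  # ['jane.doe@example.com']
```

The loose pattern accepts the address without an extension, while the stricter one keeps only the address that ends in '.com'.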

5. You want to extract locations from text using patterns that match city names followed by state abbreviations, like "Austin TX" or "Denver CO". Which pattern best fits this task?

hard

A. \b\w+@\w+\.com\b (email addresses)

B. \b\d{5}\b (five digit numbers)

C. \b[A-Z]{2,}\b (two or more uppercase letters only)

D. \b[A-Z][a-z]+\s+[A-Z]{2}\b (capitalized city name + space + two uppercase letters)

Solution

Step 1: Understand the location format
Locations are city names starting with a capital letter followed by a two-letter uppercase state abbreviation.
Step 2: Match pattern to format
Pattern \b[A-Z][a-z]+\s+[A-Z]{2}\b matches a capitalized word, one or more whitespace characters, then exactly two uppercase letters, which fits both examples.
Final Answer:
\b[A-Z][a-z]+\s+[A-Z]{2}\b (capitalized city name + space + two uppercase letters) -> Option D
Quick Check:
City + state abbreviation pattern = \b[A-Z][a-z]+\s+[A-Z]{2}\b (capitalized city name + space + two uppercase letters) [OK]

Hint: City capitalized + space + 2 uppercase letters [OK]

Common Mistakes:

Choosing patterns for zip codes or emails
Matching only uppercase words without city name
Ignoring space between city and state
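The city-plus-state pattern from option D can be checked against the examples in the question. This is a minimal sketch with made-up sample sentences. It also shows a limitation: the pattern assumes a single-word city name, so a multi-word city is only partly captured.

```python
import re

# Pattern from option D: capitalized word, whitespace, two uppercase letters
LOCATION = re.compile(r"\b[A-Z][a-z]+\s+[A-Z]{2}\b")

# Hypothetical sample text built from the examples in the question
text = "She moved from Austin TX to Denver CO last year."

locations = LOCATION.findall(text)
print(locations)  # ['Austin TX', 'Denver CO']

# Limitation: only the last word of a multi-word city name is captured
partial = LOCATION.findall("He lives in New York NY.")
print(partial)  # ['York NY']
```

Handling multi-word cities would need a broader pattern or a lookup list of known city names.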
