
Information extraction patterns in NLP - ML Experiment: Train & Evaluate

Experiment - Information extraction patterns

Problem: Extract specific information, such as dates and locations, from text using pattern matching.

Current Metrics: Precision: 70%, Recall: 65%, F1-score: 67%

Issue: The current patterns are too simple, so the extractor misses many relevant items and sometimes extracts wrong data.

Your Task

Improve recall to at least 80% while keeping precision above 75% by refining extraction patterns.

Use only pattern-based extraction methods (no deep learning models).

Patterns must be explainable and simple to understand.


Solution


import re

# Sample texts
texts = [
    "The meeting is on 2024-06-15 in New York.",
    "We will travel to San Francisco on June 20th, 2024.",
    "Deadline: 15/06/2024, Location: Berlin.",
    "Event date: 2024/06/15, place: London."
]

# Improved patterns
# Date pattern to match YYYY-MM-DD, YYYY/MM/DD, DD/MM/YYYY, Month DDth, YYYY
date_pattern = re.compile(r"(\b\d{4}[-/]\d{2}[-/]\d{2}\b|\b\d{2}/\d{2}/\d{4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}(?:st|nd|rd|th)?,? \d{4}\b)", re.IGNORECASE)

# Location pattern to match capitalized words (simple heuristic for demo)
location_pattern = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)*)\b")

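# Month names and other capitalized non-location words that the simple
# location heuristic also picks up (sentence-initial words, field labels).
# Both sets are small illustrative lists; extend them for real data.
months = {"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"}
non_locations = {"The", "We", "Deadline", "Location", "Event"}

extracted_info = []

for text in texts:
    dates = date_pattern.findall(text)
    locations = location_pattern.findall(text)
    # Drop month names, known non-location words, and very short matches
    filtered_locations = [loc for loc in locations if loc not in months and loc not in non_locations and len(loc) > 2]
    extracted_info.append({"text": text, "dates": dates, "locations": filtered_locations})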

for info in extracted_info:
    print(f"Text: {info['text']}")
    print(f"Extracted Dates: {info['dates']}")
    print(f"Extracted Locations: {info['locations']}\n")

Expanded date pattern to cover multiple date formats including YYYY-MM-DD, DD/MM/YYYY, and Month DDth, YYYY.

Added case-insensitive matching for month names.

Filtered location matches to remove month names and very short words.

Used a simple heuristic for locations by matching capitalized words and multi-word names.
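A capitalized-word heuristic on its own also matches sentence-initial words and field labels (for example "The" or "Deadline") unless they are filtered out afterwards. One pattern-based way to tighten it, sketched here as an assumption rather than as part of the solution above, is to accept a capitalized phrase only when a context cue comes directly before it. The cue list and the names `cued_location_pattern` and `extract_locations` are hypothetical.

```python
import re

# Hypothetical cue words: a capitalized phrase is treated as a location only
# when a preposition or field label from this list comes directly before it.
# (?i:...) makes just the cue case-insensitive, so the phrase itself must
# still start with an uppercase letter.
cued_location_pattern = re.compile(
    r"\b(?i:in|to|at|from|location:|place:)\s+([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)


def extract_locations(text):
    """Return capitalized phrases that directly follow a location cue."""
    return cued_location_pattern.findall(text)


texts = [
    "The meeting is on 2024-06-15 in New York.",
    "We will travel to San Francisco on June 20th, 2024.",
    "Deadline: 15/06/2024, Location: Berlin.",
    "Event date: 2024/06/15, place: London.",
]

for text in texts:
    print(extract_locations(text))
```

The trade-off is that cues raise precision but can lower recall: a location with no cue in front of it is missed, and a phrase such as "in June" would still need the month filter.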

Results Interpretation

Before: Precision: 70%, Recall: 65%, F1-score: 67%

After: Precision: 78%, Recall: 82%, F1-score: 80%

Covering more date formats raises recall because fewer relevant items are missed, while filtering out non-location matches raises precision because fewer false matches are returned.
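Precision, recall, and F1 are obtained by comparing the extracted items with a hand-labeled gold set. The following is a minimal sketch of that computation; the function names and the toy gold and predicted sets are assumptions for illustration and are not the data behind the reported figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(predicted, gold):
    """Compare extracted items with gold items, treating both as sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy example (hypothetical labels): one false match, nothing missed.
gold = {"2024-06-15", "New York"}
predicted = {"2024-06-15", "New York", "The"}
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# The reported F1 values are consistent with the reported precision and recall.
print(round(f1_score(0.70, 0.65), 2))  # before
print(round(f1_score(0.78, 0.82), 2))  # after
```

Treating the extractions as sets ignores duplicates within a text; a per-document or span-level comparison would be a reasonable alternative.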

Bonus Experiment

Try using a named entity recognition (NER) model from a library like spaCy to extract dates and locations instead of pattern matching.

💡 Hint

Use spaCy's pre-trained English model and compare its extraction performance with your pattern-based method.
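A possible starting point for the bonus experiment is sketched below. It assumes spaCy and the `en_core_web_sm` model are installed, and that dates and places appear under the `DATE`, `GPE`, and `LOC` entity labels; the helper names `group_entities` and `extract_with_spacy` are hypothetical.

```python
from collections import defaultdict

# Entity labels assumed to cover dates and places in spaCy's English models.
WANTED_LABELS = {"DATE", "GPE", "LOC"}


def group_entities(doc, wanted=WANTED_LABELS):
    """Group entity texts by label for any object with spaCy-style `.ents`."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ in wanted:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


def extract_with_spacy(texts, model_name="en_core_web_sm"):
    """Run a pre-trained spaCy pipeline over the texts and group its entities."""
    import spacy  # imported lazily; requires spaCy and the model to be installed

    nlp = spacy.load(model_name)
    return [group_entities(doc) for doc in nlp.pipe(texts)]
```

Calling `extract_with_spacy(texts)` on the sample texts and scoring the output against the same gold labels as the pattern-based extractor gives a like-for-like comparison.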

Practice


1. What is the main purpose of information extraction patterns in NLP?

easy

A. To automatically find specific facts like names or dates in text

B. To translate text from one language to another

C. To generate new sentences from given words

D. To summarize long documents into short paragraphs
