
Information extraction patterns in NLP - ML Experiment: Train & Evaluate

Experiment - Information extraction patterns
Problem: Extract specific information such as dates and locations from text using pattern matching.
Current Metrics: Precision: 70%, Recall: 65%, F1-score: 67%
Issue: The extractor misses many relevant mentions and sometimes extracts wrong data because its patterns are too simple.
Your Task
Improve recall to at least 80% while keeping precision above 75% by refining extraction patterns.
Use only pattern-based extraction methods (no deep learning models).
Patterns must be explainable and simple to understand.
Solution
import re

# Sample texts
texts = [
    "The meeting is on 2024-06-15 in New York.",
    "We will travel to San Francisco on June 20th, 2024.",
    "Deadline: 15/06/2024, Location: Berlin.",
    "Event date: 2024/06/15, place: London."
]

# Improved patterns
# Date pattern to match YYYY-MM-DD, YYYY/MM/DD, DD/MM/YYYY, Month DDth, YYYY
date_pattern = re.compile(r"(\b\d{4}[-/]\d{2}[-/]\d{2}\b|\b\d{2}/\d{2}/\d{4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}(?:st|nd|rd|th)?,? \d{4}\b)", re.IGNORECASE)

# Location pattern to match capitalized words (simple heuristic for demo)
location_pattern = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)*)\b")

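# Month names must not be reported as locations
months = {"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"}
# Capitalized sentence starters and field labels that the location heuristic would otherwise pick up
non_locations = {"The", "We", "Deadline", "Location", "Event"}

extracted_info = []

for text in texts:
    dates = date_pattern.findall(text)
    locations = location_pattern.findall(text)
    # Filter locations to exclude months, known non-location words, and very short words
    filtered_locations = [
        loc for loc in locations
        if loc not in months and loc not in non_locations and len(loc) > 2
    ]
    extracted_info.append({"text": text, "dates": dates, "locations": filtered_locations})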

for info in extracted_info:
    print(f"Text: {info['text']}")
    print(f"Extracted Dates: {info['dates']}")
    print(f"Extracted Locations: {info['locations']}\n")
Expanded date pattern to cover multiple date formats including YYYY-MM-DD, DD/MM/YYYY, and Month DDth, YYYY.
Added case-insensitive matching for month names.
Filtered location matches to remove month names and very short words.
Used a simple heuristic for locations by matching capitalized words and multi-word names.
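The capitalized-word heuristic matches any capitalized word, including sentence starters and field labels. One possible further refinement, offered as a sketch and not as part of the original solution, is to accept a capitalized name only when it follows a context cue such as a preposition or a label. The cue list, the `MONTHS` filter, and the function name `extract_locations` are assumptions made for illustration.

```python
import re

# Assumed cue words: prepositions and field labels that often precede a place name.
# The cue part is case-insensitive; the name part stays case-sensitive.
CUE = r"(?i:\b(?:in|to|at|from)\s+|\b(?:location|place)\s*:\s*)"
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
cue_location_pattern = re.compile(CUE + NAME)

# Month names can follow a cue ("in June"), so they are still filtered out.
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def extract_locations(text):
    """Return capitalized names that directly follow one of the assumed cues."""
    return [name for name in cue_location_pattern.findall(text) if name not in MONTHS]


texts = [
    "The meeting is on 2024-06-15 in New York.",
    "We will travel to San Francisco on June 20th, 2024.",
    "Deadline: 15/06/2024, Location: Berlin.",
    "Event date: 2024/06/15, place: London.",
]

for text in texts:
    print(extract_locations(text))
```

Requiring a cue trades some recall for precision: a location mentioned without any of the listed cues is missed, so the cue list needs to be tuned against the evaluation data.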
Results Interpretation

Before: Precision: 70%, Recall: 65%, F1-score: 67%

After: Precision: 78%, Recall: 82%, F1-score: 80%

Covering more date formats raises recall by catching mentions the original patterns missed, and filtering out month names and very short words raises precision by removing false matches.
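The metrics above can be computed by comparing extracted items with hand-labeled gold items. The following is a minimal sketch; the gold and predicted sets are made-up illustration data, not the data behind the reported figures, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Compare extracted items with gold items using exact string match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical annotations for one sentence: "The" is a false positive.
gold = {"2024-06-15", "New York"}
predicted = {"2024-06-15", "New York", "The"}
p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# The reported F1 values are consistent with the reported precision and recall.
print(round(f1_score(0.70, 0.65), 2))  # 0.67
print(round(f1_score(0.78, 0.82), 2))  # 0.8
```

Exact string matching is the strictest scoring choice; partial-overlap scoring would give higher numbers for the same extractions.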
Bonus Experiment
Try using a named entity recognition (NER) model from a library like spaCy to extract dates and locations instead of pattern matching.
💡 Hint
Use spaCy's pre-trained English model and compare its extraction performance with your pattern-based method.
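A possible starting point for the bonus experiment is sketched below. It assumes spaCy is installed and the small English model `en_core_web_sm` has been downloaded; the helper names `load_english_model` and `extract_with_ner` are hypothetical. `DATE`, `GPE`, and `LOC` are entity labels used by spaCy's English models.

```python
def load_english_model():
    """Load spaCy's small English pipeline (assumes it has been downloaded)."""
    # Imported lazily so the rest of this sketch can be read without spaCy installed.
    import spacy
    return spacy.load("en_core_web_sm")


def extract_with_ner(nlp, text):
    """Run the pipeline on text and collect date and location entities."""
    doc = nlp(text)
    dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
    # GPE covers countries, cities, and states; LOC covers other named locations.
    locations = [ent.text for ent in doc.ents if ent.label_ in {"GPE", "LOC"}]
    return {"dates": dates, "locations": locations}
```

Calling `extract_with_ner(load_english_model(), text)` on each sample text yields dictionaries in the same shape as the pattern-based output, so both methods can be scored against the same gold annotations. The entities the model returns depend on the model version, so no expected output is shown here.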