Named entity recognition (NER) is used to find specific pieces of information in text, such as names or dates. Why is this considered extracting structured information?
Think about how NER tags parts of text with labels that computers can understand easily.
NER tags words or phrases with categories like 'person' or 'date'. This labeling turns messy text into organized data, which is structured information.
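To make this concrete, here is a minimal illustrative sketch (not a real NER system): even a simple regex-based tagger turns free text into (span, label) pairs, which is exactly the kind of structured record NER produces. The pattern and labels are assumptions for illustration only.

```python
import re

def extract_dates(text):
    # Tag ISO-style dates with a 'DATE' label, yielding structured
    # (text, label) tuples from unstructured prose.
    return [(m.group(), "DATE") for m in re.finditer(r"\d{4}-\d{2}-\d{2}", text)]

print(extract_dates("Meet Alice on 2024-05-01 in Paris."))  # [('2024-05-01', 'DATE')]
```

Real NER models go far beyond fixed patterns, but the output shape is the same: labeled spans a program can query directly.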
What is the output of this Python code using spaCy to extract entities?
```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple was founded by Steve Jobs in California.')
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```
Remember that 'Apple' is a company (organization), 'Steve Jobs' is a person, and 'California' is a geopolitical entity.
spaCy labels 'Apple' as an organization (ORG), 'Steve Jobs' as a person (PERSON), and 'California' as a geopolitical entity (GPE), so the code prints [('Apple', 'ORG'), ('Steve Jobs', 'PERSON'), ('California', 'GPE')].
You want to extract structured information from tweets that contain slang, misspellings, and emojis. Which model is best suited for this NER task?
Consider which model can understand context and adapt to informal language.
A BERT model fine-tuned on social media data handles slang and misspellings better than fixed rules or simple statistical models, because it uses subword tokens and surrounding context rather than exact string matches.
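A tiny sketch of why fixed rules break on noisy tweets: a dictionary lookup only fires on exact, clean strings, so a single misspelling defeats it. The lookup table here is a made-up example, not a real system.

```python
# Hypothetical rule-based tagger: exact dictionary lookup.
KNOWN_PEOPLE = {"steve jobs"}

def rule_based_person(span):
    # Returns 'PERSON' only on an exact (case-insensitive) match.
    return "PERSON" if span.lower() in KNOWN_PEOPLE else None

print(rule_based_person("Steve Jobs"))   # 'PERSON'
print(rule_based_person("stve jobsss"))  # None -- the misspelling defeats the rule
```

A fine-tuned BERT model, by contrast, sees "stve jobsss" as subword pieces in context and can often still recognize it as a person mention, which is why it is the better fit for informal text.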
An NER model predicted 80 entities correctly, missed 20 entities, and predicted 10 entities incorrectly. What is the F1 score?
Calculate precision and recall first, then use F1 = 2 * (precision * recall) / (precision + recall).
Precision = 80 / (80 + 10) = 0.8889, Recall = 80 / (80 + 20) = 0.8, F1 = 2 * 0.8889 * 0.8 / (0.8889 + 0.8) ≈ 0.842.
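The same calculation in a few lines of Python, using the counts from the question (80 true positives, 10 false positives, 20 false negatives):

```python
tp, fp, fn = 80, 10, 20  # correct, incorrect, and missed predictions

precision = tp / (tp + fp)  # 80 / 90 ≈ 0.8889
recall = tp / (tp + fn)     # 80 / 100 = 0.8
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.842
```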
You trained an NER model on news articles but it performs poorly on medical reports. What is the most likely reason?
Think about how domain differences affect model understanding.
Models trained on one domain often fail on very different domains because vocabulary and entity types differ, causing poor generalization.