NER with spaCy in NLP - ML Experiment: Train & Evaluate

Experiment - NER with spaCy
Problem: You are training a Named Entity Recognition (NER) model using spaCy to identify entities like names, locations, and organizations in text.
Current Metrics: Training accuracy: 98%, Validation accuracy: 75%
Issue: The model is overfitting: training accuracy is very high but validation accuracy is much lower.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.
You cannot change the dataset size or add more data.
You must use spaCy's built-in components and training methods.
You can adjust model hyperparameters and add regularization techniques.
Solution
import spacy
from spacy.training.example import Example
from spacy.util import minibatch, compounding
import random

# Load blank English model
nlp = spacy.blank('en')

# Add NER pipe
if 'ner' not in nlp.pipe_names:
    ner = nlp.add_pipe('ner')
else:
    ner = nlp.get_pipe('ner')

# Add labels (example labels)
labels = ['PERSON', 'ORG', 'GPE']
for label in labels:
    ner.add_label(label)

# Example training data
TRAIN_DATA = [
    ("Apple is looking at buying U.K. startup for $1 billion", {'entities': [(0, 5, 'ORG'), (27, 31, 'GPE')]}),
    ("San Francisco considers banning sidewalk delivery robots", {'entities': [(0, 13, 'GPE')]}),
    ("London is a big city in the United Kingdom.", {'entities': [(0, 6, 'GPE'), (31, 44, 'GPE')]})
]

# Disable other pipes during training (spaCy v3 API)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.select_pipes(disable=other_pipes):
    # initialize() replaces the deprecated begin_training() and returns the optimizer
    optimizer = nlp.initialize()
    # Set the learning rate; the optimizer exposes it as learn_rate, not alpha
    optimizer.learn_rate = 0.001

    for itn in range(30):
        random.shuffle(TRAIN_DATA)
        losses = {}
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 16.0, 1.001))
        for batch in batches:
            examples = []
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                examples.append(Example.from_dict(doc, annotations))
            nlp.update(
                examples,
                drop=0.3,  # Added dropout to reduce overfitting
                sgd=optimizer,
                losses=losses
            )
        if itn % 5 == 0:
            print(f"Iteration {itn}, Losses: {losses}")

# Evaluate on validation data
VALIDATION_DATA = [
    ("Google is opening a new office in New York", {'entities': [(0, 6, 'ORG'), (31, 39, 'GPE')]}),
    ("Amazon plans to hire more employees in Seattle", {'entities': [(0, 6, 'ORG'), (38, 45, 'GPE')]})
]

correct = 0
total_pred = 0
total_true = 0

for text, annotations in VALIDATION_DATA:
    doc = nlp(text)
    pred_ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    true_ents = annotations.get('entities')
    pred_set = set(pred_ents)
    true_set = set(true_ents)
    correct += len(pred_set & true_set)
    total_pred += len(pred_set)
    total_true += len(true_set)

precision = correct / total_pred if total_pred > 0 else 0
recall = correct / total_true if total_true > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(f"Validation Precision: {precision:.2f}")
print(f"Validation Recall: {recall:.2f}")
print(f"Validation F1 Score: {f1:.2f}")
Added dropout rate of 0.3 during training to reduce overfitting.
Lowered learning rate to 0.001 for more stable training.
Used minibatch with compounding batch sizes for better gradient updates.
Limited training to 30 iterations to avoid overtraining.
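To make the compounding batch size concrete, here is a minimal pure-Python sketch of the schedule: start at the first value, multiply by the compound factor after every batch, and never exceed the stop value. The name compounding_sketch is hypothetical; it imitates the behavior described above and is not spaCy's own implementation.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound


# First batch sizes for the settings used in the solution (4.0, 16.0, 1.001).
slow = list(islice(compounding_sketch(4.0, 16.0, 1.001), 3))
# A faster-growing schedule, shown only for comparison.
fast = list(islice(compounding_sketch(4.0, 16.0, 1.5), 5))

print(slow)
print(fast)
```

With a factor of 1.001 the batch size barely moves away from 4, so on a dataset as small as TRAIN_DATA every epoch is effectively a single batch. The schedule only matters once the training set is much larger than the starting batch size.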
Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 75% (high overfitting)

After: Training accuracy: ~90%, Validation F1 score: 86% (reduced overfitting, better generalization)

Adding dropout and lowering the learning rate helps reduce overfitting in NER models, improving validation performance while keeping training accuracy reasonable.
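Scores like these are only meaningful if the (start, end, label) character offsets in the annotations match the text exactly; a span that is off by a few characters is silently treated as a different entity by the set-based evaluation. A minimal sanity-check sketch using plain string search follows. The helper name find_span is an assumption made for illustration and is not part of spaCy.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    # The end offset is exclusive, matching the annotation format used above.
    return (start, start + len(phrase), label)


text = "London is a big city in the United Kingdom."
entities = [
    find_span(text, "London", "GPE"),
    find_span(text, "United Kingdom", "GPE"),
]

# Slicing the text with each span should give back the entity string.
for start, end, label in entities:
    print(text[start:end], label)
print(entities)
```

Deriving offsets this way, or at least slicing the text with hand-written offsets before training, catches misaligned annotations early.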
Bonus Experiment
Try using early stopping by monitoring validation loss and stopping training when it stops improving.
Hint: Implement a simple loop that saves the best model and stops training if the validation loss does not improve for several iterations.
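One way to structure such a loop is sketched below. It is deliberately independent of spaCy: train_one_epoch and evaluate are hypothetical callbacks that you would replace with your own nlp.update loop and validation-loss computation, and the loss values are made up purely to show the control flow.

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=30, patience=3):
    """Train for up to max_epochs, stopping once the validation loss has not
    improved for `patience` consecutive epochs.

    Returns (best_epoch, best_loss).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        val_loss = evaluate(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
            # In a real run, save the model here, e.g. nlp.to_disk("best_model").
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss


# Stand-in callbacks with invented validation losses.
fake_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
visited = []
best_epoch, best_loss = train_with_early_stopping(
    train_one_epoch=visited.append,
    evaluate=lambda epoch: fake_losses[epoch],
    max_epochs=len(fake_losses),
    patience=3,
)
print(best_epoch, best_loss, visited)
```

Note that the run stops after epoch 5 and never reaches the lower loss at epoch 6. That is the usual trade-off with a patience setting: a larger value wastes more epochs but is less likely to stop before a late improvement.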

Practice

1. What does NER (Named Entity Recognition) do in natural language processing?
easy
A. It generates new text based on input prompts.
B. It translates text from one language to another.
C. It summarizes long documents into short paragraphs.
D. It finds and labels important names and terms in text automatically.

Solution

  1. Step 1: Understand NER's purpose

    NER identifies specific names like people, places, or organizations in text.
  2. Step 2: Compare with other NLP tasks

    Translation, summarization, and text generation are different tasks than NER.
  3. Final Answer:

    It finds and labels important names and terms in text automatically. -> Option D
  4. Quick Check:

    NER = Finds names and terms [OK]
Hint: NER extracts names and terms, not translations or summaries
Common Mistakes:
  • Confusing NER with translation or summarization
  • Thinking NER generates new text
  • Believing NER only finds keywords, not named entities
2. Which of the following is the correct way to load a pre-trained spaCy model for NER?
easy
A. import spacy; nlp = spacy.load('en_core_web_sm')
B. import spacy; nlp = spacy.model('en_core_web_sm')
C. import spacy; nlp = spacy.load_model('en_core_web_sm')
D. import spacy; nlp = spacy.get('en_core_web_sm')

Solution

  1. Step 1: Recall spaCy model loading syntax

    spaCy uses spacy.load('model_name') to load pre-trained models.
  2. Step 2: Check each option

    Only option A calls spacy.load, which is the function spaCy provides for loading a pipeline by name; spacy.model, spacy.load_model, and spacy.get are not spaCy functions.
  3. Final Answer:

    import spacy; nlp = spacy.load('en_core_web_sm') -> Option A
  4. Quick Check:

    spaCy model loading = spacy.load() [OK]
Hint: Use spacy.load('model_name') to load models
Common Mistakes:
  • Using spacy.model or spacy.load_model which don't exist
  • Trying spacy.get which is not a spaCy function
  • Forgetting to import spacy before loading
3. Given this code snippet using spaCy for NER:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

What will be the output?
medium
A. [('Apple', 'PERSON'), ('U.K.', 'ORG'), ('$1 billion', 'QUANTITY')]
B. [('Apple', 'ORG'), ('startup', 'ORG'), ('$1 billion', 'MONEY')]
C. [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]
D. [('Apple', 'GPE'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]

Solution

  1. Step 1: Understand spaCy NER labels

    Apple is recognized as an organization (ORG), U.K. as geopolitical entity (GPE), and $1 billion as money (MONEY).
  2. Step 2: Match entities with labels

    [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')] correctly matches these entities and labels as spaCy outputs.
  3. Final Answer:

    [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')] -> Option C
  4. Quick Check:

    spaCy NER output matches [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')] [OK]
Hint: Check spaCy's common entity labels for correct matches
Common Mistakes:
  • Confusing ORG with PERSON or GPE
  • Mislabeling MONEY as QUANTITY
  • Including words like 'startup' as entities
4. You run this code but get an error:
import spacy
doc = nlp('Google is a tech giant')

What is the most likely cause?
medium
A. spaCy does not support the word 'Google'.
B. The variable 'nlp' is not defined before use.
C. The text input is too short for NER.
D. Missing parentheses in the print statement.

Solution

  1. Step 1: Check variable definitions

    The code uses 'nlp' without defining it by loading a spaCy model first.
  2. Step 2: Identify error cause

    This causes a NameError because 'nlp' is undefined.
  3. Final Answer:

    The variable 'nlp' is not defined before use. -> Option B
  4. Quick Check:

    Undefined variable 'nlp' causes error [OK]
Hint: Always load a model with spacy.load before using nlp
Common Mistakes:
  • Assuming text length causes error
  • Thinking spaCy can't recognize common words
  • Confusing print syntax errors with variable errors
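The failure can be reproduced without installing any model, because Python raises the NameError before spaCy is ever involved. A small sketch:

```python
# `nlp` is deliberately never assigned, mirroring the snippet in the question.
try:
    doc = nlp("Google is a tech giant")
except NameError as err:
    message = str(err)
    print(message)
```

The fix is to assign nlp before calling it, for example with nlp = spacy.load('en_core_web_sm').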
5. You want to extract only person names from a text using spaCy's NER. Which code snippet correctly filters for persons?
hard
A. persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
B. persons = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
C. persons = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
D. persons = [ent.text for ent in doc.ents if ent.label_ == 'MONEY']

Solution

  1. Step 1: Identify label for persons in spaCy

    spaCy uses 'PERSON' label for people names.
  2. Step 2: Filter entities by 'PERSON'

    Filtering doc.ents by ent.label_ == 'PERSON' extracts only person names.
  3. Final Answer:

    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON'] -> Option A
  4. Quick Check:

    Filter entities by 'PERSON' label [OK]
Hint: Filter entities with label_ == 'PERSON' to get names
Common Mistakes:
  • Using wrong labels like ORG or GPE for persons
  • Not filtering entities at all
  • Confusing entity text with label
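The same filter generalizes to grouping every entity by its label. The sketch below works on plain (text, label) pairs so that it runs without a model. The first three pairs are taken from question 3; the PERSON entry is made up for illustration. With a real doc, the pairs would come from [(ent.text, ent.label_) for ent in doc.ents].

```python
from collections import defaultdict

# (text, label) pairs; the last one is a hypothetical PERSON entity.
pairs = [
    ("Apple", "ORG"),
    ("U.K.", "GPE"),
    ("$1 billion", "MONEY"),
    ("Tim Cook", "PERSON"),
]

# Collect entity texts under their label.
by_label = defaultdict(list)
for text, label in pairs:
    by_label[label].append(text)

persons = by_label["PERSON"]
print(persons)
print(dict(by_label))
```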