NLPml~20 mins

NER with spaCy in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - NER with spaCy

Problem:You are training a Named Entity Recognition (NER) model using spaCy to identify entities like names, locations, and organizations in text.

Current Metrics:Training accuracy: 98%, Validation accuracy: 75%

Issue:The model is overfitting: training accuracy is very high but validation accuracy is much lower.

Your Task

Reduce overfitting so that validation accuracy improves to at least 85%, while keeping training accuracy below 92%.

You cannot change the dataset size or add more data.

You must use spaCy's built-in components and training methods.

You can adjust model hyperparameters and add regularization techniques.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

NLP

import spacy
from spacy.training.example import Example
from spacy.util import minibatch, compounding
import random

# Load blank English model
nlp = spacy.blank('en')

# Add NER pipe
if 'ner' not in nlp.pipe_names:
    ner = nlp.add_pipe('ner')
else:
    ner = nlp.get_pipe('ner')

# Add labels (example labels)
labels = ['PERSON', 'ORG', 'GPE']
for label in labels:
    ner.add_label(label)

# Example training data
TRAIN_DATA = [
    ("Apple is looking at buying U.K. startup for $1 billion", {'entities': [(0, 5, 'ORG'), (27, 31, 'GPE')]}),
    ("San Francisco considers banning sidewalk delivery robots", {'entities': [(0, 13, 'GPE')]}),
    ("London is a big city in the United Kingdom.", {'entities': [(0, 6, 'GPE'), (31, 44, 'GPE')]})
]

# Disable other pipes during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    # Adjust optimizer parameters for lower learning rate
    optimizer.alpha = 0.001

    for itn in range(30):
        random.shuffle(TRAIN_DATA)
        losses = {}
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 16.0, 1.001))
        for batch in batches:
            examples = []
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                examples.append(Example.from_dict(doc, annotations))
            nlp.update(
                examples,
                drop=0.3,  # Added dropout to reduce overfitting
                sgd=optimizer,
                losses=losses
            )
        if itn % 5 == 0:
            print(f"Iteration {itn}, Losses: {losses}")

# Evaluate on validation data
VALIDATION_DATA = [
    ("Google is opening a new office in New York", {'entities': [(0, 6, 'ORG'), (31, 39, 'GPE')]}),
    ("Amazon plans to hire more employees in Seattle", {'entities': [(0, 6, 'ORG'), (38, 45, 'GPE')]})
]

correct = 0
total_pred = 0
total_true = 0

for text, annotations in VALIDATION_DATA:
    doc = nlp(text)
    pred_ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    true_ents = annotations.get('entities')
    pred_set = set(pred_ents)
    true_set = set(true_ents)
    correct += len(pred_set & true_set)
    total_pred += len(pred_set)
    total_true += len(true_set)

precision = correct / total_pred if total_pred > 0 else 0
recall = correct / total_true if total_true > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(f"Validation Precision: {precision:.2f}")
print(f"Validation Recall: {recall:.2f}")
print(f"Validation F1 Score: {f1:.2f}")

Added dropout rate of 0.3 during training to reduce overfitting.

Lowered learning rate to 0.001 for more stable training.

Used minibatch with compounding batch sizes for better gradient updates.

Limited training to 30 iterations to avoid overtraining.

Results Interpretation

Before: Training accuracy: 98%, Validation accuracy: 75% (high overfitting)

After: Training accuracy: ~90%, Validation F1 score: 86% (reduced overfitting, better generalization)

Adding dropout and lowering the learning rate helps reduce overfitting in NER models, improving validation performance while keeping training accuracy reasonable.

Bonus Experiment

Try using early stopping by monitoring validation loss and stopping training when it stops improving.

💡 Hint

Implement a simple loop to save the best model and stop training if validation loss does not improve for several iterations.

Practice

(1/5)

1. What does NER (Named Entity Recognition) do in natural language processing?

easy

A. It generates new text based on input prompts.

B. It translates text from one language to another.

C. It summarizes long documents into short paragraphs.

D. It finds and labels important names and terms in text automatically.

NER with spaCy in NLP - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand NER's purpose

Step 2: Compare with other NLP tasks

Final Answer:

Quick Check:

Solution

Step 1: Recall spaCy model loading syntax

Step 2: Check each option

Final Answer:

Quick Check:

Solution

Step 1: Understand spaCy NER labels

Step 2: Match entities with labels

Final Answer:

Quick Check:

Solution

Step 1: Check variable definitions

Step 2: Identify error cause

Final Answer:

Quick Check:

Solution

Step 1: Identify label for persons in spaCy

Step 2: Filter entities by 'PERSON'

Final Answer:

Quick Check: