How to Train a Custom NER Model with spaCy in NLP
To train a custom Named Entity Recognition (NER) model in spaCy, prepare labeled training data with entity annotations, create a blank pipeline or load an existing one, add the ner component, and train by looping over your data and applying optimizer updates. Finally, save the trained model and test its predictions.
Syntax
The main steps to train a custom NER model in spaCy include:
- nlp = spacy.blank('en'): Create a blank English pipeline, or load an existing one with spacy.load().
- ner = nlp.add_pipe('ner'): Add the NER component to the pipeline.
- ner.add_label('LABEL'): Add your custom entity labels.
- optimizer = nlp.initialize(): Initialize the model weights and get the optimizer. In spaCy v3 this replaces nlp.begin_training(), which is deprecated.
- nlp.update(examples, drop=0.5, sgd=optimizer, losses=losses): Train the model by updating it with a batch of Example objects.
- nlp.to_disk('model_path'): Save the trained model to disk.
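```python
import spacy
from spacy.training import Example

# Placeholder training data: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [("Apple is hiring", {"entities": [(0, 5, "ORG")]})]

# Create a blank English pipeline
nlp = spacy.blank('en')

# Add the NER component
ner = nlp.add_pipe('ner')

# Add custom labels
ner.add_label('ORG')

# Initialize the weights and get the optimizer
# (spaCy v3 name; formerly nlp.begin_training())
optimizer = nlp.initialize()

# Example training loop (simplified)
for itn in range(10):
    losses = {}
    for text, annotations in TRAIN_DATA:
        # spaCy v3 expects Example objects, not raw (text, annotations) pairs
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], drop=0.5, sgd=optimizer, losses=losses)
    print(losses)

# Save the trained pipeline
nlp.to_disk('custom_ner_model')
```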
Example
This example shows how to train a custom NER model with spaCy on a small dataset with one label ANIMAL. It trains the model for 20 iterations and tests it on a sample sentence.
```python
import spacy
from spacy.training.example import Example

# Training data: text and entities with start/end character positions and label.
# The end offset is exclusive, so text[start:end] must equal the entity text.
TRAIN_DATA = [
    ("I have a dog", {"entities": [(9, 12, "ANIMAL")]}),
    ("She owns a cat", {"entities": [(11, 14, "ANIMAL")]}),
    ("They saw a rabbit", {"entities": [(11, 17, "ANIMAL")]}),
]

# Create a blank English pipeline
nlp = spacy.blank("en")

# Add the NER pipe
ner = nlp.add_pipe("ner")

# Add the label
ner.add_label("ANIMAL")

# Disable all other pipes during training
# (select_pipes is the spaCy v3 name for disable_pipes)
with nlp.select_pipes(enable="ner"):
    # initialize() is the spaCy v3 name for begin_training()
    optimizer = nlp.initialize()
    for i in range(20):
        losses = {}
        for text, annotations in TRAIN_DATA:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], drop=0.35, sgd=optimizer, losses=losses)
        print(f"Iteration {i+1}, Losses: {losses}")

# Test the trained model
test_text = "My neighbor has a dog and a cat"
doc = nlp(test_text)
print(f"Entities in '{test_text}':")
for ent in doc.ents:
    print(ent.text, ent.label_)
```
Output
Iteration 1, Losses: {'ner': 5.123456}
Iteration 2, Losses: {'ner': 3.987654}
...
Iteration 20, Losses: {'ner': 0.123456}
Entities in 'My neighbor has a dog and a cat':
dog ANIMAL
cat ANIMAL
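Hand-counted character offsets are a frequent source of silent errors in data shaped like TRAIN_DATA. As a hedged sketch, the offsets can be derived from the entity phrases instead of typed in. The helper name annotate is hypothetical and not part of spaCy; it uses only the standard library and assumes each phrase should be matched as a whole word.

```python
import re


def annotate(text, phrases_to_labels):
    """Build a (text, {"entities": [...]}) tuple by locating each phrase in text.

    Hypothetical helper for illustration; not a spaCy function.
    """
    entities = []
    for phrase, label in phrases_to_labels.items():
        # \b keeps matches on whole words, so "cat" does not match inside "category"
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text):
            # match.end() is exclusive, matching the (start, end, label) convention
            entities.append((match.start(), match.end(), label))
    return (text, {"entities": sorted(entities)})


TRAIN_DATA = [
    annotate("I have a dog", {"dog": "ANIMAL"}),
    annotate("She owns a cat", {"cat": "ANIMAL"}),
    annotate("They saw a rabbit", {"rabbit": "ANIMAL"}),
]

for item in TRAIN_DATA:
    print(item)
```

Because the offsets come from the regex match, text[start:end] always equals the annotated phrase, which is the property the training data needs.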
Common Pitfalls
Common mistakes when training custom NER in spaCy include:
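- Not adding new labels to the ner component before training.
- Updating the model without disabling other pipeline components. nlp.update() trains every enabled trainable component, so in a loaded pipeline this can change those components unintentionally or cause errors.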
- Incorrectly formatting training data; entity offsets must be exact character positions.
- Training for too few iterations or with too small a dataset, leading to poor results.
- Not saving the model after training, losing the trained weights.
```python
import spacy

nlp = spacy.blank('en')
ner = nlp.add_pipe('ner')

# Wrong: starting training without adding the label first
# (this will cause errors or no learning)

# Right: add the label before initializing the model
ner.add_label('ANIMAL')

# Also disable other pipes during training
# (select_pipes and initialize are the spaCy v3 names
# for disable_pipes and begin_training)
with nlp.select_pipes(enable='ner'):
    optimizer = nlp.initialize()
    # training code here
```
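The offset pitfall listed above can be caught before training with a quick sanity check. The sketch below is a plain-Python heuristic, not spaCy's own alignment logic: find_misaligned is a hypothetical helper that flags spans that are out of range, start or end with whitespace, or cut through a word.

```python
def find_misaligned(train_data):
    """Return (text, entity, sliced_text) for every suspicious entity span.

    Heuristic only: spaCy itself aligns spans to token boundaries,
    which this character-level check merely approximates.
    """
    problems = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            # Reject spans that fall outside the text or are empty
            if not (0 <= start < end <= len(text)):
                problems.append((text, (start, end, label), None))
                continue
            span = text[start:end]
            # Characters just outside the span; treat the text edges as spaces
            before = text[start - 1] if start > 0 else " "
            after = text[end] if end < len(text) else " "
            cuts_word = before.isalnum() or after.isalnum()
            has_edge_space = span != span.strip()
            if cuts_word or has_edge_space:
                problems.append((text, (start, end, label), span))
    return problems


sample = [
    ("I have a dog", {"entities": [(7, 10, "ANIMAL")]}),  # slices "a d"
    ("I have a dog", {"entities": [(9, 12, "ANIMAL")]}),  # slices "dog"
]
for text, entity, span in find_misaligned(sample):
    print(f"Suspicious span {entity} in {text!r}: {span!r}")
```

Within spaCy, Doc.char_span() offers a stricter check, since it returns None when the given offsets do not line up with token boundaries.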
Quick Reference
Tips for training custom NER with spaCy:
- Always prepare training data as tuples of (text, {"entities": [(start, end, label)]}).
- Add all new entity labels to the ner pipe before training.
- Use nlp.select_pipes() (called nlp.disable_pipes() before spaCy v3) to disable other components during training for efficiency.
- Train for multiple iterations and monitor loss to ensure learning.
- Save your model with nlp.to_disk() and load it later with spacy.load().
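The save-and-load tip can be sketched as a round trip. This assumes spaCy v3 is installed; the temporary directory and the untrained ANIMAL pipeline are stand-ins for illustration, where a real project would save a trained pipeline to a permanent path.

```python
import tempfile
from pathlib import Path

import spacy

# Stand-in pipeline: initialized but not trained (illustration only)
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("ANIMAL")
nlp.initialize()

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "custom_ner_model"
    # Writes the config, vocabulary, and component weights to the directory
    nlp.to_disk(model_dir)
    # spacy.load() accepts a directory path as well as an installed package name
    reloaded = spacy.load(model_dir)

# The reloaded pipeline keeps its components and their labels
print(reloaded.pipe_names)
print(reloaded.get_pipe("ner").labels)
```

Loading from the directory restores the same components and labels, so the reloaded object can be called on text exactly like the original nlp.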
Key Takeaways
Prepare training data with exact entity character offsets and labels.
Add custom entity labels to the NER component before training.
Disable other pipeline components during training for better performance.
Train the model over multiple iterations and monitor loss values.
Save and load your trained model to reuse it for predictions.
