How to Perform Named Entity Recognition (NER) Using NLTK in NLP
To perform Named Entity Recognition (NER) with NLTK, first tokenize the text and tag it with part-of-speech tags, then apply nltk.ne_chunk() to identify named entities. This function returns a tree structure with entities such as persons, organizations, and locations.
Syntax
To perform NER with NLTK, you use the following steps:
- nltk.word_tokenize(text): splits text into words.
- nltk.pos_tag(tokens): tags each word with its part of speech.
- nltk.ne_chunk(tagged_tokens): identifies named entities from tagged tokens.
The ne_chunk function returns a tree where named entities are grouped as subtrees labeled with entity types like PERSON, ORGANIZATION, or GPE (geopolitical entity).
```python
import nltk

# Step 1: Tokenize text
text = "Apple is looking at buying U.K. startup for $1 billion"
tokens = nltk.word_tokenize(text)

# Step 2: POS tagging
tagged_tokens = nltk.pos_tag(tokens)

# Step 3: Named Entity Recognition
named_entities = nltk.ne_chunk(tagged_tokens)

# repr() shows the nested Tree(...) form; print(named_entities) alone
# prints a bracketed (S ...) form instead
print(repr(named_entities))
```
Output
Tree('S', [Tree('ORGANIZATION', [('Apple', 'NNP')]), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ('buying', 'VBG'), Tree('GPE', [('U.K.', 'NNP')]), ('startup', 'NN'), ('for', 'IN'), ('$', '$'), ('1', 'CD'), ('billion', 'CD')])
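The tree returned by ne_chunk mixes plain (word, tag) tuples with labeled subtrees, which can be awkward when a flat list is expected. The sketch below is one way to flatten it into (word, tag, IOB label) triples. The function name `tree_to_iob` is hypothetical, and the code assumes entity subtrees sit directly under the root and contain only (word, tag) leaves, as in the output above.

```python
def tree_to_iob(tree):
    """Flatten a chunk tree into (word, pos_tag, iob_label) triples."""
    rows = []
    for node in tree:
        if hasattr(node, "label"):
            # Subtree: one named entity, possibly spanning several tokens.
            for i, (word, tag) in enumerate(node):
                prefix = "B" if i == 0 else "I"
                rows.append((word, tag, f"{prefix}-{node.label()}"))
        else:
            # Plain leaf: a (word, pos_tag) tuple outside any entity.
            word, tag = node
            rows.append((word, tag, "O"))
    return rows
```

With a real result, the call would be `tree_to_iob(nltk.ne_chunk(tagged_tokens))`. NLTK also ships `nltk.chunk.tree2conlltags`, which performs a similar conversion and is likely the better choice in practice; the hand-written version is only meant to show how the tree is laid out.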
Example
This example shows how to extract named entities from a sentence using NLTK's built-in functions.
```python
import nltk

# Download required NLTK data if not already present.
# Newer NLTK releases may ask for differently named resources
# (for example 'punkt_tab'); the LookupError message names the one to download.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

text = "Barack Obama was born in Hawaii and was the 44th President of the United States."

# Tokenize
tokens = nltk.word_tokenize(text)

# POS tagging
tagged = nltk.pos_tag(tokens)

# Named Entity Recognition
entities = nltk.ne_chunk(tagged)

# Print named entities with labels
for subtree in entities:
    # Entity subtrees have a label(); plain (word, tag) tuples do not
    if hasattr(subtree, 'label'):
        # Join the words of a multi-word entity such as "Barack Obama"
        entity_name = ' '.join(c[0] for c in subtree)
        entity_type = subtree.label()
        print(f"{entity_name}: {entity_type}")
```
Output
Barack Obama: PERSON
Hawaii: GPE
United States: GPE
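When the entities are needed as data rather than printed lines, the same traversal can collect them into a dictionary keyed by entity type. This is a minimal sketch under the same assumption as the loop above (entity subtrees directly under the root); the name `entities_by_type` is hypothetical and not part of NLTK.

```python
from collections import defaultdict


def entities_by_type(tree):
    """Group entity names by label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for node in tree:
        # Only entity subtrees expose label(); plain (word, tag) tuples are skipped.
        if hasattr(node, "label"):
            # Join the words of a multi-word entity into one string.
            name = " ".join(word for word, _tag in node)
            if name not in grouped[node.label()]:
                grouped[node.label()].append(name)
    return dict(grouped)
```

Called as `entities_by_type(nltk.ne_chunk(tagged))`, it returns a mapping from each label to the entity names found under it, so a repeated mention of the same entity appears only once.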
Common Pitfalls
Common mistakes when doing NER with NLTK include:
- Passing raw text to pos_tag, or splitting only on whitespace, instead of tokenizing with word_tokenize; whitespace splitting leaves punctuation attached to words.
- Forgetting to download required NLTK data packages such as maxent_ne_chunker and words.
- Expecting ne_chunk to return a simple list instead of a tree structure.
- Not handling multi-word entities properly when extracting names from the tree.
Always check that you have downloaded all necessary NLTK models and process text in the correct order: tokenize, POS tag, then chunk.
```python
import nltk
from nltk.tokenize import word_tokenize

text = "Google is a tech giant."

# Naive: whitespace splitting runs, but leaves punctuation attached ("giant.")
tagged = nltk.pos_tag(text.split())

# Better: word_tokenize separates punctuation into its own token
tokens = word_tokenize(text)
tagged_correct = nltk.pos_tag(tokens)

print(tagged)
print(tagged_correct)
```
Output
[('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('tech', 'NN'), ('giant.', 'NN')]
[('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('tech', 'NN'), ('giant', 'NN'), ('.', '.')]
Quick Reference
Summary tips for NER with NLTK:
- Always tokenize text with word_tokenize().
- Use pos_tag() to add part-of-speech tags before NER.
- Call ne_chunk() to get named entities as a tree.
- Extract entities by checking for subtrees with label().
- Download required NLTK data packages once using nltk.download().
| Step | Function | Purpose |
|---|---|---|
| 1 | word_tokenize(text) | Split text into words |
| 2 | pos_tag(tokens) | Tag words with parts of speech |
| 3 | ne_chunk(tagged_tokens) | Identify named entities |
| 4 | subtree.label() | Get entity type from tree |
| - | nltk.download('package') | Download required models |
Key Takeaways
Use nltk.word_tokenize, pos_tag, then ne_chunk in sequence for NER.
ne_chunk returns a tree; extract entities by checking subtree labels.
Always download necessary NLTK data packages before running NER.
Tokenization and POS tagging are essential preprocessing steps.
Handle multi-word entities by joining tokens from subtrees.
