How to Perform Named Entity Recognition (NER) Using NLTK in NLP

To perform Named Entity Recognition (NER) with NLTK, tokenize the text, tag the tokens with their parts of speech, then pass the tagged tokens to nltk.ne_chunk(). This function returns a tree in which entities such as persons, organizations, and locations appear as labeled subtrees.

Syntax

To perform NER with NLTK, you use the following steps:

  • nltk.word_tokenize(text): splits text into words.
  • nltk.pos_tag(tokens): tags each word with its part of speech.
  • nltk.ne_chunk(tagged_tokens): identifies named entities from tagged tokens.

The ne_chunk function returns a tree where named entities are grouped as subtrees labeled with entity types like PERSON, ORGANIZATION, or GPE (geopolitical entity).

python
import nltk

# Step 1: Tokenize text
text = "Apple is looking at buying U.K. startup for $1 billion"
tokens = nltk.word_tokenize(text)

# Step 2: POS tagging
tagged_tokens = nltk.pos_tag(tokens)

# Step 3: Named Entity Recognition
named_entities = nltk.ne_chunk(tagged_tokens)

# repr() shows the nested Tree(...) form; plain print() uses bracketed notation
print(repr(named_entities))
Output
Tree('S', [Tree('ORGANIZATION', [('Apple', 'NNP')]), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ('buying', 'VBG'), Tree('GPE', [('U.K.', 'NNP')]), ('startup', 'NN'), ('for', 'IN'), ('$', '$'), ('1', 'CD'), ('billion', 'CD')])
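
To make the tree structure more concrete, here is a minimal sketch that builds a small tree by hand in the same shape ne_chunk returns, so it runs without any downloaded models. The tree contents are illustrative and are not real tagger output.

```python
from nltk.tree import Tree

# Hand-built stand-in for ne_chunk output (illustrative, not real model output).
sample_tree = Tree('S', [
    Tree('ORGANIZATION', [('Apple', 'NNP')]),
    ('is', 'VBZ'),
    ('buying', 'VBG'),
    Tree('GPE', [('U.K.', 'NNP')]),
    ('startup', 'NN'),
])

# The root is labeled 'S'; leaves() flattens every (word, tag) pair in order.
root_label = sample_tree.label()
all_words = [word for word, tag in sample_tree.leaves()]

# Children are either Tree objects (entities) or plain (word, tag) tuples.
entity_labels = [child.label() for child in sample_tree if isinstance(child, Tree)]

print(root_label)     # S
print(all_words)      # ['Apple', 'is', 'buying', 'U.K.', 'startup']
print(entity_labels)  # ['ORGANIZATION', 'GPE']
```

Checking isinstance(child, Tree) is one way to tell entities apart from ordinary tokens; the hasattr(subtree, 'label') check works for the same reason, because plain tuples have no label() method.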

Example

This example shows how to extract named entities from a sentence using NLTK's built-in functions.

python
import nltk

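# Download required NLTK data packages (skipped if already up to date)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Note: recent NLTK releases may ask for differently named packages
# (e.g. 'punkt_tab'); the LookupError message names the one to download.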

text = "Barack Obama was born in Hawaii and was the 44th President of the United States."

# Tokenize
tokens = nltk.word_tokenize(text)

# POS tagging
tagged = nltk.pos_tag(tokens)

# Named Entity Recognition
entities = nltk.ne_chunk(tagged)

# Print named entities with labels
for subtree in entities:
    if hasattr(subtree, 'label'):
        entity_name = ' '.join(c[0] for c in subtree)
        entity_type = subtree.label()
        print(f"{entity_name}: {entity_type}")
Output
Barack Obama: PERSON
Hawaii: GPE
United States: GPE
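
When the entities are needed for further processing rather than printing, one option is to group them by type. The sketch below assumes a hypothetical helper named group_entities (not part of NLTK) and uses a hand-built tree in place of real ne_chunk output, so it runs without downloaded models.

```python
from collections import defaultdict

from nltk.tree import Tree


def group_entities(chunked):
    """Map each entity label to the list of entity strings found under it."""
    grouped = defaultdict(list)
    for node in chunked:
        # Only Tree children are entities; plain (word, tag) tuples are skipped.
        if isinstance(node, Tree):
            name = ' '.join(word for word, tag in node.leaves())
            grouped[node.label()].append(name)
    return dict(grouped)


# Hand-built stand-in for ne_chunk output (illustrative, not real model output).
sample = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP'), ('Obama', 'NNP')]),
    ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'),
    Tree('GPE', [('Hawaii', 'NNP')]),
    ('and', 'CC'), ('led', 'VBD'), ('the', 'DT'),
    Tree('GPE', [('United', 'NNP'), ('States', 'NNPS')]),
])

print(group_entities(sample))
# {'PERSON': ['Barack Obama'], 'GPE': ['Hawaii', 'United States']}
```

The same function should accept the tree returned by nltk.ne_chunk(tagged), since it relies only on label() and leaves().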

Common Pitfalls

Common mistakes when doing NER with NLTK include:

  • Not tokenizing text before POS tagging, which produces errors or wrong tags (pos_tag expects a list of tokens, not a raw string).
  • Forgetting to download required NLTK data packages like maxent_ne_chunker and words.
  • Expecting ne_chunk to return a simple list instead of a tree structure.
  • Not handling multi-word entities properly when extracting names from the tree.

Always check that you have downloaded all necessary NLTK models and process text in the correct order: tokenize, POS tag, then chunk.

python
import nltk

# Fragile: whitespace splitting leaves punctuation attached to words ('giant.')
text = "Google is a tech giant."
tagged = nltk.pos_tag(text.split())

# Right: proper tokenization
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
tagged_correct = nltk.pos_tag(tokens)

print(tagged)
print(tagged_correct)
Output
[('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('tech', 'NN'), ('giant.', 'NN')]
[('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('tech', 'NN'), ('giant', 'NN'), ('.', '.')]
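
If a flat list is easier to work with than a tree, NLTK's tree2conlltags function can convert the chunked tree into (word, POS tag, IOB tag) triples, which also keeps multi-word entities identifiable through their B- and I- prefixes. This is a small sketch on a hand-built tree (illustrative, not real model output), so it runs without downloaded models.

```python
from nltk.chunk import tree2conlltags
from nltk.tree import Tree

# Hand-built stand-in for ne_chunk output (illustrative, not real model output).
sample = Tree('S', [
    Tree('ORGANIZATION', [('Google', 'NNP')]),
    ('is', 'VBZ'),
    ('based', 'VBN'),
    ('in', 'IN'),
    Tree('GPE', [('Mountain', 'NNP'), ('View', 'NNP')]),
])

# B- marks the first token of an entity, I- a continuation, O a non-entity token.
iob_rows = tree2conlltags(sample)
for word, pos, iob in iob_rows:
    print(word, pos, iob)
# Google NNP B-ORGANIZATION
# is VBZ O
# based VBN O
# in IN O
# Mountain NNP B-GPE
# View NNP I-GPE
```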

Quick Reference

Summary tips for NER with NLTK:

  • Always tokenize text with word_tokenize().
  • Use pos_tag() to add part-of-speech tags before NER.
  • Call ne_chunk() to get named entities as a tree.
  • Extract entities by checking for subtrees with label().
  • Download required NLTK data packages once using nltk.download().

Step | Function                  | Purpose
1    | word_tokenize(text)       | Split text into words
2    | pos_tag(tokens)           | Tag words with parts of speech
3    | ne_chunk(tagged_tokens)   | Identify named entities
4    | subtree.label()           | Get entity type from tree
-    | nltk.download('package')  | Download required models
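
To download data packages only once, one possible approach is to check which ones are already installed before calling nltk.download(). The sketch below uses nltk.data.find, which raises LookupError when a resource is absent. The helper missing_packages and the REQUIRED mapping are hypothetical names, and the resource paths are assumptions that match the package names used in this article; other NLTK releases may use different names.

```python
import nltk

# Assumed mapping from package name to resource path for the packages used
# in this article; adjust it if your NLTK release expects other names.
REQUIRED = {
    'punkt': 'tokenizers/punkt',
    'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
    'maxent_ne_chunker': 'chunkers/maxent_ne_chunker',
    'words': 'corpora/words',
}


def missing_packages(required=REQUIRED, finder=nltk.data.find):
    """Return the package names whose resource path is not found locally."""
    missing = []
    for package, path in required.items():
        try:
            finder(path)
        except LookupError:
            # The finder signals an absent resource with LookupError.
            missing.append(package)
    return missing
```

A typical use would be to loop over missing_packages() and pass each name to nltk.download(). The finder parameter exists only so the check can be exercised without touching the real data directory.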

Key Takeaways

Use nltk.word_tokenize, pos_tag, then ne_chunk in sequence for NER.
ne_chunk returns a tree; extract entities by checking subtree labels.
Always download necessary NLTK data packages before running NER.
Tokenization and POS tagging are essential preprocessing steps.
Handle multi-word entities by joining tokens from subtrees.