How to Perform Named Entity Recognition (NER) Using NLTK in NLP

To do Named Entity Recognition (NER) using NLTK, first tokenize the text and tag the tokens with their parts of speech, then pass the tagged tokens to nltk.ne_chunk(). The function returns a tree in which each recognized entity is a subtree labeled with its type, such as a person, organization, or location.

Syntax

NER with NLTK uses three functions, called in this order:

nltk.word_tokenize(text): splits text into words.
nltk.pos_tag(tokens): tags each word with its part of speech.
nltk.ne_chunk(tagged_tokens): identifies named entities from tagged tokens.

The ne_chunk function returns a tree where named entities are grouped as subtrees labeled with entity types like PERSON, ORGANIZATION, or GPE (geopolitical entity).

python

import nltk

# Step 1: Tokenize text
text = "Apple is looking at buying U.K. startup for $1 billion"
tokens = nltk.word_tokenize(text)

# Step 2: POS tagging
tagged_tokens = nltk.pos_tag(tokens)

# Step 3: Named Entity Recognition
named_entities = nltk.ne_chunk(tagged_tokens)

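# repr() gives the nested Tree(...) form; a plain print() would use bracketed notation instead
print(repr(named_entities))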

Output

Tree('S', [Tree('ORGANIZATION', [('Apple', 'NNP')]), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ('buying', 'VBG'), Tree('GPE', [('U.K.', 'NNP')]), ('startup', 'NN'), ('for', 'IN'), ('$', '$'), ('1', 'CD'), ('billion', 'CD')])
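
A tree is not always the most convenient shape to work with. As a minimal sketch, the tree can be flattened into one (word, POS tag, IOB tag) triple per token using nltk.chunk.tree2conlltags. The tree below is built by hand to mirror part of the output above, so the sketch needs no model downloads; in practice you would pass the result of ne_chunk() instead.

```python
from nltk.tree import Tree
from nltk.chunk import tree2conlltags

# Hand-built tree that mirrors the start of the ne_chunk output above
# (an illustration only; normally this comes from nltk.ne_chunk).
tree = Tree("S", [
    Tree("ORGANIZATION", [("Apple", "NNP")]),
    ("is", "VBZ"), ("looking", "VBG"), ("at", "IN"), ("buying", "VBG"),
    Tree("GPE", [("U.K.", "NNP")]),
    ("startup", "NN"),
])

# Each token becomes (word, pos, iob): B-/I- prefixes mark the beginning and
# inside of an entity, and "O" marks tokens outside any entity.
iob_tags = tree2conlltags(tree)

for word, pos, iob in iob_tags:
    print(f"{word}\t{pos}\t{iob}")
```

The flat form is handy when entities need to be aligned token by token with other annotations or written to a column-based file.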

Example

This example shows how to extract named entities from a sentence using NLTK's built-in functions.

python

import nltk

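# Download required NLTK data if not already present.
# Newer NLTK releases (3.9 and later) may ask for 'punkt_tab',
# 'averaged_perceptron_tagger_eng' and 'maxent_ne_chunker_tab' instead;
# if a LookupError appears, download the resource it names.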
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

text = "Barack Obama was born in Hawaii and was the 44th President of the United States."

# Tokenize
tokens = nltk.word_tokenize(text)

# POS tagging
tagged = nltk.pos_tag(tokens)

# Named Entity Recognition
entities = nltk.ne_chunk(tagged)

# Print named entities with labels.
# Entities are Tree subtrees; ordinary tokens are plain (word, tag) tuples.
for subtree in entities:
    if hasattr(subtree, 'label'):
        entity_name = ' '.join(c[0] for c in subtree)
        entity_type = subtree.label()
        print(f"{entity_name}: {entity_type}")

Output

Barack Obama: PERSON
Hawaii: GPE
United States: GPE
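
The extraction loop above can be wrapped in a small reusable function. The sketch below is one possible way to do it: the name extract_entities is hypothetical (not part of NLTK), and the sample tree is built by hand in the shape ne_chunk() returns so the code runs without any downloaded models.

```python
from nltk.tree import Tree


def extract_entities(tree):
    """Return (entity_text, entity_label) pairs from an ne_chunk-style tree."""
    entities = []
    for node in tree:
        # Entity chunks are Tree objects; other tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            words = [word for word, _tag in node.leaves()]
            # Join the words so multi-word entities stay together.
            entities.append((" ".join(words), node.label()))
    return entities


# Hand-built sample in the same shape as the result of nltk.ne_chunk
sample = Tree("S", [
    Tree("PERSON", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("was", "VBD"), ("born", "VBN"), ("in", "IN"),
    Tree("GPE", [("Hawaii", "NNP")]),
])

print(extract_entities(sample))
# [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```

Returning a list of pairs rather than printing makes the result easy to filter by label or count afterwards.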

Common Pitfalls

Common mistakes when doing NER with NLTK include:

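Not tokenizing text properly before POS tagging: passing a raw string to pos_tag() instead of a list of tokens raises a TypeError in recent NLTK versions, and splitting with text.split() runs but leaves punctuation attached to words.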
Forgetting to download required NLTK data packages like maxent_ne_chunker and words.
Expecting ne_chunk to return a simple list instead of a tree structure.
Not handling multi-word entities properly when extracting names from the tree.

Always check that you have downloaded all necessary NLTK models and process text in the correct order: tokenize, POS tag, then chunk.

python

import nltk
from nltk.tokenize import word_tokenize

text = "Google is a tech giant."

# Naive: whitespace splitting runs, but leaves punctuation attached ('giant.')
tagged = nltk.pos_tag(text.split())

# Better: word_tokenize separates punctuation into its own token
tokens = word_tokenize(text)
tagged_correct = nltk.pos_tag(tokens)

print(tagged)
print(tagged_correct)

Output

[('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('tech', 'NN'), ('giant.', 'NN')]
[('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('tech', 'NN'), ('giant', 'NN'), ('.', '.')]
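
The missing-data pitfall can also be checked up front rather than discovered through a LookupError. The following is a rough sketch, not an official NLTK recipe: the helper names (REQUIRED_DATA, missing_packages, ensure_nltk_data) are hypothetical, and the resource paths are assumptions based on the package names used in this article, so newer NLTK releases may need different entries.

```python
import nltk

# Package name -> path inside the NLTK data directory.
# Assumed layout for the packages named in this article; adjust as needed.
REQUIRED_DATA = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
}


def missing_packages(required=REQUIRED_DATA, find=nltk.data.find):
    """Return the package names whose data cannot be found locally."""
    missing = []
    for package, path in required.items():
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            missing.append(package)
    return missing


def ensure_nltk_data():
    """Download only the packages that are not installed yet."""
    for package in missing_packages():
        nltk.download(package, quiet=True)


# Report what is missing without downloading anything.
print(missing_packages())
```

Calling ensure_nltk_data() once at program start would then fetch only what is absent, which avoids repeated download calls on every run.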

Quick Reference

Summary tips for NER with NLTK:

Always tokenize text with word_tokenize().
Use pos_tag() to add part-of-speech tags before NER.
Call ne_chunk() to get named entities as a tree.
Extract entities by checking for subtrees with label().
Download required NLTK data packages once using nltk.download().

Step	Function	Purpose
1	word_tokenize(text)	Split text into words
2	pos_tag(tokens)	Tag words with parts of speech
3	ne_chunk(tagged_tokens)	Identify named entities
4	subtree.label()	Get entity type from tree
-	nltk.download('package')	Download required models

Key Takeaways

Use nltk.word_tokenize, pos_tag, then ne_chunk in sequence for NER.

ne_chunk returns a tree; extract entities by checking subtree labels.

Always download necessary NLTK data packages before running NER.

Tokenization and POS tagging are essential preprocessing steps.

Handle multi-word entities by joining tokens from subtrees.