How to Do POS Tagging with NLTK in NLP

Use the nltk.pos_tag() function to assign part-of-speech tags to words in a sentence. First, tokenize the sentence with nltk.word_tokenize(), then pass the tokens to pos_tag() to get tagged word tuples.

📐

Syntax

The main function for POS tagging in NLTK is nltk.pos_tag(tokens), where tokens is a list of token strings from your text. It returns a list of (word, tag) tuples, one per token.

Before tagging, you usually tokenize the sentence with nltk.word_tokenize(text), which splits it into word and punctuation tokens.

python

import nltk

# Tokenize a sentence into words
tokens = nltk.word_tokenize('This is a simple sentence.')

# POS tag the tokens
pos_tags = nltk.pos_tag(tokens)

print(pos_tags)

Output

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('sentence', 'NN'), ('.', '.')]

💻

Example

This example shows how to tokenize a sentence and then apply POS tagging using NLTK. The output is a list of word-tag pairs.

python

import nltk

# Download required resources (only once).
# NLTK 3.9 and later look for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = 'NLTK is a great library for natural language processing.'

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Get POS tags
pos_tags = nltk.pos_tag(tokens)

print(pos_tags)

Output

[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]
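
Because pos_tag() returns plain (word, tag) tuples, ordinary list operations are enough to pull out words of a given class. The following is a minimal sketch that reuses the tagged output above as a literal, so it runs without any NLTK data; words_with_tag_prefix is a hypothetical helper name, not part of NLTK.

```python
# Tagged output copied from the example above (a literal, so no NLTK data is needed).
pos_tags = [
    ('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'),
    ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'),
    ('language', 'NN'), ('processing', 'NN'), ('.', '.'),
]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Penn Treebank noun tags all begin with 'NN' (NN, NNS, NNP, NNPS).
nouns = words_with_tag_prefix(pos_tags, 'NN')
adjectives = words_with_tag_prefix(pos_tags, 'JJ')

print(nouns)       # ['NLTK', 'library', 'language', 'processing']
print(adjectives)  # ['great', 'natural']
```

Matching on a tag prefix rather than an exact tag is one possible design choice; it groups singular, plural, and proper nouns together, which may or may not be what a given task needs.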

⚠️

Common Pitfalls

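Skipping the NLTK data downloads: word_tokenize() and pos_tag() raise a LookupError if the tokenizer and tagger resources (punkt and averaged_perceptron_tagger, or punkt_tab and averaged_perceptron_tagger_eng in NLTK 3.9 and later) are not installed.
Passing a raw sentence string to pos_tag() instead of a token list: recent NLTK versions raise a TypeError, and older versions silently tag each character separately.
Forgetting that punctuation is tagged too: tokens such as '.' appear in the output with their own tags, so filter them out if you only need words.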

python

import nltk

# Wrong way: passing the raw string instead of a token list.
# Recent NLTK versions raise TypeError here; older ones tag each character instead.
sentence = 'This is wrong.'
try:
    print(nltk.pos_tag(sentence))
except TypeError as e:
    print('Error:', e)

# Right way: tokenize first
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))

Output

Error: tokens: expected a list of strings, got a string
[('This', 'DT'), ('is', 'VBZ'), ('wrong', 'JJ'), ('.', '.')]
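
Since punctuation tokens are tagged as well, a common follow-up step is to drop them when only words matter. Here is a rough sketch, assuming a punctuation token is one made up entirely of characters from Python's string.punctuation; drop_punctuation is an illustrative name, not an NLTK function.

```python
import string


def drop_punctuation(tagged):
    """Keep only pairs whose token is not made up purely of punctuation."""
    return [
        (word, tag) for word, tag in tagged
        if not all(ch in string.punctuation for ch in word)
    ]


# Tagged pairs written out as a literal so the sketch runs without NLTK data.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('wrong', 'JJ'), ('.', '.')]
words_only = drop_punctuation(tagged)

print(words_only)  # [('This', 'DT'), ('is', 'VBZ'), ('wrong', 'JJ')]
```

Filtering on the token text rather than on the tag keeps words like "don't" or "e-mail", which contain punctuation but are not punctuation tokens.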

📊

Quick Reference

Here is a quick summary of key functions and tags:

Function/Tag	Description
nltk.word_tokenize(text)	Splits text into a list of word tokens
nltk.pos_tag(tokens)	Tags tokens with part-of-speech labels
DT	Determiner (e.g., 'the', 'a')
NN	Noun, singular or mass (e.g., 'dog')
NNS	Noun, plural (e.g., 'dogs')
JJ	Adjective (e.g., 'big')
VBZ	Verb, 3rd person singular present (e.g., 'runs')
IN	Preposition or subordinating conjunction (e.g., 'in', 'of')
.	Sentence-final punctuation (e.g., '.', '?', '!')
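
Descriptions along the lines of this table can be kept in a small dictionary to make tagged output easier to read. This is only a sketch covering the tags listed above; TAG_DESCRIPTIONS and describe are hypothetical names, and any tag outside the table falls back to a placeholder string.

```python
# Descriptions for the handful of Penn Treebank tags covered in the table above.
TAG_DESCRIPTIONS = {
    'DT': 'Determiner',
    'NN': 'Noun, singular or mass',
    'NNS': 'Noun, plural',
    'JJ': 'Adjective',
    'VBZ': 'Verb, 3rd person singular present',
    'IN': 'Preposition or subordinating conjunction',
    '.': 'Sentence-final punctuation',
}


def describe(tagged):
    """Attach a readable description to each (word, tag) pair (hypothetical helper)."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, 'not in this table'))
        for word, tag in tagged
    ]


# Tagged pairs from the Syntax section, written as a literal.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
          ('simple', 'JJ'), ('sentence', 'NN'), ('.', '.')]

for word, tag, description in describe(tagged):
    print(f'{word:10} {tag:4} {description}')
```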

✅

Key Takeaways

Always tokenize your sentence with nltk.word_tokenize before POS tagging.

Use nltk.pos_tag to get part-of-speech tags as (word, tag) tuples.

Download the required NLTK data packages before use: 'punkt' and 'averaged_perceptron_tagger', or 'punkt_tab' and 'averaged_perceptron_tagger_eng' in NLTK 3.9 and later.

By default, nltk.pos_tag uses the Penn Treebank tag set, which includes tags such as NN, JJ, and VBZ.

Never pass a raw string to pos_tag; always pass a list of tokens. Recent NLTK versions raise a TypeError for strings, and older versions tag each character.