How to Do POS Tagging with NLTK in NLP
Use the nltk.pos_tag() function to assign part-of-speech tags to words in a sentence. First, tokenize the sentence with nltk.word_tokenize(), then pass the tokens to pos_tag() to get a list of (word, tag) tuples.
Syntax
The main function for POS tagging in NLTK is nltk.pos_tag(tokens). Here, tokens is a list of words (tokens) from your text. The function returns a list of tuples where each tuple contains a word and its POS tag.
Before tagging, you usually tokenize the sentence using nltk.word_tokenize(text), which splits the sentence into words.
```python
import nltk

# Tokenize a sentence into words
tokens = nltk.word_tokenize('This is a simple sentence.')

# POS tag the tokens
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
```
Output
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('sentence', 'NN'), ('.', '.')]
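Because the result is an ordinary Python list of (word, tag) tuples, it can be unpacked and reshaped with standard language features. The sketch below hardcodes the output shown above, so it runs without NLTK or its data packages; the variable names are illustrative only.

```python
# Tagged output copied from the example above, hardcoded so this sketch
# runs without NLTK installed.
pos_tags = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
            ('simple', 'JJ'), ('sentence', 'NN'), ('.', '.')]

# Each element is a (word, tag) tuple, so it unpacks directly in a loop.
for word, tag in pos_tags:
    print(f'{word:<10}{tag}')

# zip(*pairs) splits the list into two parallel tuples.
words, tags = zip(*pos_tags)
print(words)
print(tags)
```

Splitting the pairs this way is convenient when the words and the tags need to go into separate columns or arrays.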
Example
This example shows how to tokenize a sentence and then apply POS tagging using NLTK. The output is a list of word-tag pairs.
```python
import nltk

# Download required resources (only once).
# Recent NLTK releases (3.9 and later) look for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead of the two names below.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = 'NLTK is a great library for natural language processing.'

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Get POS tags
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
```
Output
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]
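A common next step is to pull out words of one grammatical category. The following is a minimal sketch that works on the output shown above, hardcoded so it runs without NLTK; words_with_prefix is a hypothetical helper, not part of NLTK. It relies on the fact that related Penn Treebank tags share a prefix (NN, NNS, NNP and NNPS all start with NN).

```python
from collections import defaultdict

# Output of the example above, hardcoded so this sketch runs on its own.
pos_tags = [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'),
            ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'),
            ('language', 'NN'), ('processing', 'NN'), ('.', '.')]


def words_with_prefix(tagged, prefix):
    """Return the words whose tag starts with prefix (hypothetical helper)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_prefix(pos_tags, 'NN')       # NN, NNS, NNP, NNPS
adjectives = words_with_prefix(pos_tags, 'JJ')  # JJ, JJR, JJS

# Group every word under its exact tag.
by_tag = defaultdict(list)
for word, tag in pos_tags:
    by_tag[tag].append(word)

print(nouns)       # ['NLTK', 'library', 'language', 'processing']
print(adjectives)  # ['great', 'natural']
print(dict(by_tag))
```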
Common Pitfalls
- Not downloading the required NLTK data packages, such as punkt and averaged_perceptron_tagger, before running POS tagging raises a LookupError.
- Passing a raw sentence string directly to pos_tag() instead of a token list causes an error or incorrect results. Converting the string with list() does not help, because each character is then tagged as a separate token.
- Ignoring punctuation tokens can lead to confusion; punctuation is tagged too.
```python
import nltk

sentence = 'This is wrong.'

# Wrong way: passing the string directly
try:
    print(nltk.pos_tag(sentence))
except Exception as e:
    print('Error:', e)

# Right way: tokenize first
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
```
Output
Error: tokens: expected a list of strings, got a string
[('This', 'DT'), ('is', 'VBZ'), ('wrong', 'JJ'), ('.', '.')]
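Since punctuation tokens are tagged along with words, downstream code that only wants words may need to filter them out. One possible approach is sketched below on the tagged output shown above, hardcoded so it runs without NLTK. The is_punctuation helper is hypothetical and only recognizes ASCII punctuation from Python's string module, so typographic quotes or dashes would need extra handling.

```python
import string

# Tagged output from the example above, hardcoded so this sketch runs alone.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('wrong', 'JJ'), ('.', '.')]


def is_punctuation(token):
    """True if every character of token is ASCII punctuation (hypothetical helper)."""
    return bool(token) and all(ch in string.punctuation for ch in token)


# Keep only the pairs whose word is not pure punctuation.
words_only = [(word, tag) for word, tag in tagged if not is_punctuation(word)]
print(words_only)  # [('This', 'DT'), ('is', 'VBZ'), ('wrong', 'JJ')]
```

Filtering on the word rather than on the tag avoids having to list every punctuation tag in the tag set.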
Quick Reference
Here is a quick summary of key functions and tags:
| Function/Tag | Description |
|---|---|
| nltk.word_tokenize(text) | Splits text into a list of word tokens |
| nltk.pos_tag(tokens) | Tags tokens with part-of-speech labels |
| DT | Determiner (e.g., 'the', 'a') |
| NN | Noun, singular (e.g., 'dog') |
| NNS | Noun, plural (e.g., 'dogs') |
| JJ | Adjective (e.g., 'big') |
| VBZ | Verb, 3rd person singular present (e.g., 'runs') |
| IN | Preposition or subordinating conjunction (e.g., 'in', 'of') |
| . | Sentence-final punctuation (e.g., '.', '?', '!') |
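The tag rows of the table can also be kept in code as a small lookup, which makes tagged output easier to read while debugging. This is only a sketch: TAG_DESCRIPTIONS and describe are hypothetical names, the dictionary covers just the tags listed in the table, and the sample pairs are written by hand rather than produced by the tagger.

```python
# Descriptions paraphrased from the quick-reference table; this is a small
# subset of the Penn Treebank tag set, not the full list.
TAG_DESCRIPTIONS = {
    'DT': 'Determiner',
    'NN': 'Noun, singular',
    'NNS': 'Noun, plural',
    'JJ': 'Adjective',
    'VBZ': 'Verb, 3rd person singular present',
    'IN': 'Preposition or subordinating conjunction',
    '.': 'Sentence-final punctuation',
}


def describe(tagged):
    """Attach a description to each (word, tag) pair (hypothetical helper)."""
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, 'not in this table'))
            for word, tag in tagged]


# Hand-written sample pairs; 'NNP' is deliberately missing from the lookup.
sample = [('dogs', 'NNS'), ('in', 'IN'), ('NLTK', 'NNP')]
for word, tag, description in describe(sample):
    print(f'{word:<6}{tag:<5}{description}')
```

Using dict.get with a fallback keeps the helper from failing on tags that are not in the table.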
Key Takeaways
- Always tokenize your sentence with nltk.word_tokenize before POS tagging.
- Use nltk.pos_tag to get part-of-speech tags as (word, tag) tuples.
- Download required NLTK data packages like 'punkt' and 'averaged_perceptron_tagger' before use.
- POS tags follow the Penn Treebank tag set, which includes tags like NN, JJ, VBZ, etc.
- Passing raw strings to pos_tag causes errors; always pass a list of tokens.
