How to Use NLTK FreqDist in NLP for Word Frequency Analysis

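Use nltk.FreqDist to count how often each word appears in a text. It takes a list of tokens (any iterable of hashable items works) and returns a frequency distribution object, a dictionary-like mapping from each token to its count, so you can look up individual counts or list the most common words.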

Syntax

The basic syntax to create a frequency distribution is FreqDist(samples), where samples is a list of words or tokens. You can then call freqdist.most_common(n) to get the n most frequent words as (word, count) pairs, or use freqdist[word] to get the count of a specific word (0 if the word never occurs).

python
from nltk import FreqDist

# Create a frequency distribution from a list of words
freqdist = FreqDist(['word1', 'word2', 'word1', 'word3'])

# Get count of a word
count_word1 = freqdist['word1']

# Get most common words
top_words = freqdist.most_common(2)

Example

This example shows how to tokenize a simple sentence, create a frequency distribution of the words, and print the most common words with their counts.

python
import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize

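# Download the tokenizer data used by word_tokenize
# (recent NLTK releases look for 'punkt_tab'; older ones use 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')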

text = "Natural language processing with NLTK is fun and useful. NLTK helps analyze text."

# Tokenize the text into words
words = word_tokenize(text.lower())

# Create frequency distribution
freqdist = FreqDist(words)

# Print the 5 most common words
print(freqdist.most_common(5))
Output
[('nltk', 2), ('.', 2), ('natural', 1), ('language', 1), ('processing', 1)]

Common Pitfalls

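One common mistake is passing a raw string to FreqDist instead of a list of words. FreqDist iterates over whatever it is given, so a string is counted character by character and you get letter counts rather than word counts. Another is ignoring case differences, which splits the count for the same word across several keys ("Hello" and "hello"). Also, punctuation marks are counted as separate tokens unless you remove them.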

python
from nltk import FreqDist

# Wrong: passing raw string
freqdist_wrong = FreqDist("Hello hello world")
print(freqdist_wrong.most_common())

# Right: tokenize and normalize case
from nltk.tokenize import word_tokenize
words = word_tokenize("Hello hello world".lower())
freqdist_right = FreqDist(words)
print(freqdist_right.most_common())
Output
[('l', 5), ('o', 3), ('e', 2), (' ', 2), ('H', 1), ('h', 1), ('w', 1), ('r', 1), ('d', 1)]
[('hello', 2), ('world', 1)]
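
The punctuation pitfall can be handled with a simple filter before counting. The following is a minimal sketch, not the only way to do it: the token list is hard-coded (an assumption standing in for what word_tokenize would return), and the try/except import is only there so the snippet also runs where NLTK is not installed, relying on the fact that FreqDist subclasses collections.Counter.

```python
try:
    from nltk import FreqDist
except ImportError:
    # Stand-in when NLTK is not installed: FreqDist subclasses Counter
    from collections import Counter as FreqDist

# Hard-coded tokens (an assumption) standing in for word_tokenize output
tokens = ["nltk", "is", "fun", ".", "nltk", "helps", ",", "really", "."]

# Keep only purely alphabetic tokens; this drops "." and ","
words = [tok for tok in tokens if tok.isalpha()]

raw_counts = FreqDist(tokens)
clean_counts = FreqDist(words)

print(raw_counts.most_common(2))    # punctuation competes with real words
print(clean_counts.most_common(2))  # only words remain
```

Note that str.isalpha() also discards tokens containing digits, hyphens, or apostrophes, so a looser filter may suit some texts better.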

Quick Reference

Method | Description
FreqDist(samples) | Create a frequency distribution from a list of words
freqdist[word] | Get the count of a specific word
freqdist.most_common(n) | Get a list of the top n most frequent words and their counts
freqdist.keys() | Get all unique words in the distribution
freqdist.N() | Get the total number of samples (words) counted
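
As a rough illustration of the methods in the table, the sketch below uses keys() and N() together to compute a word's share of the text. The token list is made up for the example, and the small fallback class is a stand-in (an assumption, not part of NLTK) so the snippet still runs without NLTK installed.

```python
try:
    from nltk import FreqDist
except ImportError:
    # Stand-in for environments without NLTK: FreqDist subclasses
    # collections.Counter, and N() is the sum of all counts.
    from collections import Counter

    class FreqDist(Counter):
        def N(self):
            return sum(self.values())

# Made-up, already tokenized and lowercased input
tokens = ["nltk", "helps", "analyze", "text", "and", "nltk", "is", "fun"]
fd = FreqDist(tokens)

unique_words = sorted(fd.keys())  # every distinct word, sorted for display
total = fd.N()                    # total number of tokens counted
share = fd["nltk"] / total        # relative frequency of one word
missing = fd["python"]            # unseen words give 0, not a KeyError

print(unique_words)
print(total, share, missing)
```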

Key Takeaways

Always tokenize and normalize text before using FreqDist for accurate counts.
FreqDist counts how often each word appears and helps find the most common words easily.
Use freqdist.most_common(n) to get the top n frequent words with their counts.
Punctuation and case differences affect counts unless handled properly.
FreqDist is a simple tool for quick word frequency analysis in NLP tasks.