
How to Use NLTK FreqDist in NLP for Word Frequency Analysis

Use nltk.FreqDist to count how often each word appears in a text. It takes a list of words (or any other iterable of tokens) and returns a frequency distribution object from which you can look up a word's count or list the most common words.

📐

Syntax

The basic syntax to create a frequency distribution is FreqDist(samples), where samples is a list of words or tokens. You can then call freqdist.most_common(n) to get the n most frequent words with their counts, or use freqdist[word] to get the count of a specific word.

python

from nltk import FreqDist

# Create a frequency distribution from a list of words
freqdist = FreqDist(['word1', 'word2', 'word1', 'word3'])

# Get count of a word
count_word1 = freqdist['word1']

# Get most common words
top_words = freqdist.most_common(2)

💻

Example

This example shows how to tokenize a simple sentence, create a frequency distribution of the words, and print the most common words with their counts.

python

import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize

# Download the tokenizer data used by word_tokenize
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

text = "Natural language processing with NLTK is fun and useful. NLTK helps analyze text."

# Tokenize the text into words
words = word_tokenize(text.lower())

# Create frequency distribution
freqdist = FreqDist(words)

# Print the 5 most common words
print(freqdist.most_common(5))

Output

[('nltk', 2), ('.', 2), ('natural', 1), ('language', 1), ('processing', 1)]

⚠️

Common Pitfalls

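One common mistake is passing a raw string to FreqDist without tokenizing it first. FreqDist accepts any iterable, so it does not raise an error; it iterates over the string and counts individual characters instead of words. Another is ignoring case differences, which splits the counts for the same word (for example, "Hello" and "hello"). Also, punctuation marks are counted as separate tokens unless you remove them.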

python

from nltk import FreqDist

# Wrong: passing raw string
freqdist_wrong = FreqDist("Hello hello world")
print(freqdist_wrong.most_common())

# Right: tokenize and normalize case
from nltk.tokenize import word_tokenize
words = word_tokenize("Hello hello world".lower())
freqdist_right = FreqDist(words)
print(freqdist_right.most_common())

Output

[('l', 5), ('o', 3), ('e', 2), (' ', 2), ('H', 1), ('h', 1), ('w', 1), ('r', 1), ('d', 1)]
[('hello', 2), ('world', 1)]
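
The punctuation pitfall can be handled with a simple filter before counting. The following is a minimal sketch, not the only way to do it: it assumes that keeping only purely alphabetic tokens (str.isalpha()) is acceptable for your text, and it uses wordpunct_tokenize, a regex-based NLTK tokenizer chosen here because it needs no downloaded data. Note that this filter also drops numbers and hyphenated words, which may not suit every task.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize  # regex-based, no nltk.download needed

text = "Natural language processing with NLTK is fun and useful. NLTK helps analyze text."

# Lowercase first so "NLTK" and "nltk" are counted together
tokens = wordpunct_tokenize(text.lower())

# Assumption: only purely alphabetic tokens count as words
words = [tok for tok in tokens if tok.isalpha()]

freqdist = FreqDist(words)

print(freqdist['nltk'])   # 2
print('.' in freqdist)    # False: the period is no longer counted
print(freqdist.N())       # 13 word tokens remain
```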

📊

Quick Reference

Method	Description
FreqDist(samples)	Create frequency distribution from list of words
freqdist[word]	Get count of a specific word
freqdist.most_common(n)	Get list of top n most frequent words and counts
freqdist.keys()	Get all unique words in the distribution
freqdist.N()	Get total number of samples (words) counted
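
To see the methods in the table side by side, here is a small sketch on a hand-made token list (the tokens are illustrative only). It relies on FreqDist behaving like Python's collections.Counter, which it subclasses, so looking up a word that never occurred returns 0 rather than raising an error.

```python
from nltk import FreqDist

# Illustrative token list; in practice this comes from a tokenizer
tokens = ['to', 'be', 'or', 'not', 'to', 'be']
freqdist = FreqDist(tokens)

print(freqdist['to'])            # 2
print(freqdist['missing'])       # 0 (no KeyError for unseen words)
print(freqdist.most_common(2))   # [('to', 2), ('be', 2)]
print(sorted(freqdist.keys()))   # ['be', 'not', 'or', 'to']
print(freqdist.N())              # 6 (total tokens, not unique words)
```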

✅

Key Takeaways

Always tokenize and normalize text before using FreqDist for accurate counts.

FreqDist counts how often each word appears and helps find the most common words easily.

Use freqdist.most_common(n) to get the top n frequent words with their counts.

Punctuation and case differences affect counts unless handled properly.

FreqDist is a simple tool for quick word frequency analysis in NLP tasks.