How to Use NLTK FreqDist in NLP for Word Frequency Analysis
Use nltk.FreqDist to count how often each word appears in a text. It takes a list of words and returns a frequency distribution object that lets you easily find the most common words or their counts.
Syntax
The basic syntax to create a frequency distribution is FreqDist(samples), where samples is a list of words or tokens. You can then use methods like freqdist.most_common(n) to get the top n frequent words or freqdist[word] to get the count of a specific word.
python
from nltk import FreqDist

# Create a frequency distribution from a list of words
freqdist = FreqDist(['word1', 'word2', 'word1', 'word3'])

# Get count of a word
count_word1 = freqdist['word1']

# Get most common words
top_words = freqdist.most_common(2)
Example
This example shows how to tokenize a simple sentence, create a frequency distribution of the words, and print the most common words with their counts.
python
import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize

# Download required NLTK data
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

text = "Natural language processing with NLTK is fun and useful. NLTK helps analyze text."

# Tokenize the text into words
words = word_tokenize(text.lower())

# Create frequency distribution
freqdist = FreqDist(words)

# Print the 5 most common words
print(freqdist.most_common(5))
Output
[('nltk', 2), ('.', 2), ('natural', 1), ('language', 1), ('processing', 1)]
Common Pitfalls
One common mistake is passing a raw string to FreqDist instead of a list of tokens. FreqDist accepts any iterable, so a string is counted character by character rather than word by word. Another is ignoring case differences, which splits counts for the same word (for example, "Hello" and "hello"). Also, punctuation marks are counted as separate tokens unless you remove them.
python
from nltk import FreqDist
from nltk.tokenize import word_tokenize

# Wrong: passing raw string
freqdist_wrong = FreqDist("Hello hello world")
print(freqdist_wrong.most_common())

# Right: tokenize and normalize case
words = word_tokenize("Hello hello world".lower())
freqdist_right = FreqDist(words)
print(freqdist_right.most_common())
Output
[('l', 5), ('o', 3), ('e', 2), (' ', 2), ('H', 1), ('h', 1), ('w', 1), ('r', 1), ('d', 1)]
[('hello', 2), ('world', 1)]
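The punctuation pitfall can be handled by filtering tokens before counting. The following is a minimal sketch, assuming that keeping only purely alphabetic tokens (via str.isalpha()) suits your text. It uses wordpunct_tokenize, a regex-based NLTK tokenizer that needs no downloaded data, in place of word_tokenize. This filter also drops numbers, so adjust the condition if you need them.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

text = "Natural language processing with NLTK is fun and useful. NLTK helps analyze text."

# Lowercase first so "NLTK" and "nltk" are counted together
tokens = wordpunct_tokenize(text.lower())

# Keep only alphabetic tokens; this drops "." and other punctuation
words = [tok for tok in tokens if tok.isalpha()]

freqdist = FreqDist(words)
print(freqdist.most_common(3))
```

With punctuation removed, the period no longer competes with real words for a place among the most common entries.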
Quick Reference
| Method | Description |
|---|---|
| FreqDist(samples) | Create frequency distribution from list of words |
| freqdist[word] | Get count of a specific word |
| freqdist.most_common(n) | Get list of top n most frequent words and counts |
| freqdist.keys() | Get all unique words in the distribution |
| freqdist.N() | Get total number of samples (words) counted |
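The methods in the table can be tried on a small hand-written token list. This is a brief sketch: the sample tokens are made up for illustration, and freq() (relative frequency) is an additional FreqDist method not listed in the table.

```python
from nltk import FreqDist

# Made-up tokens for illustration
tokens = ["text", "data", "text", "model", "text", "data"]
freqdist = FreqDist(tokens)

count_text = freqdist["text"]         # count of one word
count_missing = freqdist["absent"]    # unseen words count as 0, no KeyError
top_two = freqdist.most_common(2)     # two most frequent words with counts
unique_words = list(freqdist.keys())  # distinct words, in first-seen order
total = freqdist.N()                  # total number of tokens counted
share_text = freqdist.freq("text")    # relative frequency: count / N()

print(count_text, count_missing, top_two, unique_words, total, share_text)
```

Because FreqDist is built on Python's collections.Counter, looking up a word that never appeared returns 0 rather than raising an error.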
Key Takeaways
Always tokenize and normalize text before using FreqDist for accurate counts.
FreqDist counts how often each word appears and helps find the most common words easily.
Use freqdist.most_common(n) to get the top n frequent words with their counts.
Punctuation and case differences affect counts unless handled properly.
FreqDist is a simple tool for quick word frequency analysis in NLP tasks.
