NLP Program to Extract Keywords Using Python
Use sklearn.feature_extraction.text.TfidfVectorizer to extract keywords: fit it on the text, read the vocabulary with get_feature_names_out(), and keep the words with the highest TF-IDF scores.
Examples
How to Think About It
Algorithm
Code
```python
from sklearn.feature_extraction.text import TfidfVectorizer


def extract_keywords(text, top_n=5):
    """Return up to top_n words from text, ranked by TF-IDF score."""
    if not text.strip():
        return []
    vectorizer = TfidfVectorizer(stop_words='english')
    try:
        tfidf_matrix = vectorizer.fit_transform([text])
    except ValueError:
        # Raised when the text contains only stop words (empty vocabulary).
        return []
    feature_names = vectorizer.get_feature_names_out()
    scores = tfidf_matrix.toarray()[0]
    # Indices of the highest scores first, limited to top_n.
    top_indices = scores.argsort()[::-1][:top_n]
    return [feature_names[i] for i in top_indices if scores[i] > 0]


sample_text = "Natural language processing helps computers understand human language."
print(extract_keywords(sample_text, top_n=7))
```
Dry Run
Let's trace the example 'Natural language processing helps computers understand human language.' through the code.
Input Text
```python
text = 'Natural language processing helps computers understand human language.'
```
Create TF-IDF Vectorizer
```python
vectorizer = TfidfVectorizer(stop_words='english')
```
Fit and Transform Text
```python
tfidf_matrix = vectorizer.fit_transform([text])
feature_names = vectorizer.get_feature_names_out()
```
Get Scores and Sort
```python
scores = tfidf_matrix.toarray()[0]
top_indices = scores.argsort()[::-1][:7]
```
Select Keywords
```python
keywords = [feature_names[i] for i in top_indices if scores[i] > 0]
```
The result lists 'language' first because it occurs twice. The other six words ('natural', 'processing', 'helps', 'computers', 'understand', 'human') each occur once and tie, so their relative order depends on the sort and carries no meaning.
| Word | TF-IDF Score |
|---|---|
| language | 0.63 |
| natural | 0.32 |
| processing | 0.32 |
| helps | 0.32 |
| computers | 0.32 |
| understand | 0.32 |
| human | 0.32 |
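To see where the dry-run scores come from, they can be recomputed by hand. With a single document the IDF factor is identical for every word, so under TfidfVectorizer's default settings (smoothed IDF, L2 normalization) each score is the word's count divided by the Euclidean length of the count vector. The sketch below uses only the standard library. Its regular-expression tokenizer is a simplification, and it assumes the sample sentence contains no stop words, so no filter is applied.

```python
import math
import re
from collections import Counter

text = "Natural language processing helps computers understand human language."

# Simplified tokenizer: lowercase words of two or more letters.
tokens = re.findall(r"[a-z]{2,}", text.lower())
counts = Counter(tokens)

# With one document every word has the same IDF, so only the counts
# and the unit-length scaling determine the final scores.
norm = math.sqrt(sum(c * c for c in counts.values()))
hand_scores = {word: count / norm for word, count in counts.items()}

for word, score in sorted(hand_scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{word:12s}{score:.3f}")
# language    0.632
# computers   0.316  (the other five words also score 0.316)
```

Here 'language' occurs twice and the six other words once, so the vector length is sqrt(4 + 6) = sqrt(10), giving 2/sqrt(10) and 1/sqrt(10).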
Why This Works
Step 1: TF-IDF Vectorizer
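TfidfVectorizer converts text into a vector of weights, one per word. A weight rises with how often the word occurs in a document and falls with how many documents in the corpus contain it. Because this program fits on a single document, the document-frequency part is the same for every word, and the ranking reduces to normalized word counts.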
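The IDF part only starts to matter once there are several documents. The following is a hand-rolled sketch, not scikit-learn's implementation: it applies the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default, and skips the final normalization step. The corpus and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["language", "models", "process", "language"],
    ["computers", "process", "data"],
    ["language", "data", "helps", "computers"],
]

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf(doc):
    """Unnormalized TF-IDF with smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(doc)
    return {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }


weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s}{weight:.3f}")
```

In the first document, 'models' and 'process' both occur once, yet 'models' gets the higher weight because it appears in only one document while 'process' appears in two.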
Step 2: Stop Words Removal
Passing stop_words='english' removes common words such as 'the' and 'and' before scoring, so only content words can become keywords.
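As a rough picture of what stop-word removal does, here is a minimal standard-library sketch. The seven-word `STOP_WORDS` set and the `content_words` helper are hypothetical; scikit-learn's built-in English list is far longer.

```python
import re

# Hypothetical mini stop list, for illustration only.
STOP_WORDS = {"the", "and", "a", "of", "to", "is", "in"}


def content_words(text):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(content_words("The cat and the dog sat in the garden."))
# ['cat', 'dog', 'sat', 'garden']
```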
Step 3: Sorting by Score
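scores.argsort()[::-1] orders the vocabulary indices from highest to lowest TF-IDF score, and the slice [:top_n] keeps the first top_n of them, so the most important keywords come first. Words with equal scores come out in no meaningful order.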
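When many words share a score, as they do for single-document input, a reproducible order can be useful. One possible approach, sketched here with a hypothetical `rank_keywords` helper and made-up scores, is to sort by descending score and break ties alphabetically.

```python
def rank_keywords(words, scores, top_n=5):
    """Rank words by descending score, breaking ties alphabetically."""
    pairs = [(w, s) for w, s in zip(words, scores) if s > 0]
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return [w for w, _ in pairs[:top_n]]


words = ["computers", "helps", "human", "language", "natural"]
scores = [0.3, 0.3, 0.3, 0.6, 0.3]
print(rank_keywords(words, scores, top_n=3))
# ['language', 'computers', 'helps']
```

The same function could be fed the output of get_feature_names_out() and the score array in place of the sample lists.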
Alternative Approaches
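```python
# Requires the rake-nltk package plus NLTK's stop-word and tokenizer data.
from rake_nltk import Rake


def extract_keywords_rake(text, top_n=5):
    r = Rake()  # uses NLTK's English stop words by default
    r.extract_keywords_from_text(text)
    # Phrases are returned highest score first.
    return r.get_ranked_phrases()[:top_n]


sample_text = 'Natural language processing helps computers understand human language.'
print(extract_keywords_rake(sample_text, top_n=5))
```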
```python
# Requires spaCy and the model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')


def extract_keywords_spacy(text, top_n=5):
    doc = nlp(text)
    # Noun chunks come out in document order, not ranked by importance.
    chunks = [chunk.text for chunk in doc.noun_chunks]
    return chunks[:top_n]


sample_text = 'Natural language processing helps computers understand human language.'
print(extract_keywords_spacy(sample_text, top_n=5))
```
Complexity: O(n) time, O(n) space
Time Complexity
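Tokenizing and counting touch each word once, so vectorization grows linearly with the text length n. Ranking the v unique words with argsort adds O(v log v), which is usually small next to tokenization because v is at most n.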
Space Complexity
The vocabulary and the score array are kept in memory, so space grows with the number of unique words, which is at most n.
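If the vocabulary is large and only a few keywords are needed, the full sort can be avoided. A possible sketch with the standard library's heapq.nlargest, which selects the k best of v items in roughly O(v log k); the `top_n_words` helper and the sample scores are hypothetical.

```python
import heapq


def top_n_words(word_scores, top_n=5):
    """Return the top_n highest-scoring words without sorting every entry."""
    best = heapq.nlargest(top_n, word_scores.items(), key=lambda item: item[1])
    return [word for word, score in best if score > 0]


word_scores = {"language": 0.63, "natural": 0.32, "filler": 0.0, "human": 0.20}
print(top_n_words(word_scores, top_n=2))
# ['language', 'natural']
```

For the short texts in this article the difference is negligible, so the simpler argsort version is a reasonable default.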
Which Approach is Fastest?
TF-IDF is fast and simple for single documents; RAKE is slower but extracts phrases; spaCy is heavier but provides linguistic features.
| Approach | Time | Space | Best For |
|---|---|---|---|
| TF-IDF | O(n) | O(n) | Quick keyword extraction from single text |
| RAKE | O(n) | O(n) | Extracting keyword phrases without extra data |
| spaCy noun chunks | O(n) | O(n) | Linguistic phrase extraction, no ranking |
Use stop_words='english' in TfidfVectorizer to ignore common words and get better keywords.