NLP Program to Extract Keywords Using Python

Use scikit-learn's TfidfVectorizer (from sklearn.feature_extraction.text) to extract keywords: fit it on the text, read the vocabulary with get_feature_names_out(), and return the words with the highest TF-IDF scores.

📋

Examples

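Input: Natural language processing helps computers understand human language.

Output: ['language', 'natural', 'processing', 'helps', 'computers', 'understand', 'human']

Input: Machine learning and deep learning are popular AI techniques.

Output: ['learning', 'machine', 'deep', 'popular', 'techniques', 'ai']

Input: (empty string)

Output: []

These outputs assume top_n=7. In the first two, the repeated word ('language', 'learning') always ranks first. The remaining words have identical scores, so their order is arbitrary and may differ from what is shown.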

🧠

How to Think About It

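To extract keywords, first split the text into words, then score each word with TF-IDF: term frequency (how often the word occurs in this text) multiplied by inverse document frequency (how rare the word is across a collection of texts). Words that occur often in this text but rarely in the others get higher scores. When only a single text is scored, as in the program below, every word has the same inverse document frequency, so the ranking comes down to how often each word occurs. Finally, pick the top scoring words as keywords.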
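To see why a collection of texts matters, here is a minimal sketch that computes TF-IDF by hand for three short documents. It assumes the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, which matches scikit-learn's documented default (smooth_idf=True). The corpus, the regex tokenizer, and the function name tfidf_scores are illustrative and not part of the original program, and L2 normalization is left out to keep the arithmetic visible.

```python
import math
import re
from collections import Counter

def tfidf_scores(docs, doc_index):
    """Raw (unnormalized) TF-IDF scores for one document in a small corpus."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Term frequency: how often each word occurs in the chosen document.
    tf = Counter(tokenized[doc_index])
    return {
        word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word, count in tf.items()
    }

# Hypothetical three-document corpus.
corpus = [
    "python parses text",
    "python trains models",
    "python plots data",
]
scores = tfidf_scores(corpus, 0)
# 'python' occurs in every document, so it scores lower than 'parses',
# which occurs only in the first one.
print(sorted(scores.items(), key=lambda item: -item[1]))
```

With one document instead of three, every word would have df = 1 and therefore the same IDF, which is why the single-text program ranks purely by word count.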

📐

Algorithm

Get input text from the user.

Convert text into a list of words and calculate TF-IDF scores for each word.

Sort words by their TF-IDF scores in descending order.

Select the top N words as keywords.

Return or print the list of keywords.
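The steps above can be sketched without any library. This is a simplified stand-in, not what TfidfVectorizer does internally: it assumes a tiny hand-picked stop word set (STOP_WORDS, hypothetical), a simple regex tokenizer, and, because there is only one document, plain word counts scaled to unit length. Ties are broken alphabetically so the result is reproducible.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word set for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}

def top_keywords(text, top_n=5):
    # Split into lowercase words of two or more characters, minus stop words.
    words = [w for w in re.findall(r"\w\w+", text.lower()) if w not in STOP_WORDS]
    if not words:
        return []
    counts = Counter(words)
    # Scale the counts to unit length (L2 norm) to get one score per word.
    norm = math.sqrt(sum(c * c for c in counts.values()))
    scores = {w: c / norm for w, c in counts.items()}
    # Highest score first, alphabetical order among equal scores.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    # Keep only the top N words.
    return ranked[:top_n]

print(top_keywords("Machine learning and deep learning are popular AI techniques.", top_n=3))
# prints ['learning', 'ai', 'deep']
```

Sorting on the pair (-score, word) is a design choice: it makes the order of equally scored words predictable, which a plain sort by score does not guarantee.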

💻

Code

python

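from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(text, top_n=5):
    """Return up to top_n words from text, ranked by TF-IDF score."""
    if not text.strip():
        return []
    # Drop common English words such as 'the' and 'and' before scoring.
    vectorizer = TfidfVectorizer(stop_words='english')
    try:
        tfidf_matrix = vectorizer.fit_transform([text])
    except ValueError:
        # Raised when no words remain, e.g. the text contains only stop words.
        return []
    feature_names = vectorizer.get_feature_names_out()
    scores = tfidf_matrix.toarray()[0]  # one row, because there is one document
    # Indices of the highest scores first; the order of tied scores is arbitrary.
    top_indices = scores.argsort()[::-1][:top_n]
    keywords = [feature_names[i] for i in top_indices if scores[i] > 0]
    return keywords

sample_text = "Natural language processing helps computers understand human language."
print(extract_keywords(sample_text, top_n=7))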

Output

['language', 'natural', 'processing', 'helps', 'computers', 'understand', 'human']

🔍

Dry Run

Let's trace the example 'Natural language processing helps computers understand human language.' through the code.

Input Text

text = 'Natural language processing helps computers understand human language.'

Create TF-IDF Vectorizer

vectorizer = TfidfVectorizer(stop_words='english')

Fit and Transform Text

tfidf_matrix = vectorizer.fit_transform([text])
feature_names = vectorizer.get_feature_names_out()

Get Scores and Sort

scores = tfidf_matrix.toarray()[0]
top_indices = scores.argsort()[::-1][:7]

Select Keywords

keywords = [feature_names[i] for i in top_indices if scores[i] > 0]

Result: ['language', 'natural', 'processing', 'helps', 'computers', 'understand', 'human']

Word	TF-IDF Score
language	0.63
natural	0.32
processing	0.32
helps	0.32
computers	0.32
understand	0.32
human	0.32

The six words that occur once have equal scores, so their order in the result is arbitrary.
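The scores for this sentence can be checked by hand. With a single document every word gets the same IDF weight, so each score is just the word's count divided by the length (L2 norm) of the count vector. The short sketch below redoes that arithmetic with the standard library. It assumes TfidfVectorizer's default L2 normalization and is not a call into scikit-learn.

```python
import math

# Word counts after stop word removal: 'language' twice, six other words once.
counts = {"language": 2, "natural": 1, "processing": 1, "helps": 1,
          "computers": 1, "understand": 1, "human": 1}

norm = math.sqrt(sum(c * c for c in counts.values()))  # sqrt(4 + 6) = sqrt(10)
scores = {word: round(c / norm, 2) for word, c in counts.items()}

print(scores["language"])  # 0.63
print(scores["natural"])   # 0.32
```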

💡

Why This Works

Step 1: TF-IDF Vectorizer

The TfidfVectorizer converts text into a vector of scores, one per word. A word's score rises with how often it occurs in the document and falls when the word occurs in many documents of the collection.

Step 2: Stop Words Removal

Passing stop_words='english' removes common words like 'the' and 'and' before scoring, so the result focuses on meaningful words. Without that argument no stop words are removed.

Step 3: Sorting by Score

Words are sorted by their TF-IDF scores so the most important keywords come first.
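To see the stop word filtering from Step 2 in isolation, the sketch below uses build_analyzer(), which returns the function a vectorizer applies to raw text (lowercasing, tokenizing, and stop word removal). The sample sentence is the second example from this article, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "Machine learning and deep learning are popular AI techniques."

# The analyzer is the preprocessing + tokenizing + filtering pipeline.
with_stop_words = TfidfVectorizer().build_analyzer()(text)
without_stop_words = TfidfVectorizer(stop_words='english').build_analyzer()(text)

print(with_stop_words)     # still contains 'and' and 'are'
print(without_stop_words)  # 'and' and 'are' have been dropped
```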

🔄

Alternative Approaches

RAKE (Rapid Automatic Keyword Extraction)

python

from rake_nltk import Rake

def extract_keywords_rake(text, top_n=5):
    r = Rake()
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:top_n]

sample_text = 'Natural language processing helps computers understand human language.'
print(extract_keywords_rake(sample_text, top_n=5))

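RAKE extracts keyword phrases without needing other documents, but the phrases it returns can be long. The rake_nltk package relies on NLTK, so you may need to download NLTK's stop word and tokenizer data once before this runs.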

Using spaCy noun chunks

python

import spacy

nlp = spacy.load('en_core_web_sm')

def extract_keywords_spacy(text, top_n=5):
    doc = nlp(text)
    chunks = [chunk.text for chunk in doc.noun_chunks]
    return chunks[:top_n]

sample_text = 'Natural language processing helps computers understand human language.'
print(extract_keywords_spacy(sample_text, top_n=5))

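spaCy extracts noun phrases as keywords but does not rank them by importance: this function simply returns the first top_n phrases in the order they appear. The en_core_web_sm model must be installed first (python -m spacy download en_core_web_sm).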

⚡

Complexity: O(n) time, O(n) space

Time Complexity

Tokenizing and counting process each word once, so the time grows roughly linearly with text length. The final sort runs over the unique words only, which is usually a much smaller set.

Space Complexity

The vectorizer stores the vocabulary and one score per word, so space grows with the number of unique words.

Which Approach is Fastest?

TF-IDF is fast and simple for single documents; RAKE is slower but extracts phrases; spaCy is heavier but provides linguistic features.

Approach	Time	Space	Best For
TF-IDF	O(n)	O(n)	Quick keyword extraction from single text
RAKE	O(n)	O(n)	Extracting keyword phrases without extra data
spaCy noun chunks	O(n)	O(n)	Linguistic phrase extraction, no ranking

💡

Use stop_words='english' in TF-IDF to ignore common words and get better keywords.

⚠️

Beginners often forget to remove stop words, causing common words to appear as keywords.