NLP Program to Summarize Text Using Python
Use the Hugging Face transformers pipeline('summarization') to create a simple NLP program that summarizes text, for example: from transformers import pipeline; summarizer = pipeline('summarization'); summary = summarizer(text)[0]['summary_text'].
Examples
How to Think About It
Algorithm
Code
    from transformers import pipeline

    # Load summarization pipeline
    summarizer = pipeline('summarization')

    # Input text
    text = "Machine learning is a method of data analysis that automates analytical model building."

    # Generate summary
    summary = summarizer(text, max_length=30, min_length=5, do_sample=False)[0]['summary_text']
    print(summary)
Dry Run
Let's trace the example text through the summarization code.
Load summarization pipeline
summarizer is set to a model that can summarize text.
Input text
text = "Machine learning is a method of data analysis that automates analytical model building."
Generate summary
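summarizer processes the text and returns a list containing one dictionary per input, for example [{'summary_text': 'Machine learning automates analytical model building.'}]. The exact wording depends on the model that was loaded.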
Extract summary
summary = 'Machine learning automates analytical model building.'
Print summary
Output: Machine learning automates analytical model building.
| Step | Action | Value |
|---|---|---|
| 1 | Load model | summarizer pipeline ready |
| 2 | Input text | Machine learning is a method of data analysis that automates analytical model building. |
| 3 | Model output | [{'summary_text': 'Machine learning automates analytical model building.'}] |
| 4 | Extract summary | Machine learning automates analytical model building. |
| 5 | Print summary | Machine learning automates analytical model building. |
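The trace above can be followed without downloading a model by swapping in a stand-in for the pipeline. In this minimal sketch, `fake_summarizer` is a hypothetical stub (it is not part of transformers) that returns a canned result in the same list-of-dictionaries shape as the dry run. It only illustrates why the `[0]['summary_text']` indexing is needed.

```python
def fake_summarizer(text, max_length=30, min_length=5, do_sample=False):
    """Hypothetical stand-in for the real pipeline.

    Returns a canned result in the list-of-dicts shape shown in the dry run;
    it does not actually summarize anything.
    """
    return [{'summary_text': 'Machine learning automates analytical model building.'}]


text = ("Machine learning is a method of data analysis that "
        "automates analytical model building.")

result = fake_summarizer(text, max_length=30, min_length=5, do_sample=False)
first = result[0]                # one dictionary per input text
summary = first['summary_text']  # the summary string itself

print(type(result).__name__, len(result))
print(summary)
```

Replacing `fake_summarizer` with the real `summarizer` from the Code section should leave the indexing unchanged, although the summary text itself will depend on the model.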
Why This Works
Step 1: Load summarization pipeline
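pipeline('summarization') downloads (on first use) and loads a pre-trained model together with its tokenizer. The model has been trained to shorten text while keeping its meaning. Because no model name is given, the library falls back to its default summarization model.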
Step 2: Input text
We provide the full text we want to summarize as input to the model.
Step 3: Generate summary
The model processes the input and creates a shorter version that captures the main ideas.
Step 4: Output summary
We extract the summary text from the model's output and print it for the user.
Alternative Approaches
    # Extractive summarization: score sentences by word frequency and keep the best one.
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from collections import defaultdict

    # Download tokenizer data and the stop-word list (needed once).
    # Newer NLTK releases may also need nltk.download('punkt_tab').
    nltk.download('punkt')
    nltk.download('stopwords')

    text = "Machine learning is a method of data analysis that automates analytical model building."
    sentences = sent_tokenize(text)
    stop_words = set(stopwords.words('english'))

    # Count how often each meaningful word appears.
    word_frequencies = defaultdict(int)
    for word in word_tokenize(text.lower()):
        if word.isalpha() and word not in stop_words:
            word_frequencies[word] += 1

    # Normalize counts so the most frequent word has weight 1.
    max_freq = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] /= max_freq

    # Score each sentence by summing the weights of its words.
    sentence_scores = defaultdict(int)
    for sent in sentences:
        for word in word_tokenize(sent.lower()):
            if word in word_frequencies:
                sentence_scores[sent] += word_frequencies[word]

    # Keep the single highest-scoring sentence as the summary.
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:1]
    summary = ' '.join(summary_sentences)
    print(summary)
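Because the NLTK example uses a single sentence, the scoring step has nothing to choose between. The sketch below applies the same word-frequency idea to three sentences using only the standard library, so it runs without any downloads. The regex-based sentence splitter, the tiny `STOP_WORDS` set, and the `extractive_summary` helper are simplifying assumptions made for illustration; they are not NLTK's tokenizers or stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK's list is much longer).
STOP_WORDS = {"a", "an", "and", "are", "from", "in", "is", "it",
              "of", "on", "that", "the", "to", "was"}


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    # Count the non-stop-words across the whole text.
    words = [w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOP_WORDS]
    if not words:
        return ''
    freq = Counter(words)
    max_freq = max(freq.values())

    # Score each sentence by the summed normalized frequency of its words.
    scores = []
    for index, sentence in enumerate(sentences):
        score = sum(freq[w] / max_freq
                    for w in re.findall(r'[a-z]+', sentence.lower())
                    if w in freq)
        scores.append((score, index))

    # Pick the top-scoring sentences, then restore document order.
    top = sorted(scores, key=lambda pair: pair[0], reverse=True)[:num_sentences]
    chosen = sorted(index for _, index in top)
    return ' '.join(sentences[i] for i in chosen)


text = ("Machine learning automates analytical model building. "
        "The weather was pleasant. "
        "Machine learning models learn from data.")
print(extractive_summary(text, num_sentences=1))
```

Here the first sentence wins because it contains the two repeated words ("machine", "learning") plus more content words than the third. Unlike the NLTK version, this sketch returns the selected sentences in document order, which tends to read more naturally when more than one sentence is kept.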
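    # API-based summarization with OpenAI.
    # Note: this uses the legacy (pre-1.0) interface of the openai package, and
    # text-davinci-003 has since been retired, so newer package versions and
    # currently available models require a different call.
    import openai

    openai.api_key = 'YOUR_API_KEY'

    text = "Machine learning is a method of data analysis that automates analytical model building."

    response = openai.Completion.create(
        engine='text-davinci-003',
        prompt=f'Summarize this: {text}',
        max_tokens=50
    )
    summary = response.choices[0].text.strip()
    print(summary)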
Complexity: O(n) time, O(n) space
Time Complexity
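The summarization model processes the input text once, so the cost grows with text length. Treat O(n) as a simplification: self-attention in transformer models scales quadratically with the number of tokens, and text longer than the model's maximum input length cannot be processed in a single pass.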
Space Complexity
On top of the fixed memory taken by the model weights, the program holds the input and output text plus the model's internal states, so space grows with input size.
Which Approach is Fastest?
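Extractive methods are faster, but they can only reuse sentences from the source, so the summaries tend to be rougher. Transformer models, such as those in the Hugging Face transformers library, are slower but generally produce more fluent summaries.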
| Approach | Time | Space | Best For |
|---|---|---|---|
| Transformer summarization | O(n) | O(n) | High-quality summaries, moderate text length |
| Extractive summarization | O(n) | O(n) | Fast summaries, simple use cases |
| API-based GPT-3 summarization | Depends on API latency | Minimal local | Very high-quality, requires internet and API key |
Use the max_length and min_length parameters to control summary size. Both are measured in tokens, not characters.
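One way to pick these values is to scale them to the input instead of hard-coding them. The sketch below is a rough heuristic under stated assumptions: `length_bounds` is a hypothetical helper, the ratios are arbitrary defaults, and the word count is only an approximation of the token count (real tokenizers usually produce somewhat more tokens than words).

```python
def length_bounds(text, max_ratio=0.5, min_ratio=0.1, floor=5):
    """Suggest (min_length, max_length) from a rough word count.

    Assumption: words approximate tokens. The ratios and the floor are
    illustrative defaults, not values recommended by any library.
    """
    n_words = len(text.split())
    max_length = max(floor, int(n_words * max_ratio))
    # Keep min_length at least 1 and strictly below max_length.
    min_length = max(1, min(int(n_words * min_ratio), max_length - 1))
    return min_length, max_length


text = ("Machine learning is a method of data analysis that "
        "automates analytical model building.")
min_len, max_len = length_bounds(text)
print(min_len, max_len)  # prints: 1 6
```

The two values can then be passed to the call from the Code section as `summarizer(text, max_length=max_len, min_length=min_len, do_sample=False)`.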