
NLP Program to Generate Text Using Python and Hugging Face

Load a pre-trained model such as gpt2 with the Hugging Face transformers library in Python, tokenize a prompt, and call model.generate() on the resulting token IDs. For example: outputs = model.generate(input_ids, max_length=50).

📋

Examples

Input: Hello, how are you?

Output: Hello, how are you? I am doing well today and hope you are too.

Input: Once upon a time in a faraway land

Output: Once upon a time in a faraway land, there lived a brave knight who fought dragons.

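Input: (empty prompt)

Output: An empty string encodes to zero tokens, so generation has to start from the model's start token instead (tokenizer.bos_token for GPT-2). The model then produces an unconditioned but coherent sentence.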

🧠

How to Think About It

To generate text, first convert your starting words into numbers the model understands using a tokenizer. Then, feed these numbers into a pre-trained language model that predicts the next words step-by-step. Finally, convert the predicted numbers back to words to get the generated text.
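The three stages can be tried in miniature without downloading a model. The sketch below is a toy stand-in, not GPT-2: it assumes a whitespace tokenizer, a made-up corpus, and a table of bigram counts in place of a neural network. The names encode, decode, and generate are illustrative only.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration); a real model learns from far more text.
corpus = "once upon a time there lived a brave knight once upon a time there was a dragon".split()

# "Tokenizer": map each distinct word to an integer ID and back.
vocab = sorted(set(corpus))
word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}

def encode(text):
    """Turn text into a list of token IDs."""
    return [word_to_id[word] for word in text.split()]

def decode(ids):
    """Turn token IDs back into text."""
    return " ".join(id_to_word[i] for i in ids)

# "Model": count which token follows which in the corpus.
next_counts = defaultdict(Counter)
corpus_ids = [word_to_id[word] for word in corpus]
for current, following in zip(corpus_ids, corpus_ids[1:]):
    next_counts[current][following] += 1

def generate(ids, max_new_tokens):
    """Append the most frequent follower of the last token, one step at a time."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        followers = next_counts[ids[-1]]
        if not followers:  # the last token never had a follower in the corpus
            break
        ids.append(followers.most_common(1)[0][0])
    return ids

prompt_ids = encode("once upon")
output_ids = generate(prompt_ids, max_new_tokens=3)
print(decode(output_ids))  # prints: once upon a time there
```

A real language model replaces the count table with a neural network that scores every token in its vocabulary, but the encode, predict, decode loop has the same shape.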

📐

Algorithm

Load a pre-trained tokenizer and language model.

Convert the input text prompt into token IDs using the tokenizer.

Use the model to generate new token IDs based on the input tokens.

Decode the generated token IDs back into readable text.

Return or print the generated text.
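The steps above can be sketched as one function. This is a minimal sketch, not the only way to structure it: the name generate_text and the default of 30 new tokens are choices made here, and it assumes the tokenizer and model were already loaded with from_pretrained (step 1). It also uses max_new_tokens rather than max_length, so the limit applies only to the generated part.

```python
def generate_text(prompt, tokenizer, model, max_new_tokens=30):
    """Encode the prompt, generate a continuation, and decode it to a string.

    `tokenizer` and `model` are expected to be Hugging Face objects that were
    already loaded (step 1), e.g. with GPT2Tokenizer.from_pretrained("gpt2")
    and GPT2LMHeadModel.from_pretrained("gpt2").
    """
    # Step 2: convert the prompt into token IDs (plus an attention mask).
    inputs = tokenizer(prompt, return_tensors="pt")

    # Step 3: extend the sequence by at most max_new_tokens tokens.
    # GPT-2 has no padding token, so the end-of-sequence ID is reused for padding.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Step 4: decode the first returned sequence back into text.
    # Step 5: return it to the caller.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With a loaded GPT-2 tokenizer and model, calling generate_text("Once upon a time in a faraway land", tokenizer, model) returns the prompt followed by up to 30 new tokens.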

💻

Code

python

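from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 tokenizer and model (downloaded on first use)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Convert the prompt into a PyTorch tensor of token IDs
prompt = "Hello, how are you?"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# max_length counts the prompt tokens too; GPT-2 has no pad token, so reuse EOS
outputs = model.generate(input_ids, max_length=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

# Decode the first (and only) returned sequence back into text
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)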

Output (illustrative; the exact continuation depends on the model version and generation settings)

Hello, how are you? I am doing well today and hope you are too.

🔍

Dry Run

Let's trace the example prompt 'Hello, how are you?' through the code.

Tokenize input

The prompt 'Hello, how are you?' is converted to token IDs like [15496, 11, 703, 389, 345, 30], one ID per token, including the final question mark.

Generate tokens

The model predicts the next tokens one at a time, extending the sequence until it reaches 50 tokens in total (prompt included) or emits an end-of-sequence token.

Decode output

The generated token IDs are converted back to text, producing a coherent sentence.

Step	Token IDs / Text
Tokenize input	[15496, 11, 703, 389, 345, 30]
Generate tokens	[15496, 11, 703, 389, 345, 30, ...] (prompt IDs followed by the newly generated IDs)
Decode output	"Hello, how are you? I am doing well today and hope you are too."
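Because max_length caps the prompt and the continuation together, the room left for new text is the difference between the two. The sketch below is plain arithmetic with a hypothetical helper, new_token_budget, and assumed prompt lengths of 6, 20, and 60 tokens; it does not call the library.

```python
def new_token_budget(prompt_length, max_length):
    """Return how many tokens generate() may add when max_length caps the total."""
    # max_length limits prompt tokens plus generated tokens together,
    # so a prompt at or over the limit leaves no room for new tokens.
    return max(0, max_length - prompt_length)

for prompt_length in (6, 20, 60):
    budget = new_token_budget(prompt_length, max_length=50)
    print(f"prompt of {prompt_length} tokens -> up to {budget} new tokens")
```

Generation can also stop before the budget is used up if the model emits its end-of-sequence token.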

💡

Why This Works

Step 1: Tokenization

The input text is split into tokens and converted to numbers using tokenizer.encode() so the model can understand it.

Step 2: Text generation

The model uses learned patterns to predict the next tokens after the input, generating new text with model.generate().

Step 3: Decoding

The generated token numbers are converted back to readable words using tokenizer.decode().
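Step 2 hides the core operation: at each step the model assigns a score to every token in its vocabulary, and the decoding strategy picks one. The sketch below uses a made-up four-token vocabulary and made-up scores (not real GPT-2 outputs) to show softmax turning scores into probabilities and greedy decoding taking the most likely token, which is the strategy generate() uses for GPT-2 unless sampling or beam search is enabled.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores for the token after "Hello, how are you?"
vocab = [" I", " The", " you", "\n"]
logits = [2.0, 0.5, 0.1, 1.0]

probs = softmax(logits)

# Greedy decoding: take the index with the highest probability.
best_index = max(range(len(probs)), key=probs.__getitem__)

for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")
print("greedy choice:", repr(vocab[best_index]))
```

The chosen token is appended to the sequence and the process repeats, which is why the text grows one token at a time.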

🔄

Alternative Approaches

Use GPT-3 API

python

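# Requires version 1 or later of the openai package and an API key
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY')
# The model name is an assumption; check which completion models your account can use
response = client.completions.create(model='gpt-3.5-turbo-instruct', prompt='Hello, how are you?', max_tokens=50)
print(response.choices[0].text.strip())

This uses a cloud API for text generation. It requires an internet connection and an API key, but no local model setup.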

Use RNN-based model

python

# Train a simple RNN on text data and generate text step-by-step
# (Requires more code and training time, less powerful than transformers)

This is an older approach: the model has to be trained on your own text first, and it generally produces less fluent text than a pre-trained transformer model.
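To make "step-by-step" concrete, here is a minimal sketch of the generation loop of a character-level RNN in plain Python. It is deliberately incomplete: the weights are random and no training happens, so the output is gibberish. The sizes, the tiny vocabulary, and every function name are assumptions chosen for illustration; a usable version would train the weights with a library such as PyTorch.

```python
import math
import random

random.seed(0)  # fixed seed so the (untrained) output is repeatable

chars = sorted(set("hello world"))  # tiny character vocabulary
vocab_size = len(chars)
hidden_size = 8

def random_matrix(rows, cols):
    """Build a rows x cols matrix of small random weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Randomly initialised weights; training would adjust these to fit real text.
W_xh = random_matrix(hidden_size, vocab_size)   # input -> hidden
W_hh = random_matrix(hidden_size, hidden_size)  # hidden -> hidden
W_hy = random_matrix(vocab_size, hidden_size)   # hidden -> output scores

def rnn_step(char_index, hidden):
    """One RNN step: update the hidden state and score every next character."""
    # The input is one-hot, so W_xh times the input is just one column of W_xh.
    new_hidden = [
        math.tanh(W_xh[i][char_index] + sum(W_hh[i][j] * hidden[j] for j in range(hidden_size)))
        for i in range(hidden_size)
    ]
    scores = [sum(W_hy[k][i] * new_hidden[i] for i in range(hidden_size)) for k in range(vocab_size)]
    return scores, new_hidden

def generate(seed_text, length):
    """Feed the seed one character at a time, then keep appending the top-scoring character."""
    hidden = [0.0] * hidden_size
    scores = None
    for ch in seed_text:  # seed_text must be non-empty and use known characters
        scores, hidden = rnn_step(chars.index(ch), hidden)
    out = list(seed_text)
    for _ in range(length):
        next_index = max(range(vocab_size), key=scores.__getitem__)
        out.append(chars[next_index])
        scores, hidden = rnn_step(next_index, hidden)  # feed the choice back in
    return "".join(out)

print(generate("he", 10))
```

The hidden state is the only memory carried between steps, which is one reason RNNs struggle with long-range context compared with transformers.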

⚡

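Complexity: O(n) decoding steps, O(n) space

Time Complexity

The number of decoding steps grows linearly with the number of tokens generated, because the model predicts tokens one after another. Each step also attends to all earlier tokens, so the total attention work grows faster than linearly on long sequences.

Space Complexity

Memory usage grows with the combined input and output length, because the token IDs and the model's cached activations for every position are kept during generation.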
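A short counting sketch shows how the cost scales. The number of forward passes grows linearly with the tokens generated, while the attention work grows faster because each new token looks at everything before it. The cost model here is a simplifying assumption (one forward pass per token, one unit of work per attended position), and decoding_cost is a hypothetical helper, not a measurement of GPT-2.

```python
def decoding_cost(prompt_length, new_tokens):
    """Count forward passes and attended positions under a simplified cost model."""
    forward_passes = 0
    attended_positions = 0
    context = prompt_length
    for _ in range(new_tokens):
        forward_passes += 1            # one model call per generated token
        attended_positions += context  # the new token looks at the whole context
        context += 1                   # the generated token joins the context
    return forward_passes, attended_positions

for n in (10, 100, 1000):
    passes, attended = decoding_cost(prompt_length=6, new_tokens=n)
    print(f"{n:>5} new tokens: {passes:>5} forward passes, {attended:>7} attended positions")
```

Multiplying the number of new tokens by ten multiplies the forward passes by ten but the attended positions by far more, which is the quadratic term that dominates on long outputs.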

Which Approach is Fastest?

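A cloud API such as GPT-3 runs the model on remote hardware, so it can return results faster than a local model running on a CPU, but it needs an internet connection and adds network latency. A local transformer model may be slower but works offline.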

Approach	Time	Space	Best For
Local GPT-2 Model	O(n)	O(n)	Offline use, customizable
GPT-3 API	Faster (cloud)	Depends on API	Quick results, no setup
RNN Model	Slower	Less	Educational, simple tasks

💡

Use max_length to cap the total length in tokens (prompt plus generated text), or max_new_tokens to cap only the newly generated part.

⚠️

Forgetting return_tensors='pt' when encoding the input makes the tokenizer return a plain Python list instead of a PyTorch tensor, which model.generate() cannot accept.