How to Use GPT for Text Generation in NLP

To generate text with GPT, you pass a prompt to the model, which repeatedly predicts the next token until it has produced a coherent continuation. With the Hugging Face transformers library, you do this by calling the model's generate method on the encoded prompt, using parameters such as max_length to limit the output length and temperature to control randomness.

Syntax

The basic syntax to generate text with GPT involves loading a pretrained GPT model and tokenizer, then calling the generate method with a prompt. Key parts include:

  • Tokenizer: Converts text to tokens the model understands.
  • Model: The GPT model that predicts next tokens.
  • generate: Method to produce text continuation.
  • Parameters: max_length caps the total length in tokens (prompt plus generated text); temperature controls randomness and only takes effect when do_sample=True.
python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

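# Encode the prompt into a tensor of token IDs
input_text = "Your prompt here"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Sample a continuation; max_length counts the prompt tokens too.
# GPT-2 defines no pad token, so the end-of-sequence token is reused for padding.
outputs = model.generate(input_ids, max_length=50, temperature=0.7, do_sample=True, pad_token_id=tokenizer.eos_token_id)

# Decode the first (and only) generated sequence back to text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)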
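
To see why a lower temperature gives more conservative text, it helps to look at what temperature does to the next-token probabilities. The following is a minimal sketch, not the library's internal code: it assumes the usual formulation in which the raw scores (logits) are divided by the temperature before a softmax. The function name softmax_with_temperature and the toy_logits values are hypothetical and chosen only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lowest temperature most of the probability mass sits on the top-scoring token, while higher temperatures flatten the distribution and give less likely tokens a better chance of being sampled.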

Example

This example shows how to generate text from a simple prompt using GPT-2. It loads the model and tokenizer, encodes the prompt, generates a sequence of up to 50 tokens in total (prompt included), and decodes it back to readable text.

python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pretrained GPT-2 model and tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Input prompt
prompt = "Once upon a time"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

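# Generate text by sampling from the 50 most likely tokens at each step
# (pad_token_id is set explicitly because GPT-2 defines no pad token)
outputs = model.generate(input_ids, max_length=50, temperature=0.8, do_sample=True, top_k=50, pad_token_id=tokenizer.eos_token_id)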

# Decode and print generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Output (sampled, so your text will differ)
Once upon a time, there was a small village nestled in the mountains. The villagers lived peacefully, unaware of the secrets hidden deep within the forest.
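
The top_k=50 argument in the call above restricts each sampling step to the 50 most likely tokens. The idea can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's implementation: the helper top_k_sample, the five-word vocabulary, and the probabilities are all hypothetical.

```python
import random

def top_k_sample(probs, k, rng):
    """Sample an index from the k highest-probability entries of probs."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    # Indices of the k most likely tokens
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices normalizes the weights, so no manual renormalization is needed
    return rng.choices(top, weights=weights, k=1)[0]

# Hypothetical next-token distribution over a five-word vocabulary
vocab = ["the", "a", "village", "dragon", "xylophone"]
probs = [0.40, 0.30, 0.15, 0.10, 0.05]

rng = random.Random(0)  # fixed seed so the run is repeatable
picks = [vocab[top_k_sample(probs, k=2, rng=rng)] for _ in range(10)]
print(picks)
```

With k=2 only "the" and "a" can ever be chosen, which is how a small top_k keeps unlikely tokens out of the output; a larger k allows more diversity.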

Common Pitfalls

Common mistakes when using GPT for text generation include:

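  • Not setting do_sample=True when you want creative output. Without it, generate uses greedy decoding, which is deterministic and tends to produce repetitive text.
  • Setting max_length too high, which leads to very long or nonsensical text. Note that max_length counts the prompt tokens as well as the generated ones.
  • Ignoring temperature and top_k, which control randomness and diversity but only take effect when do_sample=True.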
  • Not handling special tokens properly when decoding output.
python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Wrong: deterministic output without sampling
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

prompt = "Hello"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Without do_sample=True, generate() uses greedy decoding, which is deterministic and tends to repeat
outputs = model.generate(input_ids, max_length=30, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Right: enable sampling for varied output
outputs_sample = model.generate(input_ids, max_length=30, do_sample=True, temperature=0.7, top_k=40, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs_sample[0], skip_special_tokens=True))
Output
Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello
Hello there! How are you doing today? I hope everything is going well and you are having a great time.
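
The difference between the two calls can be reproduced without downloading a model. The sketch below replaces GPT-2 with a tiny hand-written lookup table (TOY_MODEL, a hypothetical stand-in that only looks at the previous token, whereas a real model conditions on the whole context). Greedy decoding always takes the most likely next token and gets stuck in a loop, while sampling can escape it.

```python
import random

# Hypothetical next-token probabilities, keyed by the current token
TOY_MODEL = {
    "Hello": {"Hello": 0.5, "there": 0.3, "world": 0.2},
    "there": {"Hello": 0.4, "friend": 0.6},
    "world": {"Hello": 0.7, "there": 0.3},
    "friend": {"Hello": 1.0},
}

def generate(start, steps, do_sample=False, rng=None):
    """Extend `start` by `steps` tokens using greedy decoding or sampling."""
    tokens = [start]
    for _ in range(steps):
        dist = TOY_MODEL[tokens[-1]]
        if do_sample:
            choices, weights = zip(*dist.items())
            nxt = rng.choices(choices, weights=weights, k=1)[0]
        else:
            # Greedy: always take the single most likely token
            nxt = max(dist, key=dist.get)
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("Hello", 8))  # greedy: the same token over and over
print(generate("Hello", 8, do_sample=True, rng=random.Random(0)))  # sampled
```

Calling the greedy version twice always returns the same string, whereas the sampled version changes with the seed passed to random.Random.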

Quick Reference

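| Parameter | Description | Typical Values |
| --- | --- | --- |
| max_length | Maximum total length in tokens (prompt plus generated text) | 20-100 tokens |
| temperature | Controls randomness when sampling; lower is more conservative, higher is more creative | 0.6 - 1.0 |
| do_sample | Enables sampling for varied output; False gives greedy decoding | True or False |
| top_k | Limits sampling to the K most likely tokens | 10-50 |
| skip_special_tokens | Removes special tokens when decoding | True |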
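
Because max_length in transformers bounds the prompt and the generated tokens together, a long prompt can leave little or no room for new text. The arithmetic can be sketched as below; the two helper functions and the prompt lengths are hypothetical and exist only to illustrate the bookkeeping. The generate method also accepts max_new_tokens, which limits only the generated part and avoids this calculation.

```python
def new_token_budget(prompt_length, max_length):
    """Tokens left for generation when max_length caps prompt + output."""
    return max(0, max_length - prompt_length)

def max_length_for(prompt_length, max_new_tokens):
    """Total length cap that leaves room for max_new_tokens generated tokens."""
    return prompt_length + max_new_tokens

print(new_token_budget(prompt_length=4, max_length=50))     # 46
print(new_token_budget(prompt_length=60, max_length=50))    # 0
print(max_length_for(prompt_length=60, max_new_tokens=40))  # 100
```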

Key Takeaways

  • Use GPT's generate method with a prompt to produce text continuations.
  • Set do_sample=True and adjust temperature for creative and varied outputs.
  • Control output length with max_length, which counts the prompt tokens as well as the generated ones.
  • Decode generated tokens with skip_special_tokens=True for clean text.
  • Experiment with top_k and temperature to balance coherence and creativity.