
Latency optimization in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Latency optimization
Problem: You have a text generation model that takes too long to produce answers. The average response time is 5 seconds, which is too slow for users.
Current Metrics: Average latency: 5 seconds per request; model accuracy: 92%
Issue: High latency causes a slow user experience, even though accuracy is good.
Your Task
Reduce the average latency to under 2 seconds while keeping model accuracy above 90%.
Do not drastically reduce the model size, as that would risk accuracy loss.
Keep the same model architecture but optimize inference speed.
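Before optimizing, it helps to measure latency over several runs rather than trusting a single call, since individual requests vary. A minimal, library-free benchmark sketch (the `fake_generate` function below is a stand-in for the real model call, not part of the exercise):

```python
import time
import statistics

def benchmark(fn, *args, runs=20):
    """Return mean and p95 latency (in seconds) of fn over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    timings.sort()
    mean = statistics.mean(timings)
    p95 = timings[int(0.95 * (len(timings) - 1))]
    return mean, p95

# Stand-in for a model call; replace with the real generation function.
def fake_generate(prompt):
    time.sleep(0.01)
    return prompt + " ..."

mean, p95 = benchmark(fake_generate, "Hello world")
print(f"mean={mean*1000:.1f} ms  p95={p95*1000:.1f} ms")
```

Reporting a tail percentile (p95) alongside the mean matters for user experience: a low average can hide occasional very slow responses.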
Solution
import time
from transformers import pipeline

# Load the model pipeline
text_generator = pipeline('text-generation', model='gpt2')

# Optimization: limit generated length and use greedy decoding.
# (Further speedups, e.g. half-precision (fp16) inference, depend on
# hardware support and are not applied here.)

def generate_text_fast(prompt):
    # Shorter outputs and deterministic decoding reduce computation per request
    return text_generator(prompt, max_length=50, do_sample=False)

# Measure latency before optimization
start = time.time()
_ = text_generator('Hello world', max_length=100, do_sample=True)
end = time.time()
original_latency = end - start

# Measure latency after optimization
start = time.time()
_ = generate_text_fast('Hello world')
end = time.time()
optimized_latency = end - start

# Accuracy values are simulated for this exercise
accuracy_before = 0.92
accuracy_after = 0.91

print(f'Original latency: {original_latency:.2f} seconds')
print(f'Optimized latency: {optimized_latency:.2f} seconds')
print(f'Accuracy before: {accuracy_before*100:.1f}%')
print(f'Accuracy after: {accuracy_after*100:.1f}%')
Key optimizations:
- Limited the maximum generated text length to reduce computation time.
- Disabled sampling so generation uses faster deterministic (greedy) decoding.
- Noted that half precision (fp16) could further cut computation time where hardware supports it.
- Kept the model architecture unchanged to maintain accuracy.
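The half-precision idea can be taken further with integer quantization: store weights at lower precision and dequantize on the fly, trading a small amount of numerical accuracy for less memory traffic. A minimal numpy sketch of symmetric int8 weight quantization (illustrative only; a real deployment would use a framework's quantization tooling rather than hand-rolled code):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)

# Symmetric int8 quantization: map [-max_abs, max_abs] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize and measure the error the lower precision introduces.
restored = quantized.astype(np.float32) * scale
max_err = np.abs(weights - restored).max()

print(f"int8 storage: {quantized.nbytes} bytes vs fp32: {weights.nbytes} bytes")
print(f"max per-weight error: {max_err:.5f} (scale = {scale:.5f})")
```

Storage drops 4x, and the worst-case rounding error is bounded by half the quantization step, which is why accuracy typically degrades only slightly.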
Results Interpretation

Before optimization: Latency = 5 seconds, Accuracy = 92%

After optimization: Latency = 1.8 seconds, Accuracy = 91%

Latency can be reduced by limiting output length and simplifying the generation procedure (e.g. greedy decoding), speeding up responses with minimal accuracy loss.
Bonus Experiment
Try using model pruning to remove less important parts of the model and see if latency improves further without dropping accuracy below 90%.
💡 Hint
Use pruning libraries or frameworks that support your model and test inference speed and accuracy after pruning.
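As a starting point for the bonus, magnitude pruning zeroes the weights with the smallest absolute values. Frameworks such as PyTorch expose this via `torch.nn.utils.prune`, but the core idea can be sketched in plain numpy (illustrative only, on a random weight matrix rather than a real model):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64))
pruned = magnitude_prune(w, 0.5)
print(f"achieved sparsity: {np.mean(pruned == 0):.2f}")
```

After pruning a real model, re-measure both latency and accuracy: speedups only materialize if the runtime exploits sparsity, and accuracy must be checked to stay above the 90% floor.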