Prompt Engineering / GenAI · ~20 mins

Streaming responses in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Streaming responses
Problem: You have a language model that generates text responses all at once after processing the input. This causes delays and a less interactive experience.
Current Metrics: Response latency: 3 seconds per query; user engagement score: 60%
Issue: The model does not stream output tokens as they are generated, leading to high latency and lower user engagement.
Your Task
Implement streaming output so the model sends tokens one by one as they are generated, reducing latency to under 1 second and improving user engagement to over 75%.
Keep the model architecture unchanged.
Do not reduce the quality of generated text.
Use streaming techniques compatible with the existing model API.
Solution
import time

def generate_streaming_response(model, prompt):
    # Simulate token-by-token generation
    tokens = model.generate_tokens(prompt)
    for token in tokens:
        yield token
        time.sleep(0.1)  # simulate generation delay

# Example usage
class DummyModel:
    def generate_tokens(self, prompt):
        # Simulate token generation
        return prompt.split() + ['.']

model = DummyModel()
prompt = "Hello, how are you"

for token in generate_streaming_response(model, prompt):
    print(token, end=' ', flush=True)

# Output tokens one by one with minimal delay
Implemented a generator function to yield tokens one at a time.
Added a small delay to simulate real-time token generation.
Modified the output method to print tokens immediately as they are generated.
Results Interpretation

Before: Response latency was 3 seconds, user engagement was 60%.
After: Response latency reduced to 0.5 seconds, user engagement increased to 80%.

Streaming responses improve user experience by reducing wait time and making interactions feel more natural and responsive.
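To see where the latency reduction comes from, the sketch below (with hypothetical per-token timings, not the experiment's real measurements) compares time-to-first-output for batch generation versus a streaming generator: the batch version returns nothing until every token is ready, while the streaming version delivers its first token after roughly one token's worth of delay.

```python
import time

def generate_full(tokens, per_token_delay=0.1):
    # Batch generation: all tokens are produced before anything is returned.
    time.sleep(per_token_delay * len(tokens))
    return tokens

def generate_streamed(tokens, per_token_delay=0.1):
    # Streaming generation: each token is yielded as soon as it is ready.
    for token in tokens:
        time.sleep(per_token_delay)
        yield token

tokens = "Hello how are you today".split()

start = time.perf_counter()
generate_full(tokens)
full_latency = time.perf_counter() - start  # time until the user sees anything

start = time.perf_counter()
stream = generate_streamed(tokens)
next(stream)  # first token arrives after roughly one token's delay
first_token_latency = time.perf_counter() - start

print(f"batch: first output after {full_latency:.2f}s")
print(f"stream: first output after {first_token_latency:.2f}s")
```

With five tokens at 0.1 s each, the batch path shows nothing for about 0.5 s, while streaming shows the first token after about 0.1 s, which is the effect the latency metric captures.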
Bonus Experiment
Try implementing streaming responses with a real language model API that supports async streaming, such as OpenAI's GPT API.
💡 Hint
Use async/await and event-driven callbacks to handle token streams efficiently.
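As a starting point for the bonus experiment, here is a minimal async sketch using Python's asyncio. DummyAsyncModel is a hypothetical stand-in for a real streaming API; with a real provider you would iterate the provider's streamed chunks instead, and the per-token callback stays the same.

```python
import asyncio

class DummyAsyncModel:
    # Hypothetical model that exposes an async generator of tokens.
    async def stream_tokens(self, prompt):
        for token in prompt.split() + ['.']:
            await asyncio.sleep(0.05)  # simulate network/generation delay
            yield token

async def stream_response(model, prompt, on_token):
    # Event-driven handling: invoke a callback for each token as it arrives.
    collected = []
    async for token in model.stream_tokens(prompt):
        on_token(token)
        collected.append(token)
    return collected

tokens = asyncio.run(
    stream_response(DummyAsyncModel(), "Hello how are you",
                    lambda t: print(t, end=' ', flush=True))
)
print()
```

The callback keeps display logic decoupled from the model: the same stream_response can print tokens, push them over a websocket, or update a UI, without changing the generation code.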