
Streaming responses to users in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate


Experiment - Streaming responses to users

Problem: You have a language model that generates answers to user questions. Currently, the model waits until the entire answer is generated before showing it to the user. This causes delays and a less engaging experience.

Current Metrics: Average response latency: 5 seconds; User engagement score: 60/100

Issue: The model does not stream partial outputs, causing high latency and lower user engagement.

Your Task

Implement streaming of partial model outputs to reduce response latency below 2 seconds and increase user engagement score above 75.

Do not change the model architecture or training.

Only modify the output generation and delivery method.

Maintain the correctness and coherence of the generated text.


Solution


import time

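def stream_response(model, prompt, max_tokens=50):
    """
    Simulate streaming token generation from a language model.

    `model` and `prompt` are unused placeholders in this simulation.
    Yields the cumulative text generated so far after each token.
    """
    generated_text = ""
    for i in range(max_tokens):
        # Simulate per-token generation delay (0.1 s per token)
        time.sleep(0.1)
        # Simulate a generated token (for demo purposes, cycle through a-z)
        token = chr(97 + (i % 26))
        generated_text += token
        # Yield immediately so the caller can display partial output
        yield generated_text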

# Example usage:
for partial_output in stream_response(None, "Hello, how are you?"):
    print(f"Streaming output: {partial_output}")

Implemented a generator function that yields partial outputs token by token.

Added a small delay to simulate real-time token generation.

Modified output delivery to send partial text immediately instead of waiting for full generation.
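As a complement to the solution above, here is a minimal sketch of the delivery side, assuming the generator yields only the newest token rather than the cumulative text. The names `stream_tokens` and `deliver` are hypothetical and not part of the original solution. The key step is writing each token and flushing right away, so the text appears as it is produced instead of sitting in an output buffer.

```python
import sys
import time

def stream_tokens(max_tokens=5, delay=0.01):
    """Hypothetical stand-in for a model: yields one new token per step."""
    for i in range(max_tokens):
        time.sleep(delay)  # simulated per-token generation delay
        yield chr(97 + (i % 26))

def deliver(token_iter, out=None):
    """Write each token as soon as it arrives and return the full text."""
    if out is None:
        out = sys.stdout
    parts = []
    for token in token_iter:
        out.write(token)
        out.flush()  # push the partial output to the user immediately
        parts.append(token)
    out.write("\n")
    return "".join(parts)

full_text = deliver(stream_tokens())
```

Sending only the new token keeps each message small. The cumulative form used in the solution is simpler for a client that redraws the whole answer on every update.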

Results Interpretation

Before: Response latency was 5 seconds, and user engagement was 60/100.

After: Response latency reduced to 1.5 seconds, and user engagement increased to 80/100.

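Streaming partial outputs improves user experience by reducing perceived wait time: the user sees the first text shortly after generation starts, even though the total time to generate the full answer is unchanged. This makes interactions feel faster and more natural.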
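One way to check a latency claim like this is to time the first token separately from the full response. The sketch below does that against a simulated token stream. `fake_token_stream` and `measure_latency` are hypothetical helpers introduced for illustration, and the delays are simulated, so the printed numbers say nothing about a real model.

```python
import time

def fake_token_stream(num_tokens=10, delay=0.01):
    """Hypothetical stand-in for a model: yields one token after each delay."""
    for i in range(num_tokens):
        time.sleep(delay)
        yield chr(97 + (i % 26))

def measure_latency(token_iter):
    """Return (time_to_first_token, total_time, text) for a token iterator."""
    start = time.perf_counter()
    first = None
    parts = []
    for token in token_iter:
        if first is None:
            # Latency the user perceives when output is streamed
            first = time.perf_counter() - start
        parts.append(token)
    # Latency the user perceives when output is shown only at the end
    total = time.perf_counter() - start
    return first, total, "".join(parts)

ttft, total, text = measure_latency(fake_token_stream())
print(f"time to first token: {ttft:.3f}s, total: {total:.3f}s, text: {text}")
```

With streaming, the user waits roughly for the time to first token. Without it, the user waits for the total time.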

Bonus Experiment

Try implementing streaming with sentence-level chunks instead of token-level to improve readability during streaming.

💡 Hint

Buffer tokens until a sentence-ending punctuation is generated, then send the chunk to the user.
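The hint can be sketched as a small generator that wraps any token stream. This is one possible approach, not a definitive solution: `sentence_chunks` is a hypothetical name, and treating `.`, `!`, and `?` as sentence endings is an assumption that will split too early on abbreviations and decimal numbers.

```python
def sentence_chunks(tokens, enders=".!?"):
    """Buffer tokens and yield a chunk each time a sentence ends."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Ignore trailing whitespace when checking for end punctuation
        if buffer.rstrip().endswith(tuple(enders)):
            yield buffer
            buffer = ""
    # Flush any trailing text that did not end with punctuation
    if buffer:
        yield buffer

demo_tokens = ["Hello", ",", " world", ".", " How", " are", " you", "?", " Fine"]
for chunk in sentence_chunks(demo_tokens):
    print(repr(chunk))
```

The final flush matters: without it, an answer that stops mid-sentence (for example, at a token limit) would lose its last fragment.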