Introduction
Imagine waiting for a long message that only appears once every word of it is ready. Streaming responses avoid that wait by sending each part of the message as soon as it is ready, so you start seeing the answer while the rest is still being produced.
Imagine watching a movie online that starts playing while the rest is still downloading. You don’t wait for the full movie to download before watching; it streams so you see it bit by bit.
┌───────────────┐       ┌───────────────┐
│    Server     │──────▶│    Client     │
│ (prepares     │       │ (receives and │
│  response     │       │  displays     │
│  in chunks)   │       │  chunks live) │
└───────────────┘       └───────────────┘
       ▲                         ▲
       │                         │
       └─ Continuous connection ─┘
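Streaming is enabled by passing stream=True:

    response = model.generate(prompt, stream=True)
    for chunk in response:
        print(chunk)  # handle each chunk as soon as it arrives

With stream=True, the response is an iterable that yields chunks as they arrive. The following code, by contrast, prints the response object itself rather than the generated text:

    response = model.generate(prompt, stream=True)
    print(response)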
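To make the difference between the two return types concrete, here is a minimal sketch. The function fake_generate is a hypothetical stand-in for model.generate, not a real library call: it simply echoes the prompt word by word, and it assumes that a non-streaming call returns a plain string.

```python
from typing import Iterator, Union


def fake_generate(prompt: str, stream: bool = False) -> Union[str, Iterator[str]]:
    """Hypothetical stand-in for model.generate; it echoes the prompt word by word."""
    words = prompt.split()

    def chunks() -> Iterator[str]:
        for word in words:
            yield word + " "  # a real model would yield each piece as it is generated

    if stream:
        return chunks()  # lazy iterator: nothing is produced until it is consumed
    return "".join(chunks())  # non-streaming: the full string, built up front


streamed = fake_generate("streaming sends text in pieces", stream=True)
print(isinstance(streamed, str))  # False: an iterator, not a complete string

for chunk in streamed:
    print(chunk, end="", flush=True)  # show each piece as soon as it arrives
print()

full = fake_generate("streaming sends text in pieces")
print(isinstance(full, str))  # True
```

Passing end="" and flush=True to print keeps the chunks on one line and pushes each one to the screen immediately, which is usually what a live display needs.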
With stream=True, the response is an iterable, not a complete string. stream=True allows receiving data chunks as they are generated. Using stream=True but collecting all chunks in a list before printing defeats real-time display. Setting stream=False waits for the full response. Using a timer without streaming is inefficient.
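The point about collecting chunks in a list can be seen by recording the order of events. In this sketch, fake_stream and display are hypothetical helpers that stand in for a streamed response and for printing to the screen. The shared events list only logs what happened and when.

```python
from typing import Iterator, List

events: List[str] = []  # records the order in which things happen


def fake_stream(pieces: List[str]) -> Iterator[str]:
    """Hypothetical chunk source standing in for a streamed model response."""
    for piece in pieces:
        events.append(f"generated {piece!r}")
        yield piece


def display(chunk: str) -> None:
    """Stand-in for printing a chunk to the screen."""
    events.append(f"displayed {chunk!r}")


# Incremental: each chunk is displayed before the next one is generated.
for chunk in fake_stream(["Hel", "lo"]):
    display(chunk)
incremental = events.copy()

# Buffered: every chunk is generated before anything is displayed.
events.clear()
buffered_chunks = list(fake_stream(["Hel", "lo"]))
for chunk in buffered_chunks:
    display(chunk)
buffered = events.copy()

print(incremental)
print(buffered)
```

In the incremental log, "generated" and "displayed" entries alternate. In the buffered log, all "generated" entries come first, so the reader sees nothing until the last chunk exists, which is the same experience as not streaming at all.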