How to Stream Responses in LangChain: Simple Guide
To stream responses in LangChain, set streaming=True in your language model setup and provide a callback_manager that handles tokens as they arrive. This allows your app to receive and display output incrementally instead of waiting for the full response.
Syntax
Streaming in LangChain requires setting streaming=True when creating the language model instance. You also need to pass a callback_manager with a handler that listens for new tokens and processes them as they stream in.
Key parts:
- streaming=True: Enables streaming mode.
- callback_manager: Manages callbacks for token events.
- on_llm_new_token: Method called for each new token received.
```python
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.base import BaseCallbackHandler
from langchain.callbacks.manager import CallbackManager

class StreamHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called once per token as it arrives; print it immediately.
        print(token, end='', flush=True)

handler = StreamHandler()
callback_manager = CallbackManager([handler])

chat = ChatOpenAI(streaming=True, callback_manager=callback_manager)
response = chat.predict('Say hello in a streaming way.')
```
Output
Hello! How can I help you today? (example output; the reply streams token by token, and exact wording varies by model run)
Example
This example shows how to create a streaming chat model with LangChain that prints tokens as they arrive. The StreamHandler class handles each new token by printing it immediately, producing real-time output.
```python
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.base import BaseCallbackHandler
from langchain.callbacks.manager import CallbackManager

class StreamHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Print each token as soon as it arrives.
        print(token, end='', flush=True)

handler = StreamHandler()
callback_manager = CallbackManager([handler])

chat = ChatOpenAI(streaming=True, callback_manager=callback_manager)
response = chat.predict('Explain streaming responses simply.')
```
Output
Streaming means the answer appears a few words at a time while it is being generated, instead of all at once. (example output; exact wording varies by model run)
Common Pitfalls
Common mistakes when streaming responses in LangChain include:
- Not setting streaming=True, so the model waits and returns the full response at once.
- Omitting the callback_manager, so no tokens are received incrementally.
- Using a callback handler that does not implement on_llm_new_token, so token events are missed.
Always ensure your callback handler properly processes tokens and that streaming is enabled.
```python
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.base import BaseCallbackHandler
from langchain.callbacks.manager import CallbackManager

# Wrong: streaming disabled and no callbacks,
# so the call blocks until the full response is ready.
chat = ChatOpenAI()
response = chat.predict('Hello')  # waits for the full response

# Right: streaming enabled with a callback handler.
class StreamHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(token, end='', flush=True)

handler = StreamHandler()
callback_manager = CallbackManager([handler])
chat = ChatOpenAI(streaming=True, callback_manager=callback_manager)
response = chat.predict('Hello')  # tokens print as they arrive
```
Output
Hello! How can I assist you today? (streamed token by token when streaming is enabled; exact wording varies by model run)
Quick Reference
| Feature | Description |
|---|---|
| streaming=True | Enables streaming mode for incremental output |
| callback_manager | Handles events like new tokens during streaming |
| on_llm_new_token(token) | Callback method called for each new token |
| Print tokens immediately | Use print with flush=True for real-time display |
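If you need the streamed text inside your app rather than on the console, the same on_llm_new_token hook can accumulate tokens into a buffer. The sketch below is a minimal, hypothetical example: BufferingHandler and its text attribute are names chosen here for illustration, and the import paths assume the same LangChain version used in the examples above.

```python
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.base import BaseCallbackHandler
from langchain.callbacks.manager import CallbackManager

class BufferingHandler(BaseCallbackHandler):
    """Hypothetical handler that collects streamed tokens into a string."""

    def __init__(self) -> None:
        self.text = ""  # accumulated response so far

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Append each token; a UI could re-render self.text here.
        self.text += token

handler = BufferingHandler()
chat = ChatOpenAI(streaming=True, callback_manager=CallbackManager([handler]))
chat.predict('Summarize streaming in one sentence.')
print(handler.text)  # full text assembled from the streamed tokens
```

The same idea works for updating a chat UI or pushing partial text over a websocket: each call to on_llm_new_token is your chance to forward the latest piece of the response.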
Key Takeaways
- Enable streaming by setting streaming=True on your LangChain model.
- Use a callback_manager with a handler implementing on_llm_new_token to process tokens as they arrive.
- Streaming lets you show output token by token instead of waiting for the full response.
- Without streaming or callbacks, LangChain returns the full response only after completion.
- Print tokens with flush=True to see them appear immediately in your app or console.
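For async applications, such as a web server handling many requests at once, the same pattern has an async form. The sketch below is an assumption-laden variant: it relies on AsyncCallbackHandler, AsyncCallbackManager, and the apredict coroutine, whose availability and import paths depend on your LangChain version, so treat it as a starting point rather than a definitive recipe.

```python
import asyncio

from langchain.chat_models import ChatOpenAI
from langchain.callbacks.base import AsyncCallbackHandler
from langchain.callbacks.manager import AsyncCallbackManager

class AsyncStreamHandler(AsyncCallbackHandler):
    # Async counterpart of StreamHandler above; the class name is illustrative.
    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(token, end='', flush=True)

async def main() -> None:
    handler = AsyncStreamHandler()
    chat = ChatOpenAI(
        streaming=True,
        callback_manager=AsyncCallbackManager([handler]),
    )
    # apredict is the async counterpart of predict in older LangChain releases.
    await chat.apredict('Say hello in a streaming way.')

asyncio.run(main())
```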