How to Load a Website in LangChain: Simple Guide
To load a website in LangChain, use the WebBaseLoader class, which fetches and parses the webpage content. Instantiate it with the website URL and call load() to get the text content ready for processing.
Syntax
The WebBaseLoader class in LangChain loads website content by fetching the HTML and extracting readable text.
Key parts:
- WebBaseLoader(url): Create a loader for the given website URL.
- load(): Fetch and return the website's text content as documents.
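```python
# Current LangChain releases ship WebBaseLoader in the langchain-community package;
# older releases exposed it as langchain.document_loaders.WebBaseLoader.
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com")
documents = loader.load()
```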
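To make "fetching the HTML and extracting readable text" more concrete, here is a minimal standard-library sketch of the extraction step alone. It is not WebBaseLoader's implementation (the loader relies on requests and BeautifulSoup); the names TextExtractor, html_to_text, and sample_html are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical static page used as input, so no network access is needed.
sample_html = """<!doctype html>
<html>
<head><title>Example Domain</title><style>body { margin: 0; }</style></head>
<body>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents.</p>
</body>
</html>"""

print(html_to_text(sample_html))
```

The tags disappear and only the title, heading, and paragraph text remain, which is the kind of content a loader hands on as a document's page_content.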
Example
This example shows how to load the text content from a website using LangChain's WebBaseLoader. It prints the first 500 characters of the loaded content.
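```python
# Current LangChain releases ship WebBaseLoader in the langchain-community package;
# older releases exposed it as langchain.document_loaders.WebBaseLoader.
from langchain_community.document_loaders import WebBaseLoader

# Create a loader for the website URL
loader = WebBaseLoader("https://www.example.com")

# Load the website content
documents = loader.load()

# Print the first 500 characters of the content
print(documents[0].page_content[:500])
```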
Output
The printed text contains the page's readable content without HTML tags. Exact wording and whitespace depend on the live page; for the page shown here it looks roughly like:
Example Domain
Example Domain
This domain is established to be used for illustrative examples in documents.
Common Pitfalls
Common mistakes when loading websites in LangChain include:
- Using URLs that block bots or require authentication, causing load() to fail.
- Expecting WebBaseLoader to handle JavaScript-rendered content; it only fetches static HTML.
- Not handling network errors or timeouts when fetching the website.
Always verify the URL is accessible and consider using other loaders or tools for dynamic content.
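```python
# Current LangChain releases ship WebBaseLoader in the langchain-community package;
# older releases exposed it as langchain.document_loaders.WebBaseLoader.
from langchain_community.document_loaders import WebBaseLoader

# Wrong: using a URL that requires login or blocks scraping
loader = WebBaseLoader("https://example.com/private")
documents = loader.load()  # This may fail or return empty content

# Right: use a public, accessible URL
loader = WebBaseLoader("https://www.example.com")
documents = loader.load()
```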
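The third pitfall, unhandled network errors, can be addressed with a small retry wrapper. The sketch below is one possible approach, not a LangChain feature: load_with_retry, its parameters, and the FlakyLoader stand-in are hypothetical names, and the stand-in replaces WebBaseLoader so the example runs without network access.

```python
import time


def load_with_retry(make_loader, url, attempts=3, delay_seconds=1.0):
    """Call make_loader(url).load(), retrying when it raises.

    make_loader is any callable that returns an object with a load()
    method, for example the WebBaseLoader class itself.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return make_loader(url).load()
        except Exception as error:  # broad on purpose: failure types vary by site
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"Could not load {url} after {attempts} attempts") from last_error


class FlakyLoader:
    """Hypothetical stand-in that fails twice, then succeeds."""

    calls = 0

    def __init__(self, url):
        self.url = url

    def load(self):
        FlakyLoader.calls += 1
        if FlakyLoader.calls < 3:
            raise ConnectionError("simulated network failure")
        return [f"content of {self.url}"]


documents = load_with_retry(FlakyLoader, "https://www.example.com", delay_seconds=0)
print(documents)
```

With LangChain installed, the same wrapper could be called as load_with_retry(WebBaseLoader, "https://www.example.com"). Retrying helps with transient failures only; a page that requires login or blocks bots will keep failing.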
Quick Reference
- WebBaseLoader(url): Initialize with website URL.
- load(): Fetch and parse website content.
- Use for static HTML pages only.
- Handle exceptions for network issues.
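Since the loader is meant for static HTML, it can help to check whether a load returned suspiciously little text, which is a common symptom of a JavaScript-rendered page. The check below is only a rough heuristic under stated assumptions: looks_mostly_empty and the 200-character threshold are hypothetical choices, not part of LangChain, and short static pages will also trip it.

```python
def looks_mostly_empty(page_text: str, min_chars: int = 200) -> bool:
    """Return True when the text, with whitespace collapsed, is shorter than min_chars."""
    visible = " ".join(page_text.split())
    return len(visible) < min_chars


# Hypothetical extracted text from a JavaScript shell page and from a static article.
js_shell_text = "\n\n  Loading...\n\n"
static_text = "LangChain loaders turn sources into documents. " * 10

print(looks_mostly_empty(js_shell_text))
print(looks_mostly_empty(static_text))
```

In practice the function would be applied to documents[0].page_content after load(), with the threshold tuned to the pages being loaded.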
Key Takeaways
Use WebBaseLoader with the website URL to load static HTML content in LangChain.
Call load() on the loader to get the website text as documents.
WebBaseLoader does not support JavaScript-rendered pages; use other tools if needed.
Ensure the website is publicly accessible to avoid loading errors.
Handle network errors gracefully when fetching website content.