How to Load a Website in LangChain: Simple Guide

To load a website in LangChain, use the WebBaseLoader class, which fetches a web page and extracts its text. Instantiate it with the website URL and call load() to get the content as a list of Document objects ready for processing.

📐

Syntax

The WebBaseLoader class in LangChain loads website content by fetching the HTML and extracting readable text.

Key parts:

WebBaseLoader(url): Create a loader for the given website URL.
load(): Fetch the page and return its text as a list of Document objects (each has page_content and metadata).

python

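# Requires: pip install langchain-community beautifulsoup4
# Older releases exposed this class as langchain.document_loaders.WebBaseLoader
from langchain_community.document_loaders import WebBaseLoader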

loader = WebBaseLoader("https://example.com")
documents = loader.load()
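
To make "fetching the HTML and extracting readable text" more concrete, here is a minimal sketch of the text-extraction step using only Python's standard library. It is not LangChain's implementation (WebBaseLoader relies on BeautifulSoup internally); the names TextExtractor and html_to_text are hypothetical, and the HTML string is a made-up sample so the code runs without a network connection.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page (an assumption for illustration, not fetched from the web)
sample_html = (
    "<html><head><title>Example Domain</title>"
    "<style>body { margin: 0; }</style></head>"
    "<body><h1>Example Domain</h1>"
    "<p>Illustrative text.</p></body></html>"
)
print(html_to_text(sample_html))
```

The point of the sketch is that tags are dropped and only text remains, which is why the loaded documents contain plain text rather than markup.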

💻

Example

This example shows how to load the text content from a website using LangChain's WebBaseLoader. It prints the first 500 characters of the loaded content.

python

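# Requires: pip install langchain-community beautifulsoup4
# Older releases exposed this class as langchain.document_loaders.WebBaseLoader
from langchain_community.document_loaders import WebBaseLoader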

# Create a loader for the website URL
loader = WebBaseLoader("https://www.example.com")

# Load the website content
documents = loader.load()

# Print the first 500 characters of the content
print(documents[0].page_content[:500])

Output

The loader returns extracted text rather than raw HTML, so the result looks roughly like this (exact wording and blank lines depend on the live page):

Example Domain
Example Domain
This domain is established to be used for illustrative examples in documents.
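
Each item returned by load() is a Document with a page_content string and a metadata dictionary; WebBaseLoader records the page URL under the "source" key. The sketch below shows one way to inspect a list of documents. The helper summarize_documents is hypothetical, and the fake_documents list is a stand-in for real loader output so the code runs offline.

```python
from types import SimpleNamespace


def summarize_documents(documents, preview_chars=80):
    """Return one summary line per document: source, length, and a short preview."""
    lines = []
    for doc in documents:
        source = doc.metadata.get("source", "<unknown>")
        text = doc.page_content
        # Collapse runs of whitespace so the preview fits on one line
        preview = " ".join(text.split())[:preview_chars]
        lines.append(f"{source} ({len(text)} chars): {preview}")
    return lines


# Stand-in for the list returned by loader.load() (assumed shape, not real output)
fake_documents = [
    SimpleNamespace(
        page_content="\n\nExample Domain\n\nIllustrative   text.\n",
        metadata={"source": "https://www.example.com"},
    )
]

for line in summarize_documents(fake_documents):
    print(line)
```

With a real loader, you would pass the result of loader.load() instead of fake_documents.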

⚠️

Common Pitfalls

Common mistakes when loading websites in LangChain include:

Using URLs that block bots or require authentication: load() may raise an error or return a login or error page instead of the content you expected.
Expecting WebBaseLoader to handle JavaScript-rendered content; it only fetches static HTML.
Not handling network errors or timeouts when fetching the website.

Always verify that the URL is publicly accessible, and use a browser-based loader or scraping tool for pages that render their content with JavaScript.
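
A page that depends on JavaScript often loads "successfully" but yields almost no text. One rough way to catch this is to check the length of the extracted text. The sketch below is only a heuristic: the helper name looks_unrendered, the 200-character threshold, and the "enable javascript" phrase check are all assumptions chosen for illustration.

```python
def looks_unrendered(text: str, min_chars: int = 200) -> bool:
    """Heuristic: True if the extracted text is suspiciously thin.

    The threshold and the phrase check are arbitrary illustrative choices.
    """
    normalized = " ".join(text.split())  # collapse whitespace before measuring
    if len(normalized) < min_chars:
        return True
    return "enable javascript" in normalized.lower()


# Offline demonstration with made-up strings
print(looks_unrendered("   \n\n  Loading...  "))
print(looks_unrendered("Real article text. " * 50))
```

In practice you might call looks_unrendered(documents[0].page_content) after load() and switch to a browser-backed approach when it returns True.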

python

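# Requires: pip install langchain-community beautifulsoup4
# Older releases exposed this class as langchain.document_loaders.WebBaseLoader
from langchain_community.document_loaders import WebBaseLoader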

# Wrong: Using a URL that requires login or blocks scraping
loader = WebBaseLoader("https://example.com/private")
documents = loader.load()  # May raise, or return a login/error page instead of the real content

# Right: Use a public, accessible URL
loader = WebBaseLoader("https://www.example.com")
documents = loader.load()
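
Network errors and timeouts can be handled by wrapping load() in a small retry helper. This is a minimal sketch, not a LangChain feature: load_with_retries and FlakyLoader are hypothetical names, and FlakyLoader only simulates a failing loader so the code runs offline. The helper works with any object that has a load() method.

```python
import time


def load_with_retries(loader, attempts=3, delay_seconds=1.0):
    """Call loader.load(), retrying on failure; return [] if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return loader.load()
        except Exception as exc:  # broad on purpose for the sketch; narrow it in real code
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    return []


class FlakyLoader:
    """Stand-in loader that fails once, then succeeds (for offline demonstration)."""

    def __init__(self):
        self.calls = 0

    def load(self):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("simulated network failure")
        return ["document"]


print(load_with_retries(FlakyLoader(), delay_seconds=0))
```

With a real page, you would pass WebBaseLoader("https://www.example.com") instead of FlakyLoader(). Because WebBaseLoader fetches pages with the requests library, catching requests.exceptions.RequestException is a narrower alternative to the broad except used here.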

📊

Quick Reference

WebBaseLoader(url): Initialize with website URL.
load(): Fetch and parse website content.
Use for static HTML pages only.
Handle exceptions for network issues.

✅

Key Takeaways

Use WebBaseLoader with the website URL to load static HTML content in LangChain.

Call load() on the loader to get the website text as documents.

WebBaseLoader does not support JavaScript-rendered pages; use other tools if needed.

Ensure the website is publicly accessible to avoid loading errors.

Handle network errors gracefully when fetching website content.