How to Load Documents in LangChain: A Simple Guide
To load documents in LangChain, use one of its built-in document loaders, such as TextLoader or UnstructuredPDFLoader. These loaders read files and convert them into Document objects that LangChain can process.
Syntax
LangChain provides various document loaders to read different file types. The general syntax is:
- `loader = SomeLoader(file_path)`: Create a loader instance with the file path.
- `documents = loader.load()`: Load and return a list of Document objects.
Each loader is designed for specific file formats like text, PDF, or HTML.
```python
# Newer LangChain releases expose loaders from langchain_community.document_loaders
from langchain.document_loaders import TextLoader

loader = TextLoader("example.txt")
documents = loader.load()
```
Example
This example shows how to load a plain text file using TextLoader and print the content of the first document.
```python
from langchain.document_loaders import TextLoader

# Create a loader for a text file
loader = TextLoader("sample.txt")

# Load documents
documents = loader.load()

# Print the content of the first document
print(documents[0].page_content)
```
Output
This is the content of the sample.txt file.
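To make the shape of the result concrete, here is a minimal standard-library sketch of what a plain-text loader does conceptually: read the file and wrap its text in an object exposing page_content plus some metadata. SimpleDocument and load_text are hypothetical stand-ins written for illustration, not LangChain classes, and the "source" metadata key is modeled on what TextLoader records.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(file_path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Read one text file and return it as a single-item list of documents."""
    text = Path(file_path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": file_path})]


# Demo with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("This is the content of the sample.txt file.", encoding="utf-8")
    documents = load_text(str(sample))

print(len(documents))
print(documents[0].page_content)
```

With the real TextLoader the access pattern is the same: index into the returned list, then read page_content.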
Common Pitfalls
Common mistakes when loading documents in LangChain include:
- Using the wrong loader for the file type (e.g., using `TextLoader` for PDFs).
- Providing an incorrect file path or pointing to a file that does not exist.
- Forgetting to call the `load()` method to actually read the documents.
Always check the file format and use the matching loader.
```python
from langchain.document_loaders import TextLoader, UnstructuredPDFLoader

# Wrong way: using TextLoader for a PDF file
loader_wrong = TextLoader("document.pdf")
documents_wrong = loader_wrong.load()  # Fails or returns garbled text: a PDF is not plain text

# Right way: use UnstructuredPDFLoader for PDFs
loader_right = UnstructuredPDFLoader("document.pdf")
documents_right = loader_right.load()
```
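The first two pitfalls can be caught before any loader runs. The sketch below uses only the standard library to check that the file exists and to map its extension to a loader class name. pick_loader_name and LOADER_BY_SUFFIX are hypothetical helpers, not part of LangChain, and the mapping covers only loaders named in this guide.

```python
from pathlib import Path
import tempfile

# Loader class names as used in this guide; the mapping itself is an
# illustrative assumption, not something LangChain provides.
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".pdf": "UnstructuredPDFLoader",
    ".csv": "CSVLoader",
}


def pick_loader_name(file_path: str) -> str:
    """Return the loader class name for a file, failing early on common mistakes."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {file_path}")
    suffix = path.suffix.lower()  # treat ".PDF" and ".pdf" the same
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader configured for '{suffix}' files")
    return LOADER_BY_SUFFIX[suffix]


with tempfile.TemporaryDirectory() as tmp:
    pdf_path = Path(tmp) / "document.PDF"
    pdf_path.write_bytes(b"%PDF-1.4\n")  # placeholder bytes, not a valid PDF
    chosen = pick_loader_name(str(pdf_path))

    # A missing file is reported before any loader is constructed.
    try:
        pick_loader_name(str(Path(tmp) / "missing.txt"))
        missing_error = None
    except FileNotFoundError as exc:
        missing_error = exc

print(chosen)
```

Returning a name rather than a loader instance keeps the sketch free of third-party imports; in a real project the dictionary values could be the loader classes themselves.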
Quick Reference
Here is a quick reference for common LangChain document loaders:
| Loader | File Type | Description |
|---|---|---|
| TextLoader | Plain text (.txt) | Loads plain text files as documents. |
| UnstructuredPDFLoader | PDF (.pdf) | Loads PDF files by parsing them with the unstructured library. |
| CSVLoader | CSV (.csv) | Loads CSV files, parsing rows as documents. |
| UnstructuredHTMLLoader | HTML (.html) | Loads HTML files, extracting text content. |
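To illustrate what "parsing rows as documents" means for CSV input, here is a rough standard-library sketch that turns each row into its own document-like record. The dictionaries are stand-ins for Document objects, the sample data is invented, and the "column: value" text layout is an assumption about how a row might be rendered, so check the CSVLoader documentation for its exact output.

```python
import csv
import io

# Inline CSV text stands in for a file on disk (invented sample data).
csv_text = "name,role\nAda,engineer\nGrace,admiral\n"

row_documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # One record per row: "column: value" lines, plus the row index as metadata.
    content = "\n".join(f"{key}: {value}" for key, value in row.items())
    row_documents.append({"page_content": content, "metadata": {"row": row_number}})

print(len(row_documents))
print(row_documents[0]["page_content"])
```

The point is the granularity: a two-row file yields two records, whereas a plain-text loader would return the whole file as one.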
Key Takeaways
- Use the correct LangChain loader for your document file type to ensure proper loading.
- Always call the load() method on the loader instance to get Document objects.
- Check file paths carefully to avoid file-not-found errors.
- LangChain supports many loaders, such as TextLoader for text and UnstructuredPDFLoader for PDFs.
- Loaded documents are returned as a list of Document objects whose text is available through page_content.