What are custom document loaders in LangChain?

Introduction

Custom document loaders bring in data from sources that the built-in loaders do not cover. They let you read and prepare your own files or other sources for your app. Common situations where one helps:

You have files in a special format that no default loader supports.

You want to load documents from a private database or API.

You need to preprocess or clean data before using it.

You want to combine multiple sources into one loader.

You want to add custom metadata while loading documents.
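As an illustration of the preprocessing case, here is a minimal sketch of a cleaning step that a loader could run on raw text before wrapping it in a document. The helper name `clean_text` and the specific rules (Unicode normalization, whitespace collapsing) are assumptions chosen for the example, not part of LangChain.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Hypothetical helper: normalize raw text before it becomes a document."""
    # Normalize Unicode so compatibility characters (ligatures, NBSP) become plain ones.
    text = unicodedata.normalize("NFKC", raw)
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


cleaned = clean_text("  Hello \t world \r\n\r\n\r\n\r\nNext  line  ")
```

A loader would call such a helper inside its load method, just before building each document.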

Syntax

LangChain

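# Import paths assume a recent LangChain release, where the core classes live in langchain_core.
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class MyLoader(BaseLoader):
    def __init__(self, source_path: str):
        self.source_path = source_path

    def load(self) -> list[Document]:
        # Read and process your data here.
        documents = []
        with open(self.source_path, 'r', encoding='utf-8') as f:
            text = f.read()
        # Wrap the text in a Document so other LangChain components can consume it.
        documents.append(Document(page_content=text, metadata={}))
        return documents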

Custom loaders inherit from BaseLoader and implement a load method. In recent langchain-core releases you can implement lazy_load instead and inherit load from BaseLoader.

The load method returns a list of Document objects, each with page_content and an optional metadata dictionary.
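The lazy variant can be sketched as follows. This assumes a recent langchain-core release in which `BaseLoader.load()` collects the results of `lazy_load()`; the class name `LineLoader` and the one-document-per-line rule are made up for the example. The `try`/`except` fallback defines small stand-ins only so the sketch also runs where LangChain is not installed.

```python
import os
import tempfile
from typing import Iterator

try:
    from langchain_core.document_loaders import BaseLoader
    from langchain_core.documents import Document
except ImportError:
    # Stand-ins (assumption): they mimic only the small part of the
    # real classes that this sketch uses.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

    class BaseLoader:
        def lazy_load(self):
            raise NotImplementedError

        def load(self):
            return list(self.lazy_load())


class LineLoader(BaseLoader):
    """Hypothetical loader: yields one Document per non-empty line."""

    def __init__(self, source_path: str):
        self.source_path = source_path

    def lazy_load(self) -> Iterator[Document]:
        # Stream the file so large inputs are never held in memory at once.
        with open(self.source_path, "r", encoding="utf-8") as f:
            for line_number, line in enumerate(f, start=1):
                text = line.strip()
                if text:
                    yield Document(
                        page_content=text,
                        metadata={"source": self.source_path, "line": line_number},
                    )


# Usage: write a small temporary file and load it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\n\nsecond\n")
    docs = LineLoader(path).load()
```

Yielding documents one at a time keeps memory use flat for large sources, while callers that want a list can still call load.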

Examples

Loads a plain text file and returns its content as one document.

LangChain

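# Import paths assume a recent LangChain release.
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class TxtLoader(BaseLoader):
    def __init__(self, filepath):
        self.filepath = filepath

    def load(self):
        # Read the whole file as a single string.
        with open(self.filepath, 'r', encoding='utf-8') as f:
            text = f.read()
        # Return one Document holding the full file content.
        return [Document(page_content=text, metadata={})]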

Loads JSON data and creates documents from each item with metadata.

LangChain

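import json

# Import paths assume a recent LangChain release.
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class JsonLoader(BaseLoader):
    def __init__(self, filepath):
        self.filepath = filepath

    def load(self):
        with open(self.filepath, 'r', encoding='utf-8') as f:
            data = json.load(f)
        docs = []
        # Expects a top-level "items" list whose entries have "text" and "id" keys.
        for item in data['items']:
            docs.append(Document(page_content=item['text'], metadata={'id': item['id']}))
        return docs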
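The JSON loader above only works on files with a particular shape: a top-level `items` list whose entries carry `text` and `id` keys. The following standard-library sketch writes a made-up sample file in that shape and extracts the same two fields; the sample data and the file name are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Made-up sample data in the shape the loader expects.
sample = {
    "items": [
        {"id": 1, "text": "First note"},
        {"id": 2, "text": "Second note"},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    data = json.loads(path.read_text(encoding="utf-8"))

# Pair each item's text with the metadata the loader would attach to it.
pairs = [(item["text"], {"id": item["id"]}) for item in data.get("items", [])]
```

Using `data.get("items", [])` instead of `data['items']` is one way to return no documents, rather than raise a KeyError, when the key is missing.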

Sample Program

This example shows a simple custom loader that reads a text file and adds the file path as metadata. It prints the content and metadata of the loaded document.

LangChain

# Import paths assume a recent LangChain release.
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class SimpleTxtLoader(BaseLoader):
    def __init__(self, filepath):
        self.filepath = filepath

    def load(self):
        with open(self.filepath, 'r', encoding='utf-8') as f:
            text = f.read()
        # Record the file path as metadata so the document can be traced to its source.
        return [Document(page_content=text, metadata={'source': self.filepath})]

# Usage example (assumes example.txt exists in the working directory)
loader = SimpleTxtLoader('example.txt')
docs = loader.load()
for doc in docs:
    print(f"Content:\n{doc.page_content}")
    print(f"Metadata: {doc.metadata}")

Important Notes

Make sure your custom loader handles file encoding and errors gracefully.

Adding metadata helps track where documents come from and can be useful later.

Test your loader with different inputs to ensure it works as expected.
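The notes on encoding, errors, and testing can be sketched together with the standard library alone. The helper name `read_text_safely` and the choice of fallback encodings are assumptions for the example; a loader could call such a helper in place of a bare `open`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_text_safely(path: str, encodings=("utf-8", "latin-1")) -> Optional[str]:
    """Hypothetical helper: try each encoding in turn; return None if unreadable."""
    for encoding in encodings:
        try:
            return Path(path).read_text(encoding=encoding)
        except UnicodeDecodeError:
            # Wrong encoding: try the next candidate.
            # Note that latin-1 accepts any byte sequence, so it acts as a last resort.
            continue
        except OSError:
            # Missing file, permission problem, path is a directory, and so on.
            return None
    return None


# Exercise the helper with different inputs: UTF-8, Latin-1, and a missing file.
with tempfile.TemporaryDirectory() as tmp:
    utf8_file = Path(tmp) / "utf8.txt"
    utf8_file.write_text("café", encoding="utf-8")

    latin1_file = Path(tmp) / "latin1.txt"
    latin1_file.write_bytes("café".encode("latin-1"))

    results = {
        "utf8": read_text_safely(str(utf8_file)),
        "latin1": read_text_safely(str(latin1_file)),
        "missing": read_text_safely(str(Path(tmp) / "missing.txt")),
    }
```

Returning None lets the loader decide whether to skip the file or raise its own error; which of the two is right depends on the application.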

Summary

Custom document loaders let you bring in data from any source you want.

They inherit from BaseLoader and implement a load method that returns Document objects.

Use them to read, clean, and add metadata to your documents before using them.