When you split a large document into smaller chunks, each chunk should keep the document's metadata, such as its source or author. That way you can still trace every chunk back to where it came from.
Metadata preservation during splitting in LangChain
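    # In newer LangChain releases the same class is importable from langchain_text_splitters
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    # `documents` is a placeholder for a list of Document objects you have already loaded
    docs = text_splitter.split_documents(documents)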
split_documents takes a list of Document objects and returns a list of smaller Document objects. Each chunk carries a copy of its parent document's metadata.
chunk_size sets the maximum length of each chunk and chunk_overlap sets how much text adjacent chunks may share. Both are measured in characters by default.
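To make the two points above concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's implementation: `SimpleDoc` and `split_with_metadata` are hypothetical names, and the fixed-width sliding window is a simplification, since `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(docs, chunk_size, chunk_overlap):
    """Cut each document into fixed-width windows and copy its metadata onto every chunk."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for doc in docs:
        text = doc.page_content
        for start in range(0, len(text), step):
            piece = text[start:start + chunk_size]
            # Deep-copy so editing one chunk's metadata cannot affect the others
            chunks.append(SimpleDoc(piece, copy.deepcopy(doc.metadata)))
            if start + chunk_size >= len(text):
                break  # this window already reached the end of the text
    return chunks


doc = SimpleDoc("abcdefghij", {"source": "file1.txt"})
chunks = split_with_metadata([doc], chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(repr(chunk.page_content), chunk.metadata)
```

With `chunk_size=4` and `chunk_overlap=1`, the windows are `abcd`, `defg`, and `ghij`: each chunk shares one character with its neighbor, and all three carry their own copy of `{"source": "file1.txt"}`.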
    from langchain.schema import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Original document with metadata
    original_doc = Document(page_content="Hello world! This is a test.", metadata={"source": "file1.txt"})

    # Splitter setup
    splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=0)

    # Split document
    chunks = splitter.split_documents([original_doc])
    from langchain.schema import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Document with author metadata
    doc = Document(page_content="This is a longer text that will be split.", metadata={"author": "Alice"})

    splitter = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=2)
    chunks = splitter.split_documents([doc])
This program splits a document into smaller parts, each keeping the original metadata like source and page number. It prints each chunk's text and metadata.
    from langchain.schema import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Create a document with metadata
    text = "Langchain helps you build apps with language models."
    metadata = {"source": "example.txt", "page": 1}
    doc = Document(page_content=text, metadata=metadata)

    # Setup splitter
    splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)

    # Split document
    chunks = splitter.split_documents([doc])

    # Print each chunk's content and metadata
    for i, chunk in enumerate(chunks, 1):
        print(f"Chunk {i} content: {chunk.page_content}")
        print(f"Chunk {i} metadata: {chunk.metadata}\n")
Always check that metadata is preserved after splitting to avoid losing important info.
Metadata can include anything like source, author, date, or tags.
Tune chunk_size and chunk_overlap together: smaller chunks are more focused, while more overlap preserves context across chunk boundaries.
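One way to act on the advice to check metadata after splitting is a small helper like the sketch below. `find_metadata_problems` is a hypothetical name, not part of LangChain, and the `SimpleNamespace` objects only stand in for real chunks. The helper relies on nothing more than each chunk having a `metadata` dict.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel so a stored value of None is not mistaken for "absent"


def find_metadata_problems(original_metadata, chunks):
    """Return (chunk_index, key) pairs where a chunk lost or changed an original value."""
    problems = []
    for index, chunk in enumerate(chunks):
        for key, expected in original_metadata.items():
            if chunk.metadata.get(key, _MISSING) != expected:
                problems.append((index, key))
    return problems


original = {"source": "example.txt", "page": 1}

# Stand-ins for split output: the first chunk is intact, the second lost "page"
good = SimpleNamespace(
    page_content="Langchain helps",
    metadata={"source": "example.txt", "page": 1, "start_index": 0},
)
bad = SimpleNamespace(page_content="you build apps", metadata={"source": "example.txt"})

problems = find_metadata_problems(original, [good, bad])
print(problems)  # [(1, 'page')]
```

The check tests that the original keys are present and unchanged rather than requiring exact equality, because a splitter may add keys of its own (for example a `start_index` entry when `add_start_index=True` is set). With real LangChain objects, the call would look like `find_metadata_problems(doc.metadata, splitter.split_documents([doc]))`.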
Splitting documents keeps metadata so you don't lose context.
Use the split_documents method to split documents while preserving their metadata.
Customize chunk size and overlap to fit your needs.