LangChain framework, ~30 mins

Metadata preservation during splitting in LangChain - Mini Project: Build & Apply

📖 Scenario: You are building a document processing tool using LangChain. You have a document with text and metadata. You want to split the document into smaller chunks but keep the metadata attached to each chunk.

🎯 Goal: Create a Python script that uses LangChain's CharacterTextSplitter to split a document while preserving its metadata in each chunk.

📋 What You'll Learn

Create a Document object with specific text and metadata

Create a CharacterTextSplitter with a chunk size of 10, no overlap, and a space separator

Use the splitter to split the document into chunks

Ensure each chunk keeps the original metadata

💡 Why This Matters

🌍 Real World

When processing large documents for search or analysis, splitting text into smaller parts while keeping metadata helps maintain context and source information.

💼 Career

This skill is useful for developers working on document processing, search engines, chatbots, or any application that handles large text data with metadata.

Create the initial Document with text and metadata

Create a Document object named doc with the text 'Hello world! This is a test document.' and metadata {'source': 'test_source'}.

LangChain

from langchain.schema import Document

# Create the Document object named doc with text and metadata
# Your code here

Need a hint?

Use Document(page_content=..., metadata=...) to create the document. Recent LangChain releases also expose the same class as langchain_core.documents.Document.

Create a CharacterTextSplitter with chunk size 10

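Create a CharacterTextSplitter object named splitter with chunk_size=10, chunk_overlap=0, and separator=' '. Set the overlap explicitly: the default chunk_overlap is larger than 10, and the constructor rejects an overlap greater than the chunk size. Set the separator as well: the default separator is a blank line, which this one-line text does not contain, so the text would otherwise come back as a single chunk.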

LangChain

from langchain.schema import Document

# Create the Document object named doc with text and metadata
doc = Document(
    page_content='Hello world! This is a test document.',
    metadata={'source': 'test_source'}
)

# Create the CharacterTextSplitter named splitter with chunk_size=10
# Your code here

Need a hint?

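Import CharacterTextSplitter from langchain.text_splitter (or from the langchain_text_splitters package in recent releases) and pass chunk_size=10, chunk_overlap=0, and separator=' '.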
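Why the separator matters: CharacterTextSplitter splits the text on one separator string and then merges neighbouring pieces until adding another would exceed chunk_size. The sketch below is a simplified pure-Python model of that split-then-merge idea, not LangChain's implementation. The helper split_then_merge is hypothetical, and overlap is ignored to keep it short.

```python
def split_then_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Simplified model: split on one separator, then greedily re-join pieces."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for piece in pieces:
        # Length this piece would add, including a joining separator if needed.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The chunk being built is full: emit it and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


text = 'Hello world! This is a test document.'

# The text has no blank line, so there is a single piece that cannot be shortened.
by_paragraph = split_then_merge(text, '\n\n', 10)

# Splitting on spaces yields word-sized pieces that fit within 10 characters.
by_space = split_then_merge(text, ' ', 10)

print(by_paragraph)
print(by_space)
```

In this model the blank-line separator returns the whole sentence as one oversized chunk, while the space separator returns several chunks of at most 10 characters.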

Split the document into chunks preserving metadata

Use splitter.split_documents with a list containing doc to create a variable chunks.

LangChain

from langchain.schema import Document
from langchain.text_splitter import CharacterTextSplitter

doc = Document(
    page_content='Hello world! This is a test document.',
    metadata={'source': 'test_source'}
)

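# chunk_overlap must not exceed chunk_size (the default overlap is larger than 10);
# separator=' ' lets the splitter break this one-line text on spaces
splitter = CharacterTextSplitter(separator=' ', chunk_size=10, chunk_overlap=0)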

# Split the document into chunks and store in chunks
# Your code here

Need a hint?

Call split_documents on a list with doc inside.
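As far as the library's source shows, split_documents collects each document's page_content and metadata and hands them to create_documents, which attaches a copy of the originating document's metadata to every chunk it produces. The sketch below models that behaviour with stand-in types so it runs without LangChain installed. Doc and split_docs are hypothetical names, and str.split stands in for a real splitter's split_text method.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs: list[Doc], split_text: Callable[[str], list[str]]) -> list[Doc]:
    """Split every document and give each chunk its own copy of the metadata."""
    chunks: list[Doc] = []
    for doc in docs:
        for piece in split_text(doc.page_content):
            # Deep copy so that editing one chunk's metadata cannot
            # leak into sibling chunks or the source document.
            chunks.append(Doc(piece, copy.deepcopy(doc.metadata)))
    return chunks


docs = [
    Doc('Hello world!', {'source': 'test_source'}),
    Doc('Second file', {'source': 'other_source'}),
]

# str.split (whitespace splitting) plays the role of the text splitter here.
chunks = split_docs(docs, str.split)

chunks[0].metadata['chunk_id'] = 0  # affects only the first chunk

for chunk in chunks:
    print(repr(chunk.page_content), chunk.metadata)
```

Because each chunk holds its own copy, per-chunk fields such as the illustrative chunk_id can be added later without touching the source document or sibling chunks.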

Verify each chunk preserves the original metadata

Add a for loop iterating over chunks with variable chunk. Inside the loop, assign chunk.metadata to a variable meta. For every chunk, meta should equal the original metadata, {'source': 'test_source'}.

LangChain

from langchain.schema import Document
from langchain.text_splitter import CharacterTextSplitter

doc = Document(
    page_content='Hello world! This is a test document.',
    metadata={'source': 'test_source'}
)

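# chunk_overlap must not exceed chunk_size (the default overlap is larger than 10);
# separator=' ' lets the splitter break this one-line text on spaces
splitter = CharacterTextSplitter(separator=' ', chunk_size=10, chunk_overlap=0)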

chunks = splitter.split_documents([doc])

# Loop over chunks and assign chunk.metadata to meta
# Your code here

Need a hint?

Use for chunk in chunks: and inside assign meta = chunk.metadata.
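Assigning meta inside the loop only reads the metadata; to actually verify it, the loop can be turned into a check. The following is a minimal sketch of one way to do that. chunks_keep_metadata is a hypothetical helper, and SimpleNamespace objects stand in for LangChain chunks so the example runs on its own. It tests that the expected pairs are present rather than requiring exact equality, on the assumption that a splitter may add extra keys to a chunk's metadata.

```python
from types import SimpleNamespace


def chunks_keep_metadata(chunks, expected: dict) -> bool:
    """Return True if every chunk carries all key/value pairs in `expected`."""
    for chunk in chunks:
        meta = chunk.metadata
        for key, value in expected.items():
            if key not in meta or meta[key] != value:
                return False
    return True


expected = {'source': 'test_source'}

# Stand-in chunks; any object with a .metadata dict works the same way.
good = [
    SimpleNamespace(page_content='Hello', metadata={'source': 'test_source'}),
    SimpleNamespace(page_content='world!',
                    metadata={'source': 'test_source', 'start_index': 6}),
]
bad = good + [SimpleNamespace(page_content='oops', metadata={})]

print(chunks_keep_metadata(good, expected))
print(chunks_keep_metadata(bad, expected))
```

With the variables from this exercise, the call would be chunks_keep_metadata(chunks, doc.metadata).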