Text splitters in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Text splitters

Problem: You have a long text document that you want to split into smaller chunks for easier processing by a language model. Currently, the text splitter divides the text into chunks of fixed size without considering sentence boundaries.

Current Metrics: Chunk size: 1000 characters; Overlap: 0; Number of chunks: 15; Issue: Some chunks cut sentences in half, causing loss of context.

Issue: The fixed-size splitter cuts sentences abruptly, which can confuse the language model and reduce the quality of downstream tasks like summarization or question answering.
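
To make the problem concrete, here is a minimal sketch of what a fixed-size splitter like the one described above might look like. The function name `fixed_size_split` and the sample text are assumptions for illustration; the original splitter is not shown on this page. A smaller chunk size is used so the cut is easy to see.

```python
def fixed_size_split(text, chunk_size=1000):
    """Baseline: cut every chunk_size characters, ignoring sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample document: one sentence repeated many times.
sample = "The quick brown fox jumps over the lazy dog. " * 50

baseline_chunks = fixed_size_split(sample, chunk_size=100)

# Inspect the boundary between the first two chunks.
print(repr(baseline_chunks[0][-20:]))  # tail of chunk 1
print(repr(baseline_chunks[1][:20]))   # head of chunk 2
```

The first chunk stops partway through a sentence and the second chunk starts with the remainder, which is the context loss the experiment sets out to fix.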

Your Task

Improve the text splitter to split text into chunks that respect sentence boundaries, reducing sentence cuts and improving context preservation.

You must keep chunk size approximately 1000 characters.

You can add overlap between chunks if needed.

Use only Python standard libraries or common NLP libraries like NLTK or spaCy.

Solution

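import nltk
from nltk.tokenize import sent_tokenize

# Download the sentence tokenizer data. Newer NLTK releases look for
# 'punkt_tab' rather than 'punkt', so both are requested here.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

def split_text_into_chunks(text, max_chunk_size=1000, overlap=100):
    """Group whole sentences into chunks of about max_chunk_size characters."""
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ''
    for sentence in sentences:
        # The +1 accounts for the space that joins the sentence to the chunk.
        if len(current_chunk) + len(sentence) + 1 <= max_chunk_size:
            current_chunk += ' ' + sentence if current_chunk else sentence
        else:
            if current_chunk:
                chunks.append(current_chunk)
            # Start the new chunk with the last `overlap` characters of the
            # previous one. The slice is character-based, so it can begin
            # mid-word, and the new chunk can exceed max_chunk_size.
            # Guard against overlap=0: current_chunk[-0:] is the whole chunk.
            overlap_text = current_chunk[-overlap:] if overlap > 0 else ''
            current_chunk = overlap_text + ' ' + sentence if overlap_text else sentence
    if current_chunk:
        chunks.append(current_chunk)
    return chunks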

# Example usage
text = """Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a valuable way. Many challenges in NLP involve natural language understanding, natural language generation, and speech recognition."""

chunks = split_text_into_chunks(text)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (length {len(chunk)}):\n{chunk}\n")

Replaced fixed-size character splitting with sentence tokenization using NLTK.

Grouped sentences into chunks without breaking sentences.

Added overlap of 100 characters between chunks to preserve context.
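
A character-based overlap can begin in the middle of a word or sentence. One possible alternative, sketched below, is to carry whole sentences forward instead. This is not the solution above: the helper name `group_with_sentence_overlap` and the `overlap_sentences` parameter are hypothetical, and the function takes an already tokenized list of sentences (for example, the output of `sent_tokenize`).

```python
def group_with_sentence_overlap(sentences, max_chunk_size=1000, overlap_sentences=1):
    """Group sentences into chunks, repeating the last few sentences of each
    chunk at the start of the next one (hypothetical helper)."""
    chunks = []
    current = []  # sentences in the chunk being built
    for sentence in sentences:
        candidate = ' '.join(current + [sentence])
        if current and len(candidate) > max_chunk_size:
            chunks.append(' '.join(current))
            # Carry whole sentences forward instead of a raw character slice.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(' '.join(current))
    return chunks


# Small example with a tiny chunk size so the overlap is visible.
demo_sentences = ["Alpha one.", "Beta two.", "Gamma three.", "Delta four."]
for chunk in group_with_sentence_overlap(demo_sentences, max_chunk_size=25):
    print(chunk)
```

As with the character-based version, a chunk can end up longer than `max_chunk_size` when the carried-over sentences plus the next sentence exceed the limit, so the size remains approximate.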

Results Interpretation

Before: 15 chunks of fixed 1000 characters, sentences cut in half causing context loss.
After: ~15 chunks with sentence boundaries respected, overlap added, better context preservation.

Splitting text at sentence boundaries and adding overlap helps each chunk keep its context, which can improve downstream language model tasks such as summarization and question answering.
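
The before/after comparison can be made measurable with a small report. The sketch below is an assumption rather than part of the original experiment: `chunk_report` is a hypothetical helper, and it uses a rough heuristic that treats a chunk as cut when it does not end in `.`, `!` or `?`. Chunks that legitimately end in a closing quote or bracket would be miscounted.

```python
def chunk_report(chunks):
    """Summarise a list of chunks (hypothetical helper, heuristic only)."""
    if not chunks:
        return {"num_chunks": 0, "mean_length": 0.0, "max_length": 0, "cut_chunks": 0}
    lengths = [len(c) for c in chunks]
    # Heuristic: a chunk that does not end in ., ! or ? was probably cut.
    cut = sum(not c.rstrip().endswith(('.', '!', '?')) for c in chunks)
    return {
        "num_chunks": len(chunks),
        "mean_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
        "cut_chunks": cut,
    }


# Toy data standing in for the outputs of the two splitters.
before = ["First sentence. Second sen", "tence. Third sentence."]
after = ["First sentence. Second sentence.", "Third sentence."]

print("before:", chunk_report(before))
print("after: ", chunk_report(after))
```

Running the same report on the fixed-size chunks and on the sentence-aware chunks gives a direct way to check whether the number of cut chunks actually dropped.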

Bonus Experiment

Try splitting text using a different NLP library like spaCy and compare the chunk quality and number of chunks.

💡 Hint

Use spaCy's sentence segmentation and implement similar chunk grouping logic.
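
One way to start the bonus experiment is sketched below, assuming spaCy v3 is installed. It uses a blank English pipeline with the rule-based `sentencizer` component, which needs no model download; a trained pipeline such as `en_core_web_sm` sets sentence boundaries differently and may split the same text in other places. The helper names `spacy_sentences` and `group_sentences` are hypothetical, and the grouping omits overlap to keep the sketch short.

```python
def spacy_sentences(text):
    """Split text into sentences with spaCy's rule-based sentencizer."""
    import spacy  # imported lazily so the rest of the sketch works without spaCy

    nlp = spacy.blank("en")      # blank English pipeline, no model download
    nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection
    return [sent.text.strip() for sent in nlp(text).sents]


def group_sentences(sentences, max_chunk_size=1000):
    """Group whole sentences into chunks, same idea as the NLTK solution
    but without overlap."""
    chunks, current = [], ''
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


try:
    demo = "NLP is fun. Splitting text is useful. Chunks keep context."
    print(group_sentences(spacy_sentences(demo), max_chunk_size=40))
except ImportError:
    print("spaCy is not installed; run `pip install spacy` to try this sketch.")
```

Passing the same document through both the NLTK and spaCy versions and comparing the number of chunks and their boundaries is the comparison the bonus experiment asks for.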

Practice

1. What is the main purpose of a text splitter in AI applications?

easy

A. To translate text into different languages

B. To generate new text from a prompt

C. To break long text into smaller, manageable pieces

D. To summarize text into a single sentence

Solution

Step 1: Understand the role of text splitters

Step 2: Compare options to the definition

Final Answer: C. To break long text into smaller, manageable pieces
