Prompt Engineering / GenAI · ML · ~20 mins

Text splitters in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Text splitters
Problem: You have a long text document that you want to split into smaller chunks for easier processing by a language model. Currently, the text splitter divides the text into chunks of fixed size without considering sentence boundaries.
Current Metrics: Chunk size: 1000 characters; Overlap: 0; Number of chunks: 15; Issue: Some chunks cut sentences in half, causing loss of context.
Issue: The fixed-size splitter cuts sentences abruptly, which can confuse the language model and reduce the quality of downstream tasks like summarization or question answering.
Your Task
Improve the text splitter to split text into chunks that respect sentence boundaries, reducing sentence cuts and improving context preservation.
You must keep chunk size approximately 1000 characters.
You can add overlap between chunks if needed.
Use only Python standard libraries or common NLP libraries like NLTK or spaCy.
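Before reaching for NLTK or spaCy, here is a standard-library-only baseline sketch: a rough regex heuristic for sentence boundaries plus a packer that fills chunks up to a size limit. The function names are illustrative, and the regex will mis-split on abbreviations like "e.g.", so treat this only as a no-dependency starting point.

```python
import re

def split_sentences(text):
    # Rough heuristic: split after ., ! or ? followed by whitespace.
    # Mis-splits on abbreviations such as "e.g." -- use a real
    # sentence tokenizer (NLTK, spaCy) for production text.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def pack_chunks(sentences, max_chunk_size=1000):
    # Greedily pack whole sentences into chunks of at most
    # max_chunk_size characters, never cutting a sentence.
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_chunk_size:
            current = f"{current} {sentence}".strip()
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

This mirrors the structure of the full solution below, minus overlap handling.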
Solution
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def split_text_into_chunks(text, max_chunk_size=1000, overlap=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ''
    for sentence in sentences:
        # Append the sentence if it still fits in the current chunk
        # (the +1 accounts for the joining space).
        if len(current_chunk) + len(sentence) + 1 <= max_chunk_size:
            current_chunk = f"{current_chunk} {sentence}".strip()
        else:
            if current_chunk:
                chunks.append(current_chunk)
            # Start the new chunk with the tail of the previous one
            # so neighbouring chunks share some context.
            overlap_text = current_chunk[-overlap:] if len(current_chunk) > overlap else current_chunk
            current_chunk = f"{overlap_text} {sentence}".strip()
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

# Example usage
text = """Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a valuable way. Many challenges in NLP involve natural language understanding, natural language generation, and speech recognition."""

chunks = split_text_into_chunks(text)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (length {len(chunk)}):\n{chunk}\n")
Replaced fixed-size character splitting with sentence tokenization using NLTK.
Grouped sentences into chunks without breaking sentences.
Added overlap of 100 characters between chunks to preserve context.
Results Interpretation

Before: 15 chunks of fixed 1000 characters, sentences cut in half causing context loss.
After: a comparable number of chunks (overlap adds a little duplication), with sentence boundaries respected and 100-character overlap between chunks, so context is preserved across chunk edges.

Splitting text by sentence boundaries and adding overlap helps maintain context and improves downstream language model tasks.
Bonus Experiment
Try splitting text using a different NLP library like spaCy and compare the chunk quality and number of chunks.
💡 Hint
Use spaCy's sentence segmentation and implement similar chunk grouping logic.
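One possible sketch for the bonus experiment, assuming spaCy is installed: it uses a blank English pipeline with the rule-based `sentencizer` component, so no statistical model download is required. The function name and chunking logic mirror the NLTK solution above.

```python
import spacy

def split_text_spacy(text, max_chunk_size=1000):
    # Blank pipeline + rule-based sentencizer: no model download needed.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    doc = nlp(text)
    chunks, current = [], ""
    for sent in doc.sents:
        sentence = sent.text.strip()
        # Pack whole sentences into chunks of at most max_chunk_size.
        if len(current) + len(sentence) + 1 <= max_chunk_size:
            current = f"{current} {sentence}".strip()
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

For a fairer comparison with the NLTK version, swap `spacy.blank("en")` for a trained pipeline such as `en_core_web_sm`, whose parser-based sentence segmentation handles abbreviations better than the rule-based sentencizer.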