Text splitters in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Text splitters
Problem: You have a long text document that you want to split into smaller chunks for easier processing by a language model. Currently, the text splitter divides the text into chunks of fixed size without considering sentence boundaries.
Current Metrics: Chunk size: 1000 characters; Overlap: 0; Number of chunks: 15; Issue: Some chunks cut sentences in half, causing loss of context.
Issue: The fixed-size splitter cuts sentences abruptly, which can confuse the language model and reduce the quality of downstream tasks like summarization or question answering.
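To see the problem concretely, here is a minimal sketch of a fixed-size character splitter. The function name `fixed_size_split` and the sample text are illustrative assumptions, not part of the original experiment; a small chunk size is used so the effect shows up on a short string.

```python
def fixed_size_split(text, chunk_size):
    """Cut text every `chunk_size` characters, ignoring sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Illustrative sample text (hypothetical, not the experiment's document)
sample = "Cats sleep a lot. Dogs like to run. Birds sing at dawn."
chunks = fixed_size_split(sample, chunk_size=20)

for i, chunk in enumerate(chunks, 1):
    # A chunk that does not end in '.' was cut in the middle of a sentence
    ends_cleanly = chunk.rstrip().endswith(".")
    print(f"Chunk {i}: {chunk!r} (ends on sentence boundary: {ends_cleanly})")
```

With this sample, the first two chunks end in "Do" and "Bird", which is the kind of mid-sentence cut the experiment sets out to remove.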
Your Task
Improve the text splitter to split text into chunks that respect sentence boundaries, reducing sentence cuts and improving context preservation.
You must keep chunk size approximately 1000 characters.
You can add overlap between chunks if needed.
Use only Python standard libraries or common NLP libraries like NLTK or spaCy.
Solution
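import nltk
from nltk.tokenize import sent_tokenize

# Download the sentence tokenizer data. NLTK 3.9+ looks for 'punkt_tab'; older releases use 'punkt'.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)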

def split_text_into_chunks(text, max_chunk_size=1000, overlap=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ''
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= max_chunk_size:
            current_chunk += ' ' + sentence if current_chunk else sentence
        else:
            if current_chunk:
                chunks.append(current_chunk)
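            # Seed the next chunk with the last `overlap` characters of the previous one.
            # This tail can begin mid-sentence. overlap=0 is special-cased because
            # current_chunk[-0:] would return the whole chunk.
            overlap_text = current_chunk[-overlap:] if overlap > 0 else ''
            current_chunk = overlap_text + ' ' + sentence if overlap_text else sentence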
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

# Example usage
text = """Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a valuable way. Many challenges in NLP involve natural language understanding, natural language generation, and speech recognition."""

chunks = split_text_into_chunks(text)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (length {len(chunk)}):\n{chunk}\n")
Replaced fixed-size character splitting with sentence tokenization using NLTK.
Grouped sentences into chunks without breaking sentences.
Added overlap of 100 characters between chunks to preserve context.
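The 100-character overlap in the solution is a raw character slice, so a chunk can begin mid-sentence or mid-word. One possible alternative, sketched below, carries whole sentences over instead. The names `split_sentences` and `chunk_with_sentence_overlap` and the `overlap_sentences` parameter are assumptions for illustration, and the regex splitter is a simplified stand-in for NLTK's `sent_tokenize` so the sketch runs with the standard library only.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_with_sentence_overlap(text, max_chunk_size=1000, overlap_sentences=1):
    """Group sentences into chunks; repeat the last few sentences in the next chunk."""
    chunks = []
    current = []  # sentences in the chunk being built
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry whole sentences over, so the overlap never starts mid-sentence
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the new chunk past the limit
            if len(" ".join(current + [sentence])) > max_chunk_size:
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One is first. Two is second. Three is third. Four is fourth."
result = chunk_with_sentence_overlap(sample, max_chunk_size=32, overlap_sentences=1)
for chunk in result:
    print(repr(chunk))
```

Each chunk after the first repeats the final sentence of the previous chunk. A single sentence longer than `max_chunk_size` would still produce an oversized chunk; handling that case is left out of the sketch.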
Results Interpretation

Before: 15 chunks of fixed 1000 characters, sentences cut in half causing context loss.
After: ~15 chunks with sentence boundaries respected, overlap added, better context preservation.

Splitting text by sentence boundaries and adding overlap helps maintain context and improves downstream language model tasks.
Bonus Experiment
Try splitting text using a different NLP library like spaCy and compare the chunk quality and number of chunks.
💡 Hint
Use spaCy's sentence segmentation and implement similar chunk grouping logic.
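A possible starting point for the bonus experiment is sketched below. It assumes spaCy v3 is installed and uses a blank English pipeline with the rule-based `sentencizer` component, which needs no model download. The helper names `spacy_sentences`, `group_sentences`, and `chunk_with_spacy` are hypothetical, and overlap is omitted to keep the comparison simple.

```python
def spacy_sentences(text):
    """Split text into sentences with spaCy's rule-based sentencizer.

    Assumes spaCy v3 is installed; a blank English pipeline needs no model download.
    """
    import spacy  # imported lazily so the grouping logic below runs without spaCy

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    return [sent.text.strip() for sent in nlp(text).sents]


def group_sentences(sentences, max_chunk_size=1000):
    """Pack whole sentences into chunks of at most max_chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_with_spacy(text, max_chunk_size=1000):
    """Sentence-aware chunking with spaCy as the sentence segmenter."""
    return group_sentences(spacy_sentences(text), max_chunk_size)
```

Running `chunk_with_spacy(text)` and the NLTK-based `split_text_into_chunks(text, overlap=0)` on the same document, then comparing the number of chunks and where each chunk ends, gives a like-for-like comparison of the two segmenters.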

Practice

1. What is the main purpose of a text splitter in AI applications?
easy
A. To translate text into different languages
B. To generate new text from a prompt
C. To break long text into smaller, manageable pieces
D. To summarize text into a single sentence

Solution

  1. Step 1: Understand the role of text splitters

    Text splitters are designed to divide long text into smaller parts for easier processing.
  2. Step 2: Compare options to the definition

    Only option C, "To break long text into smaller, manageable pieces", matches this purpose. The other options describe translation, text generation, and summarization.
  3. Final Answer:

    To break long text into smaller, manageable pieces -> Option C
  4. Quick Check:

    Text splitter purpose = break text [OK]
Hint: Text splitters cut text into chunks for easier handling [OK]
Common Mistakes:
  • Confusing splitting with translation
  • Thinking splitters summarize text
  • Assuming splitters generate new text
2. Which of the following is the correct way to set chunk size and overlap in a text splitter?
easy
A. chunk_size='100', overlap=20
B. chunkSize=100, overlap=20
C. chunk_size=100, overlap=twenty
D. chunk_size=100, overlap=20

Solution

  1. Step 1: Identify correct parameter names and types

    Python parameter names conventionally use snake_case, and chunk size and overlap must be numeric values.
  2. Step 2: Check each option for syntax and type correctness

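    Option D, chunk_size=100, overlap=20, uses snake_case names and integer values. Option A passes the size as a string, option B uses camelCase, and option C passes the undefined name twenty instead of a number.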
  3. Final Answer:

    chunk_size=100, overlap=20 -> Option D
  4. Quick Check:

    Correct param names and numeric values = chunk_size=100, overlap=20 [OK]
Hint: Use underscores and numbers for chunk size and overlap [OK]
Common Mistakes:
  • Using camelCase instead of snake_case
  • Passing string instead of number for overlap
  • Misspelling parameter names
3. Given the following Python code using a text splitter:
text = "Hello world! This is a test of text splitting."
chunk_size = 12
overlap = 4
splitter = TextSplitter(chunk_size=chunk_size, overlap=overlap)
chunks = splitter.split(text)
print(chunks)

What is the expected output?
medium
A. ["Hello world!", "world! This is", "This is a test", "a test of text", "of text splitting."]
B. ["Hello world! This", "This is a test of", "test of text splitting."]
C. ["Hello world! This is a test of text splitting."]
D. ["Hello", "world!", "This", "is", "a", "test", "of", "text", "splitting."]

Solution

  1. Step 1: Understand chunk size and overlap effect

    Chunk size 12 is the target length of each piece in characters; overlap 4 means the next chunk starts about 4 characters before the previous one ends, so the end of one chunk is repeated at the start of the next.
  2. Step 2: Apply splitting logic to the text

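    The first chunk is "Hello world!" (exactly 12 characters). Each later chunk repeats the tail of the previous one, and option A is the only choice that shows overlapping chunks. Note that the later chunks in option A start on word boundaries and run 14 to 18 characters, which implies a word-aware splitter; a strict character window would instead start the second chunk 8 characters in (12 - 4 = 8), giving "rld! This is".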
  3. Final Answer:

    ["Hello world!", "world! This is", "This is a test", "a test of text", "of text splitting."] -> Option A
  4. Quick Check:

    Chunk size 12 + overlap 4 = overlapping chunks [OK]
Hint: Chunk size limits length; overlap repeats last part [OK]
Common Mistakes:
  • Ignoring overlap and making chunks non-overlapping
  • Using wrong chunk sizes
  • Returning entire text as one chunk
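For comparison, a strict character-window splitter can be sketched as follows. The quiz's `TextSplitter` is not defined in the text, so `sliding_window_split` is a hypothetical stand-in. Its output differs from option A, which breaks on word boundaries, but it shows the same overlap mechanics: each window starts `chunk_size - overlap` characters after the previous one.

```python
def sliding_window_split(text, chunk_size, overlap):
    """Strict character windows: each chunk starts (chunk_size - overlap) after the last."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "Hello world! This is a test of text splitting."
chunks = sliding_window_split(text, chunk_size=12, overlap=4)
print(chunks[:2])  # ['Hello world!', 'rld! This is']
```

The last 4 characters of the first window ("rld!") reappear as the first 4 characters of the second, which is what the overlap setting controls.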
4. Consider this code snippet that tries to split text but raises an error:
text = "Sample text for splitting."
splitter = TextSplitter(chunk_size='10', overlap=3)
chunks = splitter.split(text)

What is the most likely cause of the error?
medium
A. chunk_size should be an integer, not a string
B. overlap cannot be less than 5
C. TextSplitter requires a minimum chunk_size of 20
D. The text variable is too short to split

Solution

  1. Step 1: Check parameter types

    chunk_size is given as a string '10' instead of an integer 10, which causes a type error.
  2. Step 2: Validate other options

    Overlap 3 is valid; no minimum chunk size of 20 is required; text length is sufficient.
  3. Final Answer:

    chunk_size should be an integer, not a string -> Option A
  4. Quick Check:

    Parameter type mismatch = chunk_size should be an integer, not a string [OK]
Hint: Use numbers, not strings, for chunk size and overlap [OK]
Common Mistakes:
  • Passing chunk_size as string
  • Assuming overlap minimum is 5
  • Thinking text length causes error
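The failure in question 4 can be reproduced with a small stand-in class. `SimpleTextSplitter` below is hypothetical; the quiz does not say which library `TextSplitter` comes from or what error message it produces, so the explicit type check and its message are assumptions.

```python
class SimpleTextSplitter:
    """Hypothetical stand-in for the quiz's TextSplitter, with explicit type checks."""

    def __init__(self, chunk_size, overlap=0):
        for name, value in (("chunk_size", chunk_size), ("overlap", overlap)):
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{name} must be an int, got {type(value).__name__}")
        if chunk_size <= 0 or not 0 <= overlap < chunk_size:
            raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def split(self, text):
        step = self.chunk_size - self.overlap
        return [text[i:i + self.chunk_size] for i in range(0, len(text), step)]


try:
    SimpleTextSplitter(chunk_size='10', overlap=3)
except TypeError as err:
    message = str(err)
    print(message)  # chunk_size must be an int, got str
```

Validating in the constructor surfaces the mistake at the call site, instead of later inside `split` when the string is used in arithmetic.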
5. You have a very long document and want to split it for an AI model that can only process 500 tokens at a time. You want some context overlap to keep meaning. Which approach best balances chunk size and overlap?
hard
A. Set chunk_size to 600 tokens and overlap to 0 tokens
B. Set chunk_size to 500 tokens and overlap to 100 tokens
C. Set chunk_size to 400 tokens and overlap to 200 tokens
D. Set chunk_size to 100 tokens and overlap to 50 tokens

Solution

  1. Step 1: Understand model token limit and overlap purpose

    The model can process 500 tokens max; overlap adds repeated context to help understanding.
  2. Step 2: Evaluate options for chunk size and overlap

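    Option B uses a chunk size of 500 tokens (the maximum allowed) with an overlap of 100 tokens (a moderate amount of repeated context). Option A exceeds the limit and has no overlap, option C spends half of every chunk on repeated text, and option D uses needlessly small chunks.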
  3. Final Answer:

    Set chunk_size to 500 tokens and overlap to 100 tokens -> Option B
  4. Quick Check:

    Chunk size ≤ 500 with overlap for context = Set chunk_size to 500 tokens and overlap to 100 tokens [OK]
Hint: Keep chunk size at max limit, add moderate overlap [OK]
Common Mistakes:
  • Exceeding model token limit
  • Setting overlap too large or zero
  • Using very small chunk sizes unnecessarily
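The arithmetic behind option B can be sketched with a token-level sliding window. The function name `token_windows` is hypothetical, and the synthetic token list stands in for the output of whatever tokenizer the model actually uses.

```python
def token_windows(tokens, chunk_size=500, overlap=100):
    """Slice a token list into windows of chunk_size that share `overlap` tokens."""
    step = chunk_size - overlap  # each window advances by 400 tokens with the defaults
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the token list
    return windows


# Synthetic tokens stand in for real tokenizer output (an assumption for illustration)
tokens = [f"tok{i}" for i in range(1300)]
windows = token_windows(tokens)
print([len(w) for w in windows])  # [500, 500, 500]
```

No window exceeds the 500-token limit, and consecutive windows share 100 tokens, so a 1300-token document needs three model calls.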