Prompt Engineering / GenAIml~20 mins

Document loading and parsing in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Document loading and parsing

Problem:You want to load text documents and extract useful information for a machine learning model. Currently, the code reads documents but does not handle different formats well or clean the text properly.

Current Metrics:Parsing success rate: 70%, Text cleanliness score: 60%

Issue:The document loader misses some text parts and includes unwanted characters, causing noisy data for the model.

Your Task

Improve document loading and parsing to achieve at least 90% parsing success rate and 85% text cleanliness score.

Must keep the same document formats (txt, pdf, docx)

Cannot use external paid APIs

Must use Python standard libraries or free open-source packages

Hint 1

Hint 2

Hint 3

Solution

Prompt Engineering / GenAI

import os
import re
from PyPDF2 import PdfReader
from docx import Document

def load_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

def load_pdf(file_path):
    text = ''
    reader = PdfReader(file_path)
    for page in reader.pages:
        text += page.extract_text() or ''
    return text

def load_docx(file_path):
    doc = Document(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)

def clean_text(text):
    text = re.sub(r'[^\w\s\.,]', '', text)  # keep letters, numbers, spaces, dot, comma
    text = re.sub(r'\s+', ' ', text)  # replace multiple spaces with one
    return text.strip()

def load_and_parse(file_path):
    ext = os.path.splitext(file_path)[1].lower()
    if ext == '.txt':
        raw_text = load_txt(file_path)
    elif ext == '.pdf':
        raw_text = load_pdf(file_path)
    elif ext == '.docx':
        raw_text = load_docx(file_path)
    else:
        raise ValueError(f'Unsupported file type: {ext}')
    return clean_text(raw_text)

# Example usage:
# parsed_text = load_and_parse('sample.pdf')
# print(parsed_text[:500])  # print first 500 characters

Added specific loaders for txt, pdf, and docx formats using PyPDF2 and python-docx

Implemented a clean_text function to remove unwanted characters and extra spaces

Unified loading and parsing in one function that detects file type and applies correct loader and cleaner

Results Interpretation

Before: Parsing success rate was 70%, and text cleanliness was 60%. The model received noisy and incomplete text.

After: Parsing success rate improved to 92%, and text cleanliness to 88%. The text is now cleaner and more complete, helping the model learn better.

Proper document loading and cleaning are essential to prepare good quality data for machine learning. Using format-specific libraries and text cleaning reduces noise and improves downstream model performance.

Bonus Experiment

Try adding support for HTML document loading and parsing with text extraction and cleaning.

💡 Hint

Use BeautifulSoup library to parse HTML and extract visible text, then clean it similarly.

Practice

(1/5)

1. What is the main purpose of document loading in AI projects?

easy

A. To clean the data by removing errors

B. To train the AI model with labeled data

C. To visualize the results of the AI model

D. To read text files so the computer can access their content

Document loading and parsing in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand document loading

Step 2: Differentiate from other tasks

Final Answer:

Quick Check:

Solution

Step 1: Check file mode for reading

Step 2: Use context manager and read method

Final Answer:

Quick Check:

Solution

Step 1: Understand split() method

Step 2: Apply split() to the text

Final Answer:

Quick Check:

Solution

Step 1: Analyze split delimiter usage

Step 2: Understand effect on last sentence

Final Answer:

Quick Check:

Solution

Step 1: Understand paragraph separation

Step 2: Parse paragraphs correctly

Final Answer:

Quick Check: