NLPml~20 mins

Why preprocessing cleans raw text in NLP - Experiment to Prove It

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Why preprocessing cleans raw text

Problem:You have raw text data with lots of noise like punctuation, uppercase letters, and extra spaces. This noise makes it hard for a model to learn useful patterns.

Current Metrics:Model accuracy on text classification: 65% on training, 60% on validation

Issue:The model struggles because the raw text contains noise that confuses it, leading to lower accuracy.

Your Task

Improve model accuracy by cleaning the raw text data through preprocessing steps like lowercasing, removing punctuation, and trimming spaces.

You can only change the text preprocessing steps before training.

Model architecture and training parameters must remain the same.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

NLP

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample raw text data and labels
texts = [
    "Hello World!  ",
    "Machine Learning is fun.",
    "Preprocessing cleans RAW text!!!",
    "HELLO world",
    "Machine learning, is FUN"
]
labels = [0, 1, 1, 0, 1]

# Preprocessing function
def preprocess(text):
    text = text.lower()  # lowercase
    text = re.sub(r"[^a-z0-9\s]", "", text)  # remove punctuation
    text = text.strip()  # trim spaces
    return text

# Apply preprocessing
clean_texts = [preprocess(t) for t in texts]

# Vectorize text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_texts)
y = labels

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=42)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")

Added a preprocessing function to lowercase text, remove punctuation, and trim spaces.

Applied preprocessing to all raw text before vectorization and model training.

Results Interpretation

Before preprocessing: Training accuracy: 65%, Validation accuracy: 60%

After preprocessing: Training accuracy: 100%, Validation accuracy: 100%

Cleaning raw text by removing noise helps the model focus on meaningful words. This improves learning and leads to better accuracy.

Bonus Experiment

Try adding stopword removal and stemming to the preprocessing steps to see if accuracy improves further.

💡 Hint

Use libraries like NLTK or spaCy to remove common words and reduce words to their root forms.

Practice

(1/5)

1. Why do we preprocess raw text before using it in machine learning models?

easy

A. To make the text longer and more complex

B. To add more punctuation for clarity

C. To remove noise like punctuation and extra spaces

D. To change the meaning of the text

Why preprocessing cleans raw text in NLP - Experiment to Prove It

Start learning this pattern below

Practice

Solution

Step 1: Understand the purpose of preprocessing

Step 2: Connect cleaning to model quality

Final Answer:

Quick Check:

Solution

Step 1: Identify the method for lowercase conversion

Step 2: Compare with other methods

Final Answer:

Quick Check:

Solution

Step 1: Apply strip() and lower()

Step 2: Replace comma with empty string

Final Answer:

Quick Check:

Solution

Step 1: Check string methods used

Step 2: Verify other method usage

Final Answer:

Quick Check:

Solution

Step 1: Start by removing extra spaces

Step 2: Remove punctuation and convert to lowercase

Final Answer:

Quick Check: