0
0
NLPml~20 mins

Why preprocessing cleans raw text in NLP - Experiment to Prove It

Choose your learning style9 modes available
Experiment - Why preprocessing cleans raw text
Problem:You have raw text data with lots of noise like punctuation, uppercase letters, and extra spaces. This noise makes it hard for a model to learn useful patterns.
Current Metrics:Model accuracy on text classification: 65% on training, 60% on validation
Issue:The model struggles because the raw text contains noise that confuses it, leading to lower accuracy.
Your Task
Improve model accuracy by cleaning the raw text data through preprocessing steps like lowercasing, removing punctuation, and trimming spaces.
You can only change the text preprocessing steps before training.
Model architecture and training parameters must remain the same.
Hint 1
Hint 2
Hint 3
Hint 4
Solution
NLP
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample raw text data and labels
texts = [
    "Hello World!  ",
    "Machine Learning is fun.",
    "Preprocessing cleans RAW text!!!",
    "HELLO world",
    "Machine learning, is FUN"
]
labels = [0, 1, 1, 0, 1]

# Preprocessing function
def preprocess(text):
    text = text.lower()  # lowercase
    text = re.sub(r"[^a-z0-9\s]", "", text)  # remove punctuation
    text = text.strip()  # trim spaces
    return text

# Apply preprocessing
clean_texts = [preprocess(t) for t in texts]

# Vectorize text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean_texts)
y = labels

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=42)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
Added a preprocessing function to lowercase text, remove punctuation, and trim spaces.
Applied preprocessing to all raw text before vectorization and model training.
Results Interpretation

Before preprocessing: Training accuracy: 65%, Validation accuracy: 60%

After preprocessing: Training accuracy: 100%, Validation accuracy: 100%

Cleaning raw text by removing noise helps the model focus on meaningful words. This improves learning and leads to better accuracy.
Bonus Experiment
Try adding stopword removal and stemming to the preprocessing steps to see if accuracy improves further.
💡 Hint
Use libraries like NLTK or spaCy to remove common words and reduce words to their root forms.