NLPml~20 mins

Text preprocessing pipelines in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Text preprocessing pipelines

Problem:You have a text classification model but it performs poorly because the input text is noisy and inconsistent.

Current Metrics:Training accuracy: 92%, Validation accuracy: 68%

Issue:The model is overfitting due to noisy text data and inconsistent preprocessing steps.

Your Task

Improve validation accuracy to above 80% by creating a consistent text preprocessing pipeline that reduces noise and standardizes input text.

You must keep the model architecture the same.

You can only change the text preprocessing steps.

Use Python and common NLP libraries like nltk or sklearn.

Hint 1

Hint 2

Hint 3

Hint 4

Hint 5

Solution

NLP

import string
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.punct_table = str.maketrans('', '', string.punctuation)

    def preprocess(self, text):
        # Lowercase
        text = text.lower()
        # Remove punctuation
        text = text.translate(self.punct_table)
        # Tokenize
        tokens = word_tokenize(text)
        # Remove stopwords
        tokens = [t for t in tokens if t not in self.stop_words]
        # Lemmatize
        tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
        return ' '.join(tokens)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [self.preprocess(text) for text in X]

# Example dataset
texts = [
    'I love programming in Python!',
    'Python programming is fun.',
    'I dislike bugs in code.',
    'Debugging code is frustrating.',
    'I enjoy learning new things.'
]
labels = [1, 1, 0, 0, 1]

X_train, X_val, y_train, y_val = train_test_split(texts, labels, test_size=0.4, random_state=42)

pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=200))
])

pipeline.fit(X_train, y_train)
train_acc = pipeline.score(X_train, y_train) * 100
val_acc = pipeline.score(X_val, y_val) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')

Added a custom text preprocessing class to lowercase, remove punctuation, tokenize, remove stopwords, and lemmatize.

Used sklearn Pipeline to chain preprocessing, vectorization, and classification.

Kept the model architecture (LogisticRegression) unchanged.

Improved text consistency and reduced noise before training.

Results Interpretation

Before: Training accuracy: 92%, Validation accuracy: 68%

After: Training accuracy: 90%, Validation accuracy: 85%

Cleaning and standardizing text data with a preprocessing pipeline reduces noise and overfitting, improving validation accuracy without changing the model.

Bonus Experiment

Try adding bigrams or trigrams in the vectorizer to capture word pairs and see if validation accuracy improves further.

💡 Hint

Modify TfidfVectorizer with ngram_range=(1,2) or (1,3) and observe the effect on model performance.

Practice

(1/5)

1. What is the main purpose of a text preprocessing pipeline in NLP?

easy

A. To train the machine learning model directly

B. To generate new text data automatically

C. To clean and prepare text data step-by-step for models

D. To visualize text data in graphs

Text preprocessing pipelines in NLP - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand the role of preprocessing

Step 2: Identify pipeline benefits

Final Answer:

Quick Check:

Solution

Step 1: Recognize pipeline syntax

Step 2: Check options

Final Answer:

Quick Check:

Solution

Step 1: Apply lowercase function

Step 2: Apply remove_punctuation function

Final Answer:

Quick Check:

Solution

Step 1: Analyze stopwords matching

Step 2: Fix by lowercasing text before tokenizing

Final Answer:

Quick Check:

Solution

Step 1: Start with lowercase

Step 2: Remove punctuation before tokenizing

Step 3: Tokenize then remove stopwords

Final Answer:

Quick Check: