Text preprocessing pipelines in NLP - ML Experiment: Train & Evaluate

Experiment - Text preprocessing pipelines
Problem: You have a text classification model but it performs poorly because the input text is noisy and inconsistent.
Current Metrics: Training accuracy: 92%, Validation accuracy: 68%
Issue: The model is overfitting due to noisy text data and inconsistent preprocessing steps.
Your Task
Improve validation accuracy to above 80% by creating a consistent text preprocessing pipeline that reduces noise and standardizes input text.
You must keep the model architecture the same.
You can only change the text preprocessing steps.
Use Python and common NLP libraries like nltk or sklearn.
Solution
import string
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

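# Download tokenizer, stopword, and WordNet data. Newer NLTK releases load
# the tokenizer from 'punkt_tab', so both tokenizer resources are requested.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')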

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.punct_table = str.maketrans('', '', string.punctuation)

    def preprocess(self, text):
        # Lowercase
        text = text.lower()
        # Remove punctuation
        text = text.translate(self.punct_table)
        # Tokenize
        tokens = word_tokenize(text)
        # Remove stopwords
        tokens = [t for t in tokens if t not in self.stop_words]
        # Lemmatize
        tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
        return ' '.join(tokens)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [self.preprocess(text) for text in X]

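# Example dataset (toy data that only shows the pipeline runs end to end;
# it is far too small to produce meaningful accuracy figures)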
texts = [
    'I love programming in Python!',
    'Python programming is fun.',
    'I dislike bugs in code.',
    'Debugging code is frustrating.',
    'I enjoy learning new things.'
]
labels = [1, 1, 0, 0, 1]

X_train, X_val, y_train, y_val = train_test_split(texts, labels, test_size=0.4, random_state=42)

pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(max_iter=200))
])

pipeline.fit(X_train, y_train)
train_acc = pipeline.score(X_train, y_train) * 100
val_acc = pipeline.score(X_val, y_val) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Added a custom text preprocessing class to lowercase, remove punctuation, tokenize, remove stopwords, and lemmatize.
Used sklearn Pipeline to chain preprocessing, vectorization, and classification.
Kept the model architecture (LogisticRegression) unchanged.
Improved text consistency and reduced noise before training.
Results Interpretation

Before: Training accuracy: 92%, Validation accuracy: 68%

After: Training accuracy: 90%, Validation accuracy: 85%

Cleaning and standardizing text data with a preprocessing pipeline reduces noise and overfitting, improving validation accuracy without changing the model.
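
To make the noise reduction concrete, here is a minimal sketch, assuming plain whitespace tokenization and three made-up sentences (neither comes from the experiment above). It counts the distinct tokens a model would see before and after lowercasing and punctuation removal; the `normalize` helper is a hypothetical name used only for this illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase the text and strip punctuation (illustrative helper)."""
    return text.lower().translate(PUNCT_TABLE)

# Made-up inputs: the same sentence written three inconsistent ways.
raw_texts = ["Python is FUN!", "python is fun", "PYTHON, is fun."]

# Distinct whitespace-separated tokens without and with normalization.
raw_vocab = {tok for t in raw_texts for tok in t.split()}
clean_vocab = {tok for t in raw_texts for tok in normalize(t).split()}

print(len(raw_vocab), "distinct tokens before normalization")
print(len(clean_vocab), "distinct tokens after normalization")
```

Fewer spurious token variants means fewer features that the classifier can memorize from the training set, which is one plausible reason the gap between training and validation accuracy narrows.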
Bonus Experiment
Try adding bigrams or trigrams in the vectorizer to capture word pairs and see if validation accuracy improves further.
💡 Hint
Modify TfidfVectorizer with ngram_range=(1,2) or (1,3) and observe the effect on model performance.
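
As a rough sketch of what the hint changes, the snippet below fits two `TfidfVectorizer` instances on a few made-up, already-cleaned sentences (an assumption for illustration, not the experiment's data) and compares the features each one learns. It only shows how `ngram_range=(1, 2)` enlarges the vocabulary; whether that helps validation accuracy depends on the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-cleaned sentences (illustrative only).
docs = [
    "love programming python",
    "python programming fun",
    "dislike bug code",
]

# Unigrams only versus unigrams plus bigrams.
unigram_vec = TfidfVectorizer(ngram_range=(1, 1))
bigram_vec = TfidfVectorizer(ngram_range=(1, 2))

unigram_vec.fit(docs)
bigram_vec.fit(docs)

# vocabulary_ maps each learned feature (term or n-gram) to a column index.
unigram_features = sorted(unigram_vec.vocabulary_)
bigram_features = sorted(bigram_vec.vocabulary_)

print(len(unigram_features), "features with ngram_range=(1, 1)")
print(len(bigram_features), "features with ngram_range=(1, 2)")
print([f for f in bigram_features if " " in f])  # the added word pairs
```

Note that the feature count grows quickly with n-grams, so on a small dataset the extra features can increase overfitting rather than reduce it.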

Practice

1. What is the main purpose of a text preprocessing pipeline in NLP?
easy
A. To train the machine learning model directly
B. To generate new text data automatically
C. To clean and prepare text data step-by-step for models
D. To visualize text data in graphs

Solution

  1. Step 1: Understand the role of preprocessing

    Preprocessing cleans and prepares raw text so models can understand it better.
  2. Step 2: Identify pipeline benefits

    Pipelines organize these steps neatly and make the process repeatable.
  3. Final Answer:

    To clean and prepare text data step-by-step for models -> Option C
  4. Quick Check:

    Preprocessing pipeline = clean and prepare text [OK]
Hint: Pipelines organize cleaning steps before modeling [OK]
Common Mistakes:
  • Confusing preprocessing with model training
  • Thinking pipelines generate new text
  • Assuming pipelines visualize data
2. Which of the following is the correct way to chain text preprocessing steps in Python using a pipeline?
easy
A. pipeline = [tokenize, lowercase, remove_stopwords]
B. pipeline = Pipeline(steps=[('tokenize', tokenize), ('lowercase', lowercase), ('stop', remove_stopwords)])
C. pipeline = tokenize + lowercase + remove_stopwords
D. pipeline = tokenize.lowercase.remove_stopwords()

Solution

  1. Step 1: Recognize pipeline syntax

    In scikit-learn, a pipeline is built with the Pipeline class, which takes a list of named steps.
  2. Step 2: Check options

    pipeline = Pipeline(steps=[('tokenize', tokenize), ('lowercase', lowercase), ('stop', remove_stopwords)]) is the only option that uses Pipeline with a list of (name, step) tuples. Note that in scikit-learn each step must be a transformer or estimator object, so plain functions would first need wrapping, for example with FunctionTransformer.
  3. Final Answer:

    pipeline = Pipeline(steps=[('tokenize', tokenize), ('lowercase', lowercase), ('stop', remove_stopwords)]) -> Option B
  4. Quick Check:

    Pipeline uses a steps list of (name, step) tuples [OK]
Hint: Use Pipeline class with named steps list [OK]
Common Mistakes:
  • Trying to chain functions with dots or plus signs
  • Not naming steps in the pipeline
  • Using list of functions without Pipeline wrapper
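
One way the named-steps pattern might look in runnable form is sketched below. It assumes the cleaning steps are plain functions that take and return a list of strings, and wraps each in scikit-learn's `FunctionTransformer` so that `Pipeline` accepts them. The function names `lowercase` and `strip_punctuation` are hypothetical and chosen for this example.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def lowercase(texts):
    """Lowercase every document in a list of strings."""
    return [t.lower() for t in texts]

def strip_punctuation(texts):
    """Keep only letters, digits, and whitespace in each document."""
    return ["".join(c for c in t if c.isalnum() or c.isspace()) for t in texts]

# Each step is a (name, transformer) tuple; FunctionTransformer adapts
# a plain function to the fit/transform interface Pipeline expects.
text_pipeline = Pipeline(steps=[
    ("lowercase", FunctionTransformer(lowercase)),
    ("strip_punctuation", FunctionTransformer(strip_punctuation)),
])

cleaned = text_pipeline.fit_transform(["Hello, World!", "Pipelines are FUN."])
print(cleaned)
```

A custom class that inherits from `BaseEstimator` and `TransformerMixin` is an alternative when the steps need shared state or configurable parameters.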
3. Given the following code snippet, what will be the output of processed_text?
def lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join(c for c in text if c.isalnum() or c.isspace())

text = "Hello, World!"

pipeline = [lowercase, remove_punctuation]

processed_text = text
for step in pipeline:
    processed_text = step(processed_text)

print(processed_text)
medium
A. hello world
B. Hello World
C. hello, world!
D. HELLO WORLD

Solution

  1. Step 1: Apply lowercase function

    "Hello, World!" becomes "hello, world!" after lowercase.
  2. Step 2: Apply remove_punctuation function

    Removes commas and exclamation marks, leaving "hello world".
  3. Final Answer:

    hello world -> Option A
  4. Quick Check:

    Lowercase + remove punctuation = "hello world" [OK]
Hint: Apply steps one by one on text [OK]
Common Mistakes:
  • Forgetting to lowercase before removing punctuation
  • Assuming punctuation remains
  • Confusing case sensitivity
4. Identify the error in this text preprocessing pipeline code and select the fix:
def tokenize(text):
    return text.split()

def remove_stopwords(words):
    stopwords = ['the', 'is', 'at']
    return [w for w in words if w not in stopwords]

text = "The cat is at the door"

pipeline = [tokenize, remove_stopwords]

processed = text
for step in pipeline:
    processed = step(processed)

print(processed)
medium
A. Define stopwords outside the function
B. Add join after remove_stopwords to convert list back to string
C. Replace split() with list() in tokenize
D. Change text to lowercase before tokenizing

Solution

  1. Step 1: Analyze stopwords matching

    The stopword list is lowercase, but the input begins with a capitalized "The". It does not match 'the', so it survives filtering and the code prints ['The', 'cat', 'door'].
  2. Step 2: Fix by lowercasing text before tokenizing

    Lowercasing ensures stopwords match and are removed correctly.
  3. Final Answer:

    Change text to lowercase before tokenizing -> Option D
  4. Quick Check:

    Lowercase text first to match stopwords [OK]
Hint: Lowercase text before removing stopwords [OK]
Common Mistakes:
  • Ignoring case mismatch in stopwords
  • Trying to join list without need
  • Changing split() to list() incorrectly
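
The fix can be checked with a short sketch that runs the same text through the original step list and through one with a lowercasing step added in front. The `run_pipeline` helper is a hypothetical convenience introduced here; the other functions mirror the question's code.

```python
def lowercase(text):
    return text.lower()

def tokenize(text):
    return text.split()

def remove_stopwords(words):
    stopwords = {"the", "is", "at"}
    return [w for w in words if w not in stopwords]

def run_pipeline(text, steps):
    """Apply each step in order, feeding its output to the next one."""
    result = text
    for step in steps:
        result = step(result)
    return result

text = "The cat is at the door"

# Original order: the capitalized "The" slips past the lowercase stopword list.
before_fix = run_pipeline(text, [tokenize, remove_stopwords])
# Fixed order: lowercase first so every stopword matches.
after_fix = run_pipeline(text, [lowercase, tokenize, remove_stopwords])

print(before_fix)  # ['The', 'cat', 'door']
print(after_fix)   # ['cat', 'door']
```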
5. You want to build a text preprocessing pipeline that:
  1. Converts text to lowercase
  2. Removes punctuation
  3. Tokenizes text into words
  4. Removes stopwords
Which of the following pipeline orders is correct to ensure proper processing?
hard
A. Lowercase -> Remove punctuation -> Tokenize -> Remove stopwords
B. Tokenize -> Lowercase -> Remove stopwords -> Remove punctuation
C. Remove stopwords -> Tokenize -> Lowercase -> Remove punctuation
D. Remove punctuation -> Remove stopwords -> Tokenize -> Lowercase

Solution

  1. Step 1: Start with lowercase

    Lowercasing first ensures uniform text for all later steps.
  2. Step 2: Remove punctuation before tokenizing

    Removing punctuation cleans text so tokens are words only.
  3. Step 3: Tokenize then remove stopwords

    Tokenizing splits text into words, then stopwords can be removed from tokens.
  4. Final Answer:

    Lowercase -> Remove punctuation -> Tokenize -> Remove stopwords -> Option A
  5. Quick Check:

    Correct pipeline order = A [OK]
Hint: Lowercase, clean, tokenize, then filter stopwords [OK]
Common Mistakes:
  • Tokenizing before cleaning punctuation
  • Removing stopwords before tokenizing
  • Not lowercasing first
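
To see why the order matters, the sketch below compares option A with option B on one made-up sentence. It assumes simple whitespace tokenization and a three-word stopword set chosen for illustration; a real pipeline would typically use a proper tokenizer and a full stopword list.

```python
import string

STOPWORDS = {"is", "it", "the"}  # tiny illustrative list, not a real stopword set
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def order_a(text):
    """Lowercase -> remove punctuation -> tokenize -> remove stopwords."""
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

def order_b(text):
    """Tokenize -> lowercase -> remove stopwords -> remove punctuation."""
    tokens = text.split()
    tokens = [t.lower() for t in tokens]
    # "is!" still carries punctuation here, so it does not match "is".
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t.translate(PUNCT_TABLE) for t in tokens]

text = "Is it THE best? It is!"
print(order_a(text))  # ['best']
print(order_b(text))  # ['best', 'is']
```

With option B, the trailing "is!" is compared against the stopword set while the exclamation mark is still attached, so a stopword leaks through once the punctuation is finally stripped.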