
Document processing pipeline in NLP - ML Experiment: Train & Evaluate

Experiment - Document processing pipeline
Problem: You have a set of text documents and want to build a pipeline that cleans the text, extracts important features, and classifies the documents into categories.
Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Validation loss: 0.85
Issue: The model is overfitting. Training accuracy is 25 percentage points higher than validation accuracy, which shows poor generalization.
Your Task
Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.
You can only modify the preprocessing steps and model architecture.
You cannot add more training data.
You must keep the same dataset and classification labels.
Solution
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin
import re

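class TextCleaner(BaseEstimator, TransformerMixin):
    """Lowercase each document and keep only the letters a-z and single spaces."""

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform(self, X):
        cleaned = []
        for doc in X:
            doc = doc.lower()  # lowercase
            doc = re.sub(r'[^a-z ]', ' ', doc)  # replace non-letters with spaces
            doc = re.sub(r'\s+', ' ', doc)  # collapse repeated spaces
            cleaned.append(doc.strip())
        return cleaned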

# Load dataset
categories = ['alt.atheism', 'sci.space', 'comp.graphics']
data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
X_train, X_val, y_train, y_val = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Build pipeline
pipeline = Pipeline([
    ('cleaner', TextCleaner()),
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=1000)),
    ('clf', MLPClassifier(hidden_layer_sizes=(50,), alpha=0.01, max_iter=100, early_stopping=True, random_state=42))
])

# Train model
pipeline.fit(X_train, y_train)

# Evaluate
train_preds = pipeline.predict(X_train)
val_preds = pipeline.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
Added a custom text cleaning step to lowercase and remove non-letter characters.
Used TfidfVectorizer with English stopword removal and a vocabulary capped at the 1,000 most frequent terms (max_features=1000) instead of raw counts.
Reduced model complexity to one hidden layer with 50 neurons.
Added L2 regularization (alpha=0.01) to the MLPClassifier.
Enabled early stopping to prevent over-training.
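
The early stopping idea in the last point can be sketched in plain Python. This is only an illustration of the stopping rule, not scikit-learn's internal code: the function name `early_stopping_epoch`, the `patience` and `tol` parameters, and the per-epoch validation scores are all assumptions made up for the example.

```python
def early_stopping_epoch(val_scores, patience=3, tol=1e-4):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation score has failed to beat the best
    score seen so far by more than `tol` for `patience` consecutive epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best + tol:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return len(val_scores) - 1  # rule never triggered: all epochs were run


# Hypothetical validation accuracies per epoch (illustrative values only).
scores = [0.60, 0.72, 0.80, 0.84, 0.85, 0.85, 0.84, 0.85, 0.83, 0.82]
stop_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch)  # 7
```

The best score (0.85) is reached at epoch 4; epochs 5, 6, and 7 do not improve on it, so training stops at epoch 7 instead of continuing to fit the training set.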
Results Interpretation

Before: Training accuracy: 95%, Validation accuracy: 70%, Validation loss: 0.85

After: Training accuracy: 90.5%, Validation accuracy: 86.3%
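
As a quick arithmetic check, the two sets of numbers above can be compared against the task's goal. The helper names `generalization_gap` and `meets_target` are assumptions introduced for this sketch; the thresholds (validation at least 85%, training below 92%) come from the task description.

```python
def generalization_gap(train_acc, val_acc):
    """Training minus validation accuracy, in percentage points."""
    return round(train_acc - val_acc, 1)


def meets_target(train_acc, val_acc, min_val=85.0, max_train=92.0):
    """Check the task's goal: validation >= 85% and training < 92%."""
    return val_acc >= min_val and train_acc < max_train


before = (95.0, 70.0)  # training, validation accuracy before the changes
after = (90.5, 86.3)   # training, validation accuracy after the changes

print(generalization_gap(*before), meets_target(*before))  # 25.0 False
print(generalization_gap(*after), meets_target(*after))    # 4.2 True
```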

Cleaning the text and using TF-IDF features help the model focus on informative words. Reducing the model size, adding L2 regularization, and stopping early limit overfitting, so training accuracy drops slightly while validation accuracy and generalization improve.
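
To see why TF-IDF favors informative words, here is a minimal plain-Python sketch of the weighting. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, which is the default used by TfidfVectorizer, and it leaves out the L2 normalization that the vectorizer applies afterwards. The function name `tfidf` and the three toy documents are made up for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute unnormalized TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]


# Toy corpus (illustrative only): "launch" appears in every document.
docs = [
    "space shuttle launch".split(),
    "graphics card launch".split(),
    "space probe launch".split(),
]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
# shuttle: 1.693
# space: 1.288
# launch: 1.000
```

The word that appears in every document gets the lowest weight, while the word unique to the first document gets the highest.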
Bonus Experiment
Try adding dropout layers in a deep learning model using TensorFlow or PyTorch to further reduce overfitting.
Hint: Use dropout with a rate around 0.3 after the dense layers, and monitor validation accuracy to find the best dropout rate.
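
Before reaching for a framework, it can help to see what a dropout layer does to its inputs. The following is a plain-Python sketch of inverted dropout, not a TensorFlow or PyTorch layer; the function name `dropout` and its parameters are assumptions chosen for illustration.

```python
import random


def dropout(activations, rate=0.3, training=True, rng=None):
    """Inverted dropout on a list of activations.

    During training, each value is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate) so the expected value is unchanged.
    At inference time the input is returned as-is.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [1.0] * 10
dropped = dropout(hidden, rate=0.3, rng=rng)
print(dropped)                           # a mix of 0.0 and about 1.43
print(dropout(hidden, training=False))   # unchanged at inference time
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit, which is why dropout tends to reduce overfitting.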

Practice

1. What is the main purpose of a document processing pipeline in NLP?
easy
A. To break down text tasks into smaller, manageable steps
B. To store documents in a database
C. To translate documents into multiple languages
D. To generate random text from documents

Solution

  1. Step 1: Understand the pipeline concept

    A document processing pipeline divides a big task into smaller steps to handle text better.
  2. Step 2: Identify the main goal

    The goal is to make complex text easier to process by breaking it down.
  3. Final Answer:

    To break down text tasks into smaller, manageable steps -> Option A
  4. Quick Check:

    Pipeline purpose = break down tasks [OK]
Hint: Think of a pipeline as a step-by-step recipe for text
Common Mistakes:
  • Confusing pipeline with storage or translation
  • Thinking pipeline generates text
  • Ignoring the step-by-step nature
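
The "step-by-step recipe" idea can be sketched as a list of small functions applied in order. The helper name `run_pipeline` and the three steps are assumptions chosen for illustration, not part of any library.

```python
def run_pipeline(text, steps):
    """Pass the data through each step in order; each output feeds the next step."""
    data = text
    for step in steps:
        data = step(data)
    return data


steps = [
    str.lower,                                          # normalize case
    str.split,                                          # tokenize on whitespace
    lambda tokens: [t.strip(".,!?") for t in tokens],   # strip punctuation
]

result = run_pipeline("Pipelines break big tasks into steps.", steps)
print(result)  # ['pipelines', 'break', 'big', 'tasks', 'into', 'steps']
```

Each step does one small job, so steps can be tested, reordered, or replaced independently.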
2. Which of the following is the correct order of steps in a simple document processing pipeline?
easy
A. Stopword Removal -> Lemmatization -> Tokenization
B. Lemmatization -> Tokenization -> Stopword Removal
C. Tokenization -> Stopword Removal -> Lemmatization
D. Tokenization -> Lemmatization -> Stopword Removal

Solution

  1. Step 1: Recall common pipeline steps

    Tokenization splits text into words, stopword removal deletes common words, lemmatization reduces words to base form.
  2. Step 2: Determine logical order

    First split text (tokenize), then remove stopwords, then lemmatize remaining words.
  3. Final Answer:

    Tokenization -> Stopword Removal -> Lemmatization -> Option C
  4. Quick Check:

    Order = tokenize, remove stopwords, lemmatize [OK]
Hint: Split text first, then clean, then normalize words
Common Mistakes:
  • Removing stopwords before tokenizing
  • Lemmatizing before tokenizing
  • Mixing step order randomly
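
The order from the answer can be sketched in a few lines of plain Python. The stopword set and the `LEMMAS` lookup table are tiny hypothetical stand-ins; a real pipeline would typically use a library lemmatizer.

```python
STOPWORDS = {"the", "are", "is", "a", "in"}                   # tiny illustrative list
LEMMAS = {"cats": "cat", "running": "run", "dogs": "dog"}     # toy lookup, not a real lemmatizer


def tokenize(text):
    """Step 1: lowercase and split on whitespace."""
    return text.lower().split()


def remove_stopwords(tokens):
    """Step 2: drop common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def lemmatize(tokens):
    """Step 3: map each remaining word to its base form when known."""
    return [LEMMAS.get(t, t) for t in tokens]


tokens = lemmatize(remove_stopwords(tokenize("The cats are running in the park")))
print(tokens)  # ['cat', 'run', 'park']
```

Tokenizing first is what makes the other two steps possible: both operate on individual words, not on the raw string.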
3. Given this Python snippet in a document pipeline:
text = "Cats are running fast"
tokens = text.lower().split()
filtered = [w for w in tokens if w not in ['are', 'is', 'the']]
print(filtered)

What is the output?
medium
A. ['cats', 'running', 'fast']
B. ['Cats', 'are', 'running', 'fast']
C. ['cats', 'are', 'running', 'fast']
D. ['running', 'fast']

Solution

  1. Step 1: Lowercase and split text

    "Cats are running fast" becomes ['cats', 'are', 'running', 'fast'] after lower() and split().
  2. Step 2: Remove stopwords

    The filter drops any token found in ['are', 'is', 'the']. Only 'are' appears in the token list, so it is the only word removed.
  3. Final Answer:

    ['cats', 'running', 'fast'] -> Option A
  4. Quick Check:

    Stopwords removed = ['cats', 'running', 'fast'] [OK]
Hint: Lowercase then remove stopwords from tokens
Common Mistakes:
  • Not lowercasing before filtering
  • Including stopwords in output
  • Confusing original and filtered lists
4. This code is part of a document pipeline:
def clean_text(text):
    tokens = text.split()
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    tokens = lemmatize(tokens)
    return tokens

stopwords = ['and', 'the', 'is']

print(clean_text("The cats and dogs are playing"))

What is the error here?
medium
A. text.split() should be text.lower().split()
B. lemmatize function is not defined
C. stopwords list is empty
D. tokens list is not returned

Solution

  1. Step 1: Check function definitions

    The code calls lemmatize(tokens), but no lemmatize function is defined or imported, so Python raises a NameError when clean_text runs.
  2. Step 2: Verify other parts

    The stopwords list is defined before clean_text is called, the tokens are returned, and lowercasing after splitting gives the same tokens as lowercasing before it.
  3. Final Answer:

    lemmatize function is not defined -> Option B
  4. Quick Check:

    Missing lemmatize function causes a NameError [OK]
Hint: Check if all functions used are defined or imported
Common Mistakes:
  • Assuming lemmatize is built-in
  • Ignoring missing function errors
  • Thinking stopwords list is empty
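
One way to fix the snippet is to define lemmatize before it is called. The version below uses a hypothetical lookup table as a stand-in; in a real project the function would more likely come from an NLP library.

```python
STOPWORDS = ["and", "the", "is"]  # same list as in the question

# Hypothetical stand-in: a toy lookup table, not a real lemmatizer.
LEMMAS = {"cats": "cat", "dogs": "dog", "playing": "play"}


def lemmatize(tokens):
    """Map each token to its base form when it is in the lookup table."""
    return [LEMMAS.get(t, t) for t in tokens]


def clean_text(text):
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return lemmatize(tokens)


result = clean_text("The cats and dogs are playing")
print(result)  # ['cat', 'dog', 'are', 'play']
```

Note that 'are' survives because it is not in the question's stopword list.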
5. You want to build a document processing pipeline that extracts keywords from large documents. Which sequence of steps is best?
hard
A. POS Tagging -> Keyword Extraction -> Tokenization -> Stopword Removal
B. Keyword Extraction -> Tokenization -> Stopword Removal -> POS Tagging
C. Stopword Removal -> Tokenization -> Keyword Extraction -> POS Tagging
D. Tokenization -> Stopword Removal -> POS Tagging -> Keyword Extraction

Solution

  1. Step 1: Understand keyword extraction needs

    Extracting keywords requires clean tokens and knowing word types (POS tags) to pick important words.
  2. Step 2: Arrange logical steps

    First tokenize text, remove stopwords to clean, then tag parts of speech, finally extract keywords based on tags.
  3. Final Answer:

    Tokenization -> Stopword Removal -> POS Tagging -> Keyword Extraction -> Option D
  4. Quick Check:

    Pipeline order = tokenize, clean, tag, extract [OK]
Hint: Clean tokens before tagging and extracting keywords
Common Mistakes:
  • Extracting keywords before tokenizing
  • Tagging before cleaning tokens
  • Wrong step order breaks pipeline
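
The four steps in the answer can be sketched end to end in plain Python. Everything here is a simplified assumption: `POS_LEXICON` is a hypothetical lookup table standing in for a real part-of-speech tagger, and "keywords" are taken to be the most frequent nouns. Real taggers often use the surrounding words as context, so some pipelines tag the full sentence before removing stopwords; this sketch follows the order given in the answer.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "are", "in"}

# Hypothetical toy lexicon standing in for a real POS tagger.
POS_LEXICON = {
    "telescope": "NOUN", "images": "NOUN", "galaxy": "NOUN", "orbit": "NOUN",
    "captures": "VERB", "sharp": "ADJ", "distant": "ADJ",
}


def tokenize(text):
    return [w.strip(".,") for w in text.lower().split()]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def pos_tag(tokens):
    # Unknown words get the placeholder tag "X".
    return [(t, POS_LEXICON.get(t, "X")) for t in tokens]


def extract_keywords(tagged, top_n=3):
    # Keep nouns only and rank them by how often they occur.
    counts = Counter(word for word, tag in tagged if tag == "NOUN")
    return [word for word, _ in counts.most_common(top_n)]


text = "The telescope captures sharp images of a distant galaxy. The telescope is in orbit."
keywords = extract_keywords(pos_tag(remove_stopwords(tokenize(text))))
print(keywords)
```

"telescope" ranks first because it is the only noun that appears twice.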