
Document processing pipeline in NLP - ML Experiment: Train & Evaluate


Experiment - Document processing pipeline

Problem: You have a set of text documents and want to build a pipeline that cleans the text, extracts important features, and classifies the documents into categories.

Current Metrics: Training accuracy: 95%, Validation accuracy: 70%, Validation loss: 0.85

Issue: The model is overfitting. Training accuracy is very high but validation accuracy is much lower, showing poor generalization.

Your Task

Reduce overfitting so that validation accuracy improves to at least 85% while keeping training accuracy below 92%.

You can only modify the preprocessing steps and model architecture.

You cannot add more training data.

You must keep the same dataset and classification labels.
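The targets above can be written down as a small check, which makes it easy to tell whether a run meets them. This is a minimal sketch: the function name `check_targets` and its parameters are made up for illustration and are not part of the exercise code.

```python
def check_targets(train_acc, val_acc, min_val=85.0, max_train=92.0):
    """Compare accuracies (in percent) against the task targets.

    min_val and max_train default to the thresholds stated in the task.
    """
    return {
        # Train/validation gap in percentage points; a large gap signals overfitting.
        "gap": train_acc - val_acc,
        # Validation accuracy must reach at least 85%.
        "val_ok": val_acc >= min_val,
        # Training accuracy must stay below 92%.
        "train_ok": train_acc < max_train,
    }


# Starting point from the problem statement: 95% training, 70% validation.
baseline = check_targets(95.0, 70.0)
print(baseline)
```

For the starting metrics the gap is 25 points and both targets are missed.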


Solution


from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin
import re

class TextCleaner(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        cleaned = []
        for doc in X:
            doc = doc.lower()  # lowercase
            doc = re.sub(r'[^a-z ]', ' ', doc)  # remove non-letters
            doc = re.sub(r'\s+', ' ', doc)  # remove extra spaces
            cleaned.append(doc.strip())
        return cleaned

# Load dataset
categories = ['alt.atheism', 'sci.space', 'comp.graphics']
data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
X_train, X_val, y_train, y_val = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Build pipeline
pipeline = Pipeline([
    ('cleaner', TextCleaner()),
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=1000)),
    ('clf', MLPClassifier(hidden_layer_sizes=(50,), alpha=0.01, max_iter=100, early_stopping=True, random_state=42))
])

# Train model
pipeline.fit(X_train, y_train)

# Evaluate
train_preds = pipeline.predict(X_train)
val_preds = pipeline.predict(X_val)
train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100

print(f'Training accuracy: {train_acc:.2f}%')
print(f'Validation accuracy: {val_acc:.2f}%')
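To see what the cleaning step does to a single document, the same three operations can be run on one string. This is a small sketch: the `clean` helper and the sample sentence are made up for illustration.

```python
import re


def clean(doc):
    """Apply the same three steps as TextCleaner.transform to one string."""
    doc = doc.lower()                   # lowercase
    doc = re.sub(r"[^a-z ]", " ", doc)  # replace anything that is not a-z or a space
    doc = re.sub(r"\s+", " ", doc)      # collapse runs of whitespace
    return doc.strip()


sample = "NASA's Shuttle launched in 1981!\n  Graphics: 3-D rendering."
cleaned = clean(sample)
print(cleaned)  # nasa s shuttle launched in graphics d rendering
```

Digits and punctuation disappear, so "1981" is dropped and "NASA's" and "3-D" leave stray single-letter tokens. TfidfVectorizer's default token pattern keeps only tokens of two or more characters, so those stray letters do not become features.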

Added a custom text cleaning step to lowercase and remove non-letter characters.

Used TfidfVectorizer with stopword removal and limited features instead of raw counts.

Reduced model complexity to one hidden layer with 50 neurons.

Added L2 regularization (alpha=0.01) to the MLPClassifier.

Enabled early stopping, so training halts once the score on an internal held-out split of the training data stops improving.
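The idea behind early stopping can be sketched in a few lines. This is a simplified illustration, not MLPClassifier's actual implementation: the function name, the `patience` parameter, and the validation scores are all invented for the example.

```python
def early_stopping_epoch(val_scores, patience=3, tol=1e-4):
    """Return (stop_epoch, best_epoch) for a sequence of validation scores.

    Training stops once the score has failed to beat the best score
    by more than `tol` for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    best_epoch = -1
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch, score in enumerate(val_scores):
        if score > best_score + tol:
            best_score = score
            best_epoch = epoch
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience limit was hit.
    return len(val_scores) - 1, best_epoch


# Hypothetical per-epoch validation accuracies (made-up numbers).
scores = [0.70, 0.78, 0.83, 0.86, 0.85, 0.86, 0.84, 0.83]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

Here the score peaks at epoch 3 and training stops at epoch 6, after three epochs without improvement, instead of continuing to fit the training set.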

Results Interpretation

Before: Training accuracy: 95%, Validation accuracy: 70%, Validation loss: 0.85

After: Training accuracy: 90.5%, Validation accuracy: 86.3%

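Cleaning the text and using TF-IDF features help the model focus on informative words instead of noise. Reducing the model size and adding L2 regularization limit how closely the network can fit the training set: training accuracy drops from 95% to 90.5%, while validation accuracy rises from 70% to 86.3%, shrinking the gap from 25 points to about 4.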
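A hand-rolled TF-IDF makes the "important words" point concrete. The sketch below assumes the smoothed IDF formula and L2 normalization that TfidfVectorizer uses by default, but it skips stop-word removal and `max_features`. The `tfidf` function and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger values.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({t: w / norm for t, w in raw.items()})
    return weights


docs = [
    "space shuttle launch orbit",
    "graphics card renders space scene",
    "space debate about belief",
]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

"space" appears in every document, so it gets a lower weight in the first document than "shuttle", which appears only there. Words that separate the categories end up with more influence than words shared by all of them.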

Bonus Experiment

Try adding dropout layers in a deep learning model using TensorFlow or PyTorch to further reduce overfitting.

💡 Hint

Use dropout with a rate around 0.3 after dense layers and monitor validation accuracy to find the best dropout rate.
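The sketch below does not use TensorFlow or PyTorch. It is a plain-Python illustration of what a dropout layer computes (inverted dropout), so the effect of a rate of 0.3 is visible without a framework. The `dropout` function and its parameters are assumptions made for this example.

```python
import random


def dropout(activations, rate=0.3, training=True, rng=None):
    """Inverted dropout on a list of floats.

    During training each value is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    Outside training the input is returned as-is.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)          # fixed seed so the run is repeatable
hidden = [1.0] * 10_000         # stand-in for the outputs of a dense layer
dropped = dropout(hidden, rate=0.3, training=True, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(f"kept: {kept_fraction:.2f}, mean activation: {mean_activation:.2f}")
```

About 70% of the units survive each pass and the average activation stays close to 1.0. Framework dropout layers follow the same pattern: they are active only in training mode, so validation accuracy is measured with all units on.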

Practice

(1/5)

1. What is the main purpose of a document processing pipeline in NLP?

easy

A. To break down text tasks into smaller, manageable steps

B. To store documents in a database

C. To translate documents into multiple languages

D. To generate random text from documents

