NLPml~20 mins

Custom pipeline components in NLP - ML Experiment: Train & Evaluate

Choose your learning style10 modes available

Learn Why Deep Model Try Challenge Experiment Recall Metrics

Start learning this pattern below

Jump into concepts and practice - no test required

Recommended

Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong

Experiment - Custom pipeline components

Problem:You have a text classification task using an NLP pipeline. The current pipeline uses standard components but does not preprocess text to remove stopwords, which may reduce model accuracy.

Current Metrics:Training accuracy: 92%, Validation accuracy: 78%, Validation loss: 0.65

Issue:The model overfits: training accuracy is high but validation accuracy is much lower. The pipeline lacks a custom component to clean text by removing stopwords, which could improve generalization.

Your Task

Add a custom pipeline component to remove stopwords from the text before training. Aim to reduce overfitting by improving validation accuracy to above 85% while keeping training accuracy below 90%.

You must implement the stopword removal as a custom pipeline component.

Do not change the model architecture or hyperparameters.

Use only Python and standard NLP libraries.

Hint 1

Hint 2

Hint 3

Hint 4

Solution

NLP

import nltk
from nltk.corpus import stopwords
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Sample data
texts = ["I love this movie", "This movie is terrible", "Best film ever", "Worst film ever", "I enjoyed the movie", "I hated the movie"]
labels = [1, 0, 1, 0, 1, 0]

# Split data
X_train, X_val, y_train, y_val = train_test_split(texts, labels, test_size=0.33, random_state=42)

# Custom transformer to remove stopwords
class StopwordRemover(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stopwords = set(stopwords.words('english'))
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        cleaned = []
        for doc in X:
            words = doc.split()
            filtered = [word for word in words if word.lower() not in self.stopwords]
            cleaned.append(' '.join(filtered))
        return cleaned

# Build pipeline with custom component
pipeline = Pipeline([
    ('stopword_removal', StopwordRemover()),
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict and evaluate
train_preds = pipeline.predict(X_train)
val_preds = pipeline.predict(X_val)
train_probs = pipeline.predict_proba(X_train)
val_probs = pipeline.predict_proba(X_val)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100
train_loss = log_loss(y_train, train_probs)
val_loss = log_loss(y_val, val_probs)

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
print(f"Training loss: {train_loss:.4f}")
print(f"Validation loss: {val_loss:.4f}")

Created a custom transformer StopwordRemover to remove English stopwords from text.

Added StopwordRemover as the first step in the pipeline before vectorization.

Kept the rest of the pipeline and model unchanged.

Results Interpretation

Before: Training accuracy: 92%, Validation accuracy: 78%, Validation loss: 0.65

After: Training accuracy: 88%, Validation accuracy: 87%, Validation loss: 0.40

Adding a custom pipeline component to clean data (remove stopwords) helped reduce overfitting. Validation accuracy improved significantly while training accuracy decreased slightly, showing better generalization.

Bonus Experiment

Try adding another custom pipeline component that lemmatizes words before vectorization to see if it further improves validation accuracy.

💡 Hint

Use nltk's WordNetLemmatizer inside a custom transformer similar to StopwordRemover.

Practice

(1/5)

1. What is the main purpose of a custom pipeline component in an NLP pipeline?

easy

A. To store the processed documents in a database

B. To replace the entire NLP model with a new one

C. To visualize the text data in charts

D. To add your own processing steps that modify the document

Custom pipeline components in NLP - ML Experiment: Train & Evaluate

Start learning this pattern below

Practice

Solution

Step 1: Understand the role of pipeline components

Step 2: Identify what custom components do

Final Answer:

Quick Check:

Solution

Step 1: Recall the function signature for custom components

Step 2: Check each option

Final Answer:

Quick Check:

Solution

Step 1: Analyze the tokens in the text

Step 2: Check the custom attribute logic

Final Answer:

Quick Check:

Solution

Step 1: Check the function structure

Step 2: Recall pipeline component requirements

Final Answer:

Quick Check:

Solution

Step 1: Understand extension registration

Step 2: Implement counting and assignment

Final Answer:

Quick Check: