Custom pipeline components in NLP - ML Experiment: Train & Evaluate

Experiment - Custom pipeline components
Problem: You have a text classification task using an NLP pipeline. The current pipeline uses standard components but does not preprocess text to remove stopwords, which may reduce model accuracy.
Current Metrics: Training accuracy: 92%, Validation accuracy: 78%, Validation loss: 0.65
Issue: The model overfits: training accuracy is high but validation accuracy is much lower. The pipeline lacks a custom component to clean text by removing stopwords, which could improve generalization.
Your Task
Add a custom pipeline component to remove stopwords from the text before training. Aim to reduce overfitting by improving validation accuracy to above 85% while keeping training accuracy below 90%.
You must implement the stopword removal as a custom pipeline component.
Do not change the model architecture or hyperparameters.
Use only Python and standard NLP libraries.
Solution
import nltk
from nltk.corpus import stopwords
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Sample data
texts = ["I love this movie", "This movie is terrible", "Best film ever", "Worst film ever", "I enjoyed the movie", "I hated the movie"]
labels = [1, 0, 1, 0, 1, 0]

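# Split data (stratify so both classes appear in the tiny validation set;
# log_loss raises an error if y_val contains only one class)
X_train, X_val, y_train, y_val = train_test_split(texts, labels, test_size=0.33, random_state=42, stratify=labels)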

# Custom transformer to remove stopwords
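class StopwordRemover(BaseEstimator, TransformerMixin):
    """Drop English stopwords from each document (whitespace tokenization)."""

    def __init__(self):
        # Keep the NLTK list as a set for fast membership checks
        self.stopwords = set(stopwords.words('english'))

    def fit(self, X, y=None):
        # Stateless step: nothing is learned from the training data
        return self

    def transform(self, X):
        cleaned = []
        for doc in X:
            words = doc.split()
            # Compare in lowercase because the NLTK list is lowercase
            filtered = [word for word in words if word.lower() not in self.stopwords]
            cleaned.append(' '.join(filtered))
        return cleaned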

# Build pipeline with custom component
pipeline = Pipeline([
    ('stopword_removal', StopwordRemover()),
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000))
])

# Train model
pipeline.fit(X_train, y_train)

# Predict and evaluate
train_preds = pipeline.predict(X_train)
val_preds = pipeline.predict(X_val)
train_probs = pipeline.predict_proba(X_train)
val_probs = pipeline.predict_proba(X_val)

train_acc = accuracy_score(y_train, train_preds) * 100
val_acc = accuracy_score(y_val, val_preds) * 100
train_loss = log_loss(y_train, train_probs)
val_loss = log_loss(y_val, val_probs)

print(f"Training accuracy: {train_acc:.2f}%")
print(f"Validation accuracy: {val_acc:.2f}%")
print(f"Training loss: {train_loss:.4f}")
print(f"Validation loss: {val_loss:.4f}")
Created a custom transformer StopwordRemover to remove English stopwords from text.
Added StopwordRemover as the first step in the pipeline before vectorization.
Kept the rest of the pipeline and model unchanged.
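To see what the new step does before its output reaches the vectorizer, the transformer can be run on its own. The sketch below is a variant of StopwordRemover, not the solution's class: the name `ConfigurableStopwordRemover` and its `stop_words` parameter are assumptions made for illustration, and it uses a small hand-picked stopword list so that it runs without the NLTK download.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class ConfigurableStopwordRemover(BaseEstimator, TransformerMixin):
    """Hypothetical variant of StopwordRemover with an injectable stopword list."""

    def __init__(self, stop_words=None):
        # Store the argument unchanged so get_params() and clone() keep working
        self.stop_words = stop_words

    def fit(self, X, y=None):
        # Stateless step: nothing to learn
        return self

    def transform(self, X):
        stop = {w.lower() for w in (self.stop_words or [])}
        return [
            " ".join(w for w in doc.split() if w.lower() not in stop)
            for doc in X
        ]


# Hand-picked demo list (an assumption), not the NLTK list used in the solution
demo_stop_words = ["i", "this", "the", "is"]
remover = ConfigurableStopwordRemover(stop_words=demo_stop_words)
cleaned = remover.fit_transform(["I love this movie", "This movie is terrible"])
print(cleaned)  # ['love movie', 'movie terrible']
```

Taking the list as a constructor argument also makes it possible to treat the stopword list as a tunable setting, although the task here asks you to leave hyperparameters alone.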
Results Interpretation

Before: Training accuracy: 92%, Validation accuracy: 78%, Validation loss: 0.65

After: Training accuracy: 88%, Validation accuracy: 87%, Validation loss: 0.40

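Adding a custom pipeline component to clean the data (remove stopwords) reduced overfitting in this scenario: validation accuracy rose by 9 points (78% to 87%) while training accuracy fell by 4 points (92% to 88%), which indicates better generalization. These figures describe the exercise's scenario; the six-sentence sample in the solution code is too small to reproduce them.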
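The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that only restates the metrics reported in this exercise; the helper names `generalization_gap` and `meets_task_targets` are made up for illustration.

```python
# Metrics as reported in this exercise (not measured by this snippet)
before = {"train_acc": 92.0, "val_acc": 78.0, "val_loss": 0.65}
after = {"train_acc": 88.0, "val_acc": 87.0, "val_loss": 0.40}


def generalization_gap(metrics):
    """Training accuracy minus validation accuracy, in percentage points."""
    return metrics["train_acc"] - metrics["val_acc"]


def meets_task_targets(metrics):
    """Task targets: validation accuracy above 85%, training accuracy below 90%."""
    return metrics["val_acc"] > 85.0 and metrics["train_acc"] < 90.0


print(generalization_gap(before))  # 14.0
print(generalization_gap(after))   # 1.0
print(meets_task_targets(before))  # False
print(meets_task_targets(after))   # True
```

A gap that shrinks from 14 points to 1 point is the numerical signature of reduced overfitting that the interpretation describes.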
Bonus Experiment
Try adding another custom pipeline component that lemmatizes words before vectorization to see if it further improves validation accuracy.
💡 Hint
Use nltk's WordNetLemmatizer inside a custom transformer similar to StopwordRemover.
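One possible shape for that transformer is sketched below. The names `LemmaTransformer`, `wordnet_lemmatize_fn`, and the `lemmatize` parameter are assumptions for illustration. By default the sketch falls back to nltk's WordNetLemmatizer, which needs the WordNet data downloaded; the demo at the bottom swaps in a toy lookup table so it can run offline.

```python
from sklearn.base import BaseEstimator, TransformerMixin


def wordnet_lemmatize_fn():
    """Build a word -> lemma function from nltk's WordNetLemmatizer."""
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # needs network access on first run
    return WordNetLemmatizer().lemmatize  # treats every word as a noun by default


class LemmaTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical custom step that lemmatizes whitespace-separated words."""

    def __init__(self, lemmatize=None):
        # Optional word -> lemma callable; None means "use WordNet"
        self.lemmatize = lemmatize

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        lemmatize = self.lemmatize or wordnet_lemmatize_fn()
        return [" ".join(lemmatize(w.lower()) for w in doc.split()) for doc in X]


# Offline demo with a toy lookup table standing in for WordNet (an assumption)
toy_lemmas = {"movies": "movie", "films": "film"}
demo = LemmaTransformer(lemmatize=lambda w: toy_lemmas.get(w, w))
lemmatized = demo.transform(["Best films ever", "I enjoyed the movies"])
print(lemmatized)  # ['best film ever', 'i enjoyed the movie']
```

In the pipeline, a step such as ('lemmatize', LemmaTransformer()) would sit between 'stopword_removal' and 'tfidf'. Whether it helps validation accuracy is something to measure, not assume.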

Practice

(1/5)
1. What is the main purpose of a custom pipeline component in an NLP pipeline?
easy
A. To store the processed documents in a database
B. To replace the entire NLP model with a new one
C. To visualize the text data in charts
D. To add your own processing steps that modify the document

Solution

  1. Step 1: Understand the role of pipeline components

    Pipeline components process text step-by-step, modifying or analyzing it.
  2. Step 2: Identify what custom components do

    Custom components let you add your own processing steps that change the document or add data.
  3. Final Answer:

    To add your own processing steps that modify the document -> Option D
  4. Quick Check:

    Custom pipeline components = add processing steps [OK]
Hint: Custom components add steps that change the document [OK]
Common Mistakes:
  • Thinking custom components replace the whole model
  • Confusing visualization with processing
  • Assuming storage is part of pipeline components
2. Which of the following is the correct way to define a custom pipeline component function in Python?
easy
A. def custom_component(text): return text
B. def custom_component(doc): print(doc)
C. def custom_component(doc): return doc
D. def custom_component(): return None

Solution

  1. Step 1: Recall the function signature for custom components

    In spaCy, a custom component is a function that takes a Doc object and returns it after processing.
  2. Step 2: Check each option

    def custom_component(doc): return doc matches the signature and returns the doc. Others either take wrong input or don't return doc.
  3. Final Answer:

    def custom_component(doc): return doc -> Option C
  4. Quick Check:

    Function takes doc and returns doc [OK]
Hint: Custom component functions take and return doc objects [OK]
Common Mistakes:
  • Using text instead of doc as input
  • Not returning the doc object
  • Missing the doc parameter
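The quiz does not say which spaCy version it targets. Assuming spaCy 3.x, the function keeps the same take-a-doc, return-a-doc shape, but it is registered under a string name and added to the pipeline by that name. A minimal sketch, where the component name `passthrough_component` is made up for illustration:

```python
import spacy
from spacy.language import Language


@Language.component("passthrough_component")
def passthrough_component(doc):
    # A real component would modify or annotate the doc here
    return doc


nlp = spacy.blank("en")  # tokenizer-only pipeline, no model download needed
nlp.add_pipe("passthrough_component", last=True)

doc = nlp("Custom components receive a Doc.")
print(nlp.pipe_names)  # ['passthrough_component']
```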
3. Given this custom component code:
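import spacy
from spacy.language import Language
from spacy.tokens import Token

nlp = spacy.blank("en")
Token.set_extension("is_custom", default=False, force=True)

@Language.component("add_custom_attr")
def add_custom_attr(doc):
    for token in doc:
        token._.is_custom = token.text.isalpha()
    return doc

nlp.add_pipe("add_custom_attr", last=True)

text = 'Hello 123!'
doc = nlp(text)
print([token._.is_custom for token in doc])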

What will be the printed output?
medium
A. [True, True, False]
B. [True, False, False]
C. [True, False, True]
D. [False, False, False]

Solution

  1. Step 1: Analyze the tokens in the text

    The text 'Hello 123!' splits into tokens: 'Hello', '123', '!'.
  2. Step 2: Check the custom attribute logic

    For each token, isalpha() returns True if all characters are letters. 'Hello' is True, '123' and '!' are False.
  3. Final Answer:

    [True, False, False] -> Option B
  4. Quick Check:

    isalpha() per token = [True, False, False] [OK]
Hint: Check token text with isalpha() for True/False [OK]
Common Mistakes:
  • Assuming punctuation is alpha
  • Counting tokens incorrectly
  • Forgetting to return doc
4. What is wrong with this custom pipeline component code?
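# Assumes nlp and the is_custom token extension are already set up
from spacy.language import Language

@Language.component("faulty_component")
def faulty_component(doc):
    for token in doc:
        token._.is_custom = token.text.isdigit()
    # Missing return statement

nlp.add_pipe("faulty_component", last=True)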
medium
A. It does not return the doc object
B. It uses an invalid attribute name
C. It modifies tokens outside the loop
D. It should not be added to the pipeline

Solution

  1. Step 1: Check the function structure

    The function loops over tokens and sets a custom attribute but does not return the doc.
  2. Step 2: Recall pipeline component requirements

    Custom components must return the doc object to continue the pipeline correctly.
  3. Final Answer:

    It does not return the doc object -> Option A
  4. Quick Check:

    Missing return doc causes pipeline failure [OK]
Hint: Always return doc at end of custom component [OK]
Common Mistakes:
  • Forgetting to return doc
  • Using wrong attribute names without registration
  • Adding component incorrectly
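The failure can be reproduced in a few lines. This sketch assumes spaCy 3.x, where a component that returns None is expected to make the pipeline call raise a ValueError; the names `no_return_component` and `is_digit_token` are made up for illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# Register the custom token attribute before any component assigns to it
Token.set_extension("is_digit_token", default=False, force=True)


@Language.component("no_return_component")
def no_return_component(doc):
    for token in doc:
        token._.is_digit_token = token.text.isdigit()
    # Bug reproduced on purpose: there is no "return doc"


nlp = spacy.blank("en")
nlp.add_pipe("no_return_component", last=True)

try:
    nlp("Room 42")
    failed = False
except ValueError as err:
    failed = True
    print(f"Pipeline failed: {err}")
```

Adding return doc as the last line of the function is the whole fix.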
5. You want to create a custom pipeline component that counts how many tokens in a document are uppercase and stores this count as doc._.uppercase_count. Which of the following is the correct approach?
hard
A. Register a doc extension for 'uppercase_count', define a component that counts uppercase tokens, assign the count to doc._.uppercase_count, and return doc
B. Add a token extension for 'uppercase_count' and count uppercase tokens per token
C. Modify tokens in place without registering any extension and return doc
D. Create a new NLP model that outputs uppercase counts directly

Solution

  1. Step 1: Understand extension registration

    To add a new attribute to doc._, you must first register a doc extension with Doc.set_extension.
  2. Step 2: Implement counting and assignment

    Count uppercase tokens in the component, assign the count to doc._.uppercase_count, then return doc.
  3. Final Answer:

    Register a doc extension for 'uppercase_count', define a component that counts uppercase tokens, assign the count to doc._.uppercase_count, and return doc -> Option A
  4. Quick Check:

    Doc extension + count + assign + return doc [OK]
Hint: Register doc extension before assigning custom doc attributes [OK]
Common Mistakes:
  • Not registering the doc extension before use
  • Using token extension for doc-level data
  • Not returning doc at the end
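The approach in Option A can be sketched as follows, assuming spaCy 3.x and reading "uppercase token" as a token whose text is entirely uppercase (spaCy's token.is_upper). The component name `uppercase_counter` is made up for illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Step 1: register the doc-level extension before assigning to it
Doc.set_extension("uppercase_count", default=0, force=True)


@Language.component("uppercase_counter")
def uppercase_counter(doc):
    # Step 2: count the tokens, store the count on the doc, and return the doc
    doc._.uppercase_count = sum(1 for token in doc if token.is_upper)
    return doc


nlp = spacy.blank("en")  # tokenizer-only pipeline, no model download needed
nlp.add_pipe("uppercase_counter", last=True)

doc = nlp("NASA and ESA launched a probe")
print(doc._.uppercase_count)  # 2
```

The count lives on the Doc because it describes the whole document. A token extension, as in Option B, would store one value per token instead.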