Custom pipeline components let you add your own steps to process text in NLP. This helps you tailor the pipeline to your specific needs.
Custom pipeline components in NLP
Introduction
Syntax
NLP
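import spacy

@spacy.Language.component('custom_component')
def custom_component(doc):
    # your code here
    return doc

nlp.add_pipe('custom_component', last=True)

The function takes a doc object and returns it after processing.
In spaCy v3, register the function with the @spacy.Language.component decorator, then pass the registered name (a string) to nlp.add_pipe() to add it to the pipeline. Passing the function object itself, as spaCy v2 allowed, raises an error in v3.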
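The placement arguments of nlp.add_pipe() can be checked by inspecting nlp.pipe_names. The following is a minimal sketch, assuming spaCy v3 is installed; the component names demo_first and demo_last are hypothetical and exist only for this illustration. It uses a blank English pipeline so no model download is needed.

```python
import spacy
from spacy.language import Language  # same class as spacy.Language


# Hypothetical do-nothing components, used only to show ordering.
@Language.component("demo_first")
def demo_first(doc):
    return doc


@Language.component("demo_last")
def demo_last(doc):
    return doc


# A blank pipeline contains only the tokenizer, so pipe_names starts empty.
nlp = spacy.blank("en")
nlp.add_pipe("demo_last", last=True)    # appended at the end
nlp.add_pipe("demo_first", first=True)  # inserted at the front

print(nlp.pipe_names)
```

With a loaded model such as en_core_web_sm, the same calls would place the components around the model's built-in components instead.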
Examples
NLP
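import spacy
from spacy.tokens import Token

# Register the custom attribute before any component assigns to it
Token.set_extension('upper', default=None)

@spacy.Language.component('uppercase')
def uppercase_component(doc):
    for token in doc:
        token._.upper = token.text.upper()
    return doc

nlp.add_pipe('uppercase', last=True)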
NLP
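import spacy

@spacy.Language.component('count_tokens')
def count_tokens(doc):
    print(f'Tokens in doc: {len(doc)}')
    return doc

# first=True runs this component before the others in the pipeline
nlp.add_pipe('count_tokens', first=True)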
Sample Model
This program adds a custom pipeline component that marks tokens longer than 5 characters. It then prints each token with this info.
NLP
import spacy
from spacy.tokens import Token

# Load small English model
nlp = spacy.load('en_core_web_sm')

# Register a custom token attribute 'is_long'
Token.set_extension('is_long', default=False)

# Define custom component to mark tokens longer than 5 characters
@spacy.Language.component('long_token_marker')
def long_token_marker(doc):
    for token in doc:
        token._.is_long = len(token.text) > 5
    return doc

# Add the component to the pipeline
nlp.add_pipe('long_token_marker', last=True)

# Process text
text = 'Spacy is a great library for natural language processing.'
doc = nlp(text)

# Print tokens and if they are long
for token in doc:
    print(f'{token.text}: is_long={token._.is_long}')
Important Notes
Custom components must always return the doc object.
You can add custom attributes to tokens, spans, or docs using set_extension.
Use the @spacy.Language.component decorator to register a component under a name, then add it with nlp.add_pipe() using that name. spaCy v3 requires this registration.
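The notes above mention extensions on tokens, spans, and docs. The following is a minimal sketch of the doc and span cases, assuming spaCy v3; the attribute names n_long_tokens and is_short and the component name long_token_counter are hypothetical, chosen only for illustration. force=True is passed so the snippet can be re-run in the same session without a registration error.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span

# Doc-level attribute with a plain default value (hypothetical name).
Doc.set_extension("n_long_tokens", default=0, force=True)

# Span-level attribute computed on access through a getter (hypothetical name).
Span.set_extension("is_short", getter=lambda span: len(span) <= 2, force=True)


@Language.component("long_token_counter")
def long_token_counter(doc):
    # Store one number on the whole document rather than on each token.
    doc._.n_long_tokens = sum(len(token.text) > 5 for token in doc)
    return doc


nlp = spacy.blank("en")  # tokenizer only, no model download required
nlp.add_pipe("long_token_counter", last=True)

doc = nlp("Custom attributes live under the underscore")
print(doc._.n_long_tokens)   # value written by the component
print(doc[0:2]._.is_short)   # getter evaluated for a two-token span
print(doc[0:4]._.is_short)   # getter evaluated for a four-token span
```

A default suits values that a component writes, while a getter suits values that can be derived on demand from the span or doc itself.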
Summary
Custom pipeline components let you add your own processing steps in NLP pipelines.
They take a doc and return it after changes.
You can add custom data or behavior to tokens or documents this way.
Practice
1. What is the main purpose of a custom pipeline component in an NLP pipeline?
easy
Solution
Step 1: Understand the role of pipeline components
Pipeline components process text step-by-step, modifying or analyzing it.
Step 2: Identify what custom components do
Custom components let you add your own processing steps that change the document or add data.
Final Answer:
To add your own processing steps that modify the document -> Option D
Quick Check:
Custom pipeline components = add processing steps [OK]
Hint: Custom components add steps that change the document [OK]
Common Mistakes:
- Thinking custom components replace the whole model
- Confusing visualization with processing
- Assuming storage is part of pipeline components
2. Which of the following is the correct way to define a custom pipeline component function in Python?
easy
Solution
Step 1: Recall the function signature for custom components
Custom components take a doc object and return it after processing.
Step 2: Check each option
def custom_component(doc): return doc matches the signature and returns the doc. The others either take the wrong input or don't return doc.
Final Answer:
def custom_component(doc): return doc -> Option C
Quick Check:
Function takes doc and returns doc [OK]
Hint: Custom component functions take and return doc objects [OK]
Common Mistakes:
- Using text instead of doc as input
- Not returning the doc object
- Missing the doc parameter
3. Given this custom component code:
@spacy.Language.component('add_custom_attr')
def add_custom_attr(doc):
    for token in doc:
        token._.is_custom = token.text.isalpha()
    return doc

nlp.add_pipe('add_custom_attr', last=True)
text = 'Hello 123!'
doc = nlp(text)
print([token._.is_custom for token in doc])
What will be the printed output?
medium
Solution
Step 1: Analyze the tokens in the text
The text 'Hello 123!' splits into tokens: 'Hello', '123', '!'.
Step 2: Check the custom attribute logic
For each token, isalpha() returns True if all characters are letters. 'Hello' is True, '123' and '!' are False.
Final Answer:
[True, False, False] -> Option B
Quick Check:
isalpha() per token = [True, False, False] [OK]
Hint: Check token text with isalpha() for True/False [OK]
Common Mistakes:
- Assuming punctuation is alpha
- Counting tokens incorrectly
- Forgetting to return doc
4. What is wrong with this custom pipeline component code?
@spacy.Language.component('faulty_component')
def faulty_component(doc):
    for token in doc:
        token._.is_custom = token.text.isdigit()
    # Missing return statement

nlp.add_pipe('faulty_component', last=True)
medium
Solution
Step 1: Check the function structure
The function loops over tokens and sets a custom attribute but does not return the doc.
Step 2: Recall pipeline component requirements
Custom components must return the doc object so the pipeline can continue. Without a return statement, the function returns None.
Final Answer:
It does not return the doc object -> Option A
Quick Check:
Missing return doc causes pipeline failure [OK]
Hint: Always return doc at end of custom component [OK]
Common Mistakes:
- Forgetting to return doc
- Using wrong attribute names without registration
- Adding component incorrectly
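The failure caused by a missing return can be reproduced directly. The following is a minimal sketch, assuming spaCy v3, where the pipeline raises a ValueError when a component does not hand back a Doc; the component name forgets_return is hypothetical, and the exact error message may differ between versions.

```python
import spacy
from spacy.language import Language


@Language.component("forgets_return")
def forgets_return(doc):
    for token in doc:
        pass  # any per-token work would go here
    # No return statement, so Python returns None implicitly.


nlp = spacy.blank("en")  # tokenizer only, no model download required
nlp.add_pipe("forgets_return")

try:
    nlp("Hello 123!")
    failed = False
except ValueError as err:
    # spaCy reports which component failed to return a Doc.
    failed = True
    print(f"Pipeline stopped: {err}")
```

Adding return doc as the last line of the function resolves the error.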
5. You want to create a custom pipeline component that counts how many tokens in a document are uppercase and stores this count as doc._.uppercase_count. Which of the following is the correct approach?
hard
Solution
Step 1: Understand extension registration
To add a new attribute to doc._, you must register a doc extension first.
Step 2: Implement counting and assignment
Count uppercase tokens in the component, assign the count to doc._.uppercase_count, then return doc.
Final Answer:
Register a doc extension for 'uppercase_count', define a component that counts uppercase tokens, assign the count to doc._.uppercase_count, and return doc -> Option A
Quick Check:
Doc extension + count + assign + return doc [OK]
Hint: Register doc extension before assigning custom doc attributes [OK]
Common Mistakes:
- Not registering the doc extension before use
- Using token extension for doc-level data
- Not returning doc at the end
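The approach described in the solution can be sketched as follows. This is one possible implementation, assuming spaCy v3 and a blank English pipeline; the component name uppercase_counter and the sample sentence are illustrative choices, and token.is_upper is used here as the test for an uppercase token.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Step 1: register the doc-level extension before anything assigns to it.
# force=True only avoids an error if this snippet is run twice in one session.
Doc.set_extension("uppercase_count", default=0, force=True)


# Step 2: count, assign, and return the doc.
@Language.component("uppercase_counter")
def uppercase_counter(doc):
    doc._.uppercase_count = sum(1 for token in doc if token.is_upper)
    return doc


nlp = spacy.blank("en")  # tokenizer only, no model download required
nlp.add_pipe("uppercase_counter", last=True)

doc = nlp("NASA and the EU signed a deal")
print(doc._.uppercase_count)  # counts 'NASA' and 'EU'
```

The count is stored on the Doc rather than on each Token because it describes the document as a whole.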
