Custom pipeline components let you add your own processing steps to a spaCy NLP pipeline, so you can tailor the pipeline to your specific needs.
Custom pipeline components in NLP
Introduction
A custom component is useful in situations like these:
You want to clean or modify text in a special way before analysis.
You need to add extra information to text, like tags or labels.
You want to run your own code between standard NLP steps.
You want to customize how the pipeline handles your data.
You want to measure or log something during processing.
Syntax
NLP
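import spacy

@spacy.Language.component('custom_component')
def custom_component(doc):
    # your code here
    return doc

# nlp is an already loaded spaCy pipeline
nlp.add_pipe('custom_component', last=True)

The function takes a Doc object and returns it after processing.
In spaCy v3, register the function under a name with @spacy.Language.component, then pass that name to nlp.add_pipe() to add the component to the pipeline.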
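As a minimal runnable sketch of the syntax above, the snippet below registers a component that does nothing and adds it to a blank English pipeline. It assumes spaCy v3 or later. `spacy.blank("en")` is used only so the example runs without downloading a model, and the name `noop_component` is hypothetical.

```python
import spacy
from spacy.language import Language


# Register the function under a string name (hypothetical name for this sketch).
@Language.component("noop_component")
def noop_component(doc):
    # A component receives a Doc and must hand a Doc back.
    return doc


# A blank pipeline contains only a tokenizer, so no model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("noop_component", last=True)

doc = nlp("Custom components are just functions.")
print(nlp.pipe_names)  # ['noop_component']
```

Printing `nlp.pipe_names` is a quick way to confirm that the component was added and where it sits in the pipeline.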
Examples
This component adds an uppercase version of each token as a custom attribute.
NLP
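import spacy
from spacy.tokens import Token

# The custom attribute must be registered before it can be set
Token.set_extension('upper', default=None)

@spacy.Language.component('uppercase')
def uppercase_component(doc):
    for token in doc:
        token._.upper = token.text.upper()
    return doc

nlp.add_pipe('uppercase', last=True)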
This component prints the number of tokens in the document during processing.
NLP
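import spacy

@spacy.Language.component('count_tokens')
def count_tokens(doc):
    print(f'Tokens in doc: {len(doc)}')
    return doc

# With no position argument, the component is appended to the end of the pipeline
nlp.add_pipe('count_tokens')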
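Where a component runs depends on the position arguments passed to `nlp.add_pipe()`: `first`, `last`, `before`, and `after`. The sketch below, assuming spaCy v3 and a blank English pipeline, records the order in which two components are called. The component names `log_a` and `log_b` and the `calls` list are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.language import Language

# Records the order in which the components actually run.
calls = []


@Language.component("log_a")
def log_a(doc):
    calls.append("log_a")
    return doc


@Language.component("log_b")
def log_b(doc):
    calls.append("log_b")
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("log_a")              # no position given: appended at the end
nlp.add_pipe("log_b", first=True)  # placed at the start of the pipeline

nlp("A short text.")
print(nlp.pipe_names)  # ['log_b', 'log_a']
print(calls)           # ['log_b', 'log_a']
```

Although `log_a` was added first, `log_b` runs before it because of `first=True`. Passing `before="log_a"` instead would give the same order here.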
Sample Program
This program adds a custom pipeline component that marks tokens longer than 5 characters. It then prints each token together with its is_long flag.
NLP
import spacy
from spacy.tokens import Token

# Load small English model
nlp = spacy.load('en_core_web_sm')

# Register a custom token attribute 'is_long'
Token.set_extension('is_long', default=False)

# Define custom component to mark tokens longer than 5 characters
@spacy.Language.component('long_token_marker')
def long_token_marker(doc):
    for token in doc:
        token._.is_long = len(token.text) > 5
    return doc

# Add the component to the pipeline
nlp.add_pipe('long_token_marker', last=True)

# Process text
text = 'Spacy is a great library for natural language processing.'
doc = nlp(text)

# Print tokens and if they are long
for token in doc:
    print(f'{token.text}: is_long={token._.is_long}')
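Output
Spacy: is_long=False
is: is_long=False
a: is_long=False
great: is_long=False
library: is_long=True
for: is_long=False
natural: is_long=True
language: is_long=True
processing: is_long=True
.: is_long=False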
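The sample stores one flag per token. If you also want a summary for the whole document, the same extension mechanism works on `Doc`. The following is a sketch under a few assumptions: it uses a blank English pipeline so it runs without a model download, and the names `is_long_demo`, `long_count`, and `long_token_counter` are hypothetical, chosen so they do not clash with the sample above.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token

# force=True lets the snippet be re-run without an "already registered" error.
Token.set_extension("is_long_demo", default=False, force=True)
Doc.set_extension("long_count", default=0, force=True)


@Language.component("long_token_counter")
def long_token_counter(doc):
    for token in doc:
        token._.is_long_demo = len(token.text) > 5
    # Store a per-document summary next to the per-token flags.
    doc._.long_count = sum(token._.is_long_demo for token in doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("long_token_counter")

doc = nlp("Spacy is a great library for natural language processing.")
print(doc._.long_count)  # 4
```

The four long tokens are "library", "natural", "language", and "processing". "Spacy" and "great" have exactly 5 characters, so they do not count.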
Important Notes
Custom components must always return the doc object.
You can add custom attributes to tokens, spans, or docs using set_extension.
Use the @spacy.Language.component decorator to register a component under a name, then pass that name to nlp.add_pipe().
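To see why returning the doc matters, the sketch below registers a deliberately broken component that forgets its `return` statement. Assuming spaCy v3, the pipeline is expected to reject the `None` result with a `ValueError`. The component name `forgets_return` and the `failed` flag are hypothetical.

```python
import spacy
from spacy.language import Language


@Language.component("forgets_return")
def forgets_return(doc):
    # Bug on purpose: the function falls through and returns None.
    len(doc)


nlp = spacy.blank("en")
nlp.add_pipe("forgets_return")

try:
    nlp("This will fail.")
    failed = False
except ValueError as err:
    # spaCy reports that the component did not return a Doc.
    failed = True
    print(f"Pipeline rejected the component: {err}")
```

Adding `return doc` as the last line of the function makes the error go away.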
Summary
Custom pipeline components let you add your own processing steps in NLP pipelines.
Each component takes a Doc, optionally modifies it, and returns it.
You can add custom data or behavior to tokens or documents this way.