
Custom pipeline components in NLP

Introduction

Custom pipeline components let you add your own processing steps to an NLP pipeline, tailoring it to your specific needs. Typical use cases:

You want to clean or modify text in a special way before analysis.
You need to add extra information to text, like tags or labels.
You want to run your own code between standard NLP steps.
You want to customize how the pipeline handles your data.
You want to measure or log something during processing.
Syntax
NLP
from spacy.language import Language

@Language.component('custom_component')
def custom_component(doc):
    # your code here
    return doc

nlp.add_pipe('custom_component', last=True)

The function takes a Doc object and returns it after processing.

Register it with the @Language.component decorator, then add it to the pipeline by name with nlp.add_pipe().
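As a minimal end-to-end sketch, assuming spaCy v3 is installed (the name 'noop_component' is just an illustration), you can register a do-nothing component on a blank pipeline and confirm it was added by checking nlp.pipe_names:

```python
import spacy
from spacy.language import Language

# Register a do-nothing component under an illustrative name
@Language.component('noop_component')
def noop_component(doc):
    return doc

# A blank English pipeline needs no downloaded model
nlp = spacy.blank('en')
nlp.add_pipe('noop_component', last=True)

print(nlp.pipe_names)  # ['noop_component']
```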

Examples
This component adds an uppercase version of each token as a custom attribute.
NLP
from spacy.language import Language
from spacy.tokens import Token

# Register the custom attribute once before use
Token.set_extension('upper', default=None)

@Language.component('uppercase')
def uppercase_component(doc):
    for token in doc:
        token._.upper = token.text.upper()
    return doc

nlp.add_pipe('uppercase', last=True)
This component prints the number of tokens in the document during processing.
NLP
from spacy.language import Language

@Language.component('count_tokens')
def count_tokens(doc):
    print(f'Tokens in doc: {len(doc)}')
    return doc

nlp.add_pipe('count_tokens', first=True)
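A quick usage sketch for a component like this, assuming spaCy v3 and a blank English pipeline (no model download needed); the component name here is hypothetical:

```python
import spacy
from spacy.language import Language

@Language.component('count_tokens_demo')  # hypothetical name
def count_tokens_demo(doc):
    print(f'Tokens in doc: {len(doc)}')
    return doc

nlp = spacy.blank('en')
nlp.add_pipe('count_tokens_demo', first=True)

# The counter fires on every call to nlp()
doc = nlp('Custom components run on every call.')
```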
Sample Model

This program adds a custom pipeline component that marks tokens longer than 5 characters, then prints each token along with that flag.

NLP
import spacy
from spacy.tokens import Token

# Load small English model
nlp = spacy.load('en_core_web_sm')

# Register a custom token attribute 'is_long'
Token.set_extension('is_long', default=False)

# Define custom component to mark tokens longer than 5 characters
@spacy.Language.component('long_token_marker')
def long_token_marker(doc):
    for token in doc:
        token._.is_long = len(token.text) > 5
    return doc

# Add the component to the pipeline
nlp.add_pipe('long_token_marker', last=True)

# Process text
text = 'Spacy is a great library for natural language processing.'
doc = nlp(text)

# Print tokens and if they are long
for token in doc:
    print(f'{token.text}: is_long={token._.is_long}')
Output
Spacy: is_long=False
is: is_long=False
a: is_long=False
great: is_long=False
library: is_long=True
for: is_long=False
natural: is_long=True
language: is_long=True
processing: is_long=True
.: is_long=False
Important Notes

Custom components must always return the doc object.

You can add custom attributes to tokens, spans, or docs using set_extension.
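set_extension works the same way on Doc and Span. A small sketch assuming spaCy v3 (the attribute names are illustrative):

```python
import spacy
from spacy.tokens import Doc, Span

# Register custom attributes on Doc and Span (names are illustrative)
Doc.set_extension('source', default='unknown')
Span.set_extension('label_note', default=None)

nlp = spacy.blank('en')
doc = nlp('Custom attributes live under the ._ namespace.')

doc._.source = 'tutorial'
span = doc[0:2]
span._.label_note = 'first two tokens'

print(doc._.source, span._.label_note)
```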

Use the @spacy.Language.component decorator to register components cleanly.
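For components that take settings, spaCy also provides the @spacy.Language.factory decorator. A minimal sketch, assuming spaCy v3 (the factory name 'length_flagger' and the min_len setting are illustrative):

```python
import spacy
from spacy.language import Language

# The factory receives nlp, the component name, and any config settings
@Language.factory('length_flagger', default_config={'min_len': 5})
def create_length_flagger(nlp: Language, name: str, min_len: int):
    def length_flagger(doc):
        # Store the count in user_data to keep the sketch self-contained
        doc.user_data['n_long'] = sum(len(t.text) > min_len for t in doc)
        return doc
    return length_flagger

nlp = spacy.blank('en')
nlp.add_pipe('length_flagger', config={'min_len': 6})
doc = nlp('pipelines encourage composable processing')
print(doc.user_data['n_long'])
```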

Summary

Custom pipeline components let you add your own processing steps in NLP pipelines.

They take a Doc and return it after making changes.

You can add custom data or behavior to tokens or documents this way.