
How to Enrich Data During Indexing in Elasticsearch

To enrich data during indexing in Elasticsearch, use ingest pipelines with processors like set, script, or geoip to modify or add fields before storing documents. This allows you to transform, add metadata, or enhance data automatically as it is indexed.
📐

Syntax

An ingest pipeline is defined with a set of processors that modify documents during indexing. Each processor performs a specific task like adding fields, running scripts, or extracting data.

Basic syntax to create a pipeline (pipelines are created with a PUT request to the _ingest/pipeline endpoint):

PUT _ingest/pipeline/pipeline_name
{
  "description": "Pipeline description",
  "processors": [
    { "processor_type": { "field": "value", ... } },
    ...
  ]
}

When indexing, specify the pipeline name to apply it:

POST /index/_doc?pipeline=pipeline_name
{
  "field": "value"
}
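Alternatively, if every document written to an index should pass through the same pipeline, you can set it as the index's default via the standard index.default_pipeline setting, so indexing requests no longer need the query parameter (my_index and enrich_pipeline below are placeholder names):

```json
PUT /my_index/_settings
{
  "index.default_pipeline": "enrich_pipeline"
}
```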
For example, a minimal pipeline body that adds a field:
{
  "description": "Add a new field",
  "processors": [
    {
      "set": {
        "field": "new_field",
        "value": "enriched_value"
      }
    }
  ]
}
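Before wiring a pipeline into real indexing requests, you can dry-run it with the _simulate endpoint, which runs the processors against sample documents without storing anything (the pipeline name below is a placeholder):

```json
POST _ingest/pipeline/pipeline_name/_simulate
{
  "docs": [
    { "_source": { "field": "value" } }
  ]
}
```

The response shows each sample document as it would look after the processors run, which makes it easy to catch field-name mistakes early.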
💻

Example

This example creates an ingest pipeline that adds a source field and enriches IP data with geo-location info using the geoip processor. Then it indexes a document using this pipeline.

PUT _ingest/pipeline/enrich_pipeline
{
  "description": "Add source and geoip info",
  "processors": [
    {
      "set": {
        "field": "source",
        "value": "web"
      }
    },
    {
      "geoip": {
        "field": "ip"
      }
    }
  ]
}

POST /logs/_doc?pipeline=enrich_pipeline
{
  "ip": "8.8.8.8",
  "message": "User accessed the site"
}
Output

{
  "_index": "logs",
  "_id": "generated_id",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 2, "failed": 0 },
  "_seq_no": 0,
  "_primary_term": 1
}
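To confirm the enrichment took effect, fetch the document back by its _id (generated_id here stands in for whatever ID Elasticsearch assigned). Assuming the GeoIP database is available, the stored _source should now include the added source field plus a geoip object with fields such as country_name and location:

```json
GET /logs/_doc/generated_id
```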
⚠️

Common Pitfalls

  • Not specifying the pipeline name during indexing means enrichment won't happen.
  • Using incorrect processor field names causes pipeline failures.
  • Overloading pipelines with many processors can slow indexing.
  • For dynamic enrichment, scripts must be carefully tested to avoid errors.
Wrong way (missing pipeline):
POST /logs/_doc
{
  "ip": "8.8.8.8",
  "message": "User accessed the site"
}

Right way (using pipeline):
POST /logs/_doc?pipeline=enrich_pipeline
{
  "ip": "8.8.8.8",
  "message": "User accessed the site"
}
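Related to the pitfalls above: by default, a failing processor (for example, geoip on a document with a malformed IP) rejects the whole document. One way to keep indexing going is the ignore_missing option plus an on_failure handler, both standard processor settings; the geoip_error field below is an illustrative name, not a built-in:

```json
PUT _ingest/pipeline/enrich_pipeline
{
  "description": "Add source and geoip info, tolerating bad IPs",
  "processors": [
    { "set": { "field": "source", "value": "web" } },
    {
      "geoip": {
        "field": "ip",
        "ignore_missing": true,
        "on_failure": [
          { "set": { "field": "geoip_error", "value": "lookup failed" } }
        ]
      }
    }
  ]
}
```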
📊

Quick Reference

| Processor | Purpose                          | Example Usage                                          |
|-----------|----------------------------------|--------------------------------------------------------|
| set       | Add or update a field            | {"set": {"field": "status", "value": "active"}}        |
| geoip     | Add geo-location from IP         | {"geoip": {"field": "ip"}}                             |
| script    | Run custom script to modify data | {"script": {"source": "ctx.field += ' enriched'"}}     |
| rename    | Rename a field                   | {"rename": {"field": "old", "target_field": "new"}}    |
| remove    | Remove a field                   | {"remove": {"field": "temp"}}                          |
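The rename and remove processors are often combined to tidy documents as they arrive, renaming a raw field and dropping a temporary one in a single pipeline (field and pipeline names below are illustrative):

```json
PUT _ingest/pipeline/cleanup_pipeline
{
  "description": "Rename and prune fields",
  "processors": [
    { "rename": { "field": "msg", "target_field": "message" } },
    { "remove": { "field": "temp", "ignore_missing": true } }
  ]
}
```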

Key Takeaways

  • Use ingest pipelines with processors to enrich data automatically during indexing.
  • Always specify the pipeline name in the indexing request to apply enrichment.
  • Test processors and scripts carefully to avoid indexing errors.
  • Common processors include set, geoip, script, rename, and remove.
  • Keep pipelines efficient to maintain good indexing performance.