Bulk indexing optimization in Elasticsearch - Time & Space Complexity
When adding many documents to Elasticsearch at once, it is important to understand how the time taken grows as the number of documents increases.
We want to know how the bulk indexing process scales with more data.
Analyze the time complexity of the following bulk indexing request.
POST /my_index/_bulk
{ "index": { "_id": "1" } }
{ "field": "value1" }
{ "index": { "_id": "2" } }
{ "field": "value2" }
{ "index": { "_id": "3" } }
{ "field": "value3" }
This code sends multiple documents in one bulk request to Elasticsearch for indexing.
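As a rough illustration of how such a request body is put together, the sketch below builds the newline-delimited JSON from a list of documents using only the Python standard library. The helper name `build_bulk_body` and the sample documents are assumptions made for this example, not part of any Elasticsearch client.

```python
import json


def build_bulk_body(docs):
    """Return an NDJSON bulk body: one action line plus one source line per document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


docs = [("1", {"field": "value1"}), ("2", {"field": "value2"}), ("3", {"field": "value3"})]
body = build_bulk_body(docs)
print(body)
```

The loop visits every document exactly once, which is the repeated operation the analysis below counts.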
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Processing each document in the bulk request one by one.
- How many times: Once for each document in the bulk batch.
As the number of documents in the bulk request increases, the total work grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 document processes |
| 100 | 100 document processes |
| 1000 | 1000 document processes |
Pattern observation: Doubling the number of documents roughly doubles the work needed.
Time Complexity: O(n)
This means the time to index grows linearly with the number of documents sent in the bulk request, assuming the documents are of similar size.
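One way to picture this is a simple cost model: a fixed overhead per request plus a constant cost per document. The function below is only an illustrative sketch; `request_overhead_ms` and `per_doc_ms` are made-up numbers, not measurements from Elasticsearch.

```python
def estimated_bulk_time_ms(n_docs, request_overhead_ms=5.0, per_doc_ms=0.2):
    """Toy model: one fixed request cost plus a constant cost per document."""
    return request_overhead_ms + n_docs * per_doc_ms


# Print the modelled time for the input sizes used in the table above.
for n in (10, 100, 1000):
    print(n, estimated_bulk_time_ms(n))
```

The overhead term does not depend on n, so for large batches the total is dominated by the per-document term, which is the linear behaviour described here.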
[X] Wrong: "Sending more documents in one bulk request will make indexing time stay the same or grow very little."
[OK] Correct: Each document still needs to be processed, so the total time grows roughly in direct proportion to the number of documents.
Understanding how bulk indexing scales helps you design efficient data loading processes and shows you can reason about performance in real systems.
"What if we split the bulk request into many smaller batches instead of one large batch? How would the time complexity change?"
Practice
What is the main advantage of using the _bulk API in Elasticsearch for indexing documents?
Solution
Step 1: Understand the purpose of bulk API
The bulk API is designed to send multiple documents in a single request to Elasticsearch.
Step 2: Identify the main advantage
Sending many documents at once reduces network overhead and speeds up indexing.
Final Answer:
It reduces the number of network requests by sending many documents at once. -> Option A
Quick Check:
Bulk API = fewer requests = faster indexing [OK]
- Thinking bulk API fixes document errors automatically
- Believing bulk API compresses data for storage
- Assuming bulk API indexes documents one by one
Solution
Step 1: Review bulk action types
The Elasticsearch bulk API supports several actions: index, create, update, and delete.
Step 2: Check each option
A shows an index action, C an update action, D a create action. All are valid formats.
Final Answer:
A, C, and D are all valid bulk actions -> Option B
Quick Check:
Bulk supports index, create, update (and delete) actions [OK]
- Thinking only index action is allowed
- Confusing create and update JSON formats
- Missing newline between action and data lines
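To make the difference between the action types concrete, here is a minimal sketch that assembles a bulk body mixing `index`, `create`, `update`, and `delete`. It uses only the standard library; the index name `my_index`, the IDs, and the helper `bulk_lines` are assumptions for illustration.

```python
import json


def bulk_lines(action, doc_id, source=None, index="my_index"):
    """Return the NDJSON lines for one bulk action."""
    lines = [json.dumps({action: {"_index": index, "_id": doc_id}})]
    if action in ("index", "create"):
        lines.append(json.dumps(source))           # full document follows
    elif action == "update":
        lines.append(json.dumps({"doc": source}))  # partial document wrapped in "doc"
    elif action != "delete":                       # delete has no second line
        raise ValueError(f"unsupported bulk action: {action}")
    return lines


lines = (
    bulk_lines("index", "1", {"field": "value1"})
    + bulk_lines("create", "2", {"field": "value2"})
    + bulk_lines("update", "1", {"field": "changed"})
    + bulk_lines("delete", "2")
)
body = "\n".join(lines) + "\n"
print(body)
```

Note that `delete` is the only action without a following data line, and `update` wraps its partial document in a `doc` key rather than sending the document directly.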
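from elasticsearch import Elasticsearch, helpers

# Assumes a local node without security; adjust the URL and credentials for your cluster.
es = Elasticsearch("http://localhost:9200")
docs = [
    {"_index": "test", "_id": "1", "field": "value1"},
    # Rejected if "field" is mapped as text or keyword: an object is not a string.
    {"_index": "test", "_id": "2", "field": {"nested": 1}},
]
# raise_on_error=False collects failures instead of raising BulkIndexError.
response = helpers.bulk(es, docs, raise_on_error=False)
print(response)
Solution
Step 1: Understand helpers.bulk behavior
helpers.bulk returns a tuple: (success_count, errors_list). By default it raises BulkIndexError when any document fails; with raise_on_error=False it keeps going and returns the failures in the list.
Step 2: Analyze the documents
The first doc is valid; the second has a mapping error (wrong type). So there is one success and one error. The error entry shown in the answer is abbreviated: a real entry also carries the status and a nested error object, and the exact error type can vary by Elasticsearch version.
Final Answer:
(1, [{"index": {"_id": "2", "error": "mapper_parsing_exception"}}]) -> Option D
Quick Check:
One success, one mapping error = (1, [{"index": {"_id": "2", "error": "mapper_parsing_exception"}}]) [OK]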
- Assuming bulk stops on first error
- Expecting a Python exception instead of error info
- Misreading success count as total docs
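Whichever client is used, the failures ultimately come from the `items` array of the bulk response, where each entry reports a status and, on failure, an `error` object. The sketch below pulls the failed items out of such a response. The `sample_response` is a hand-written, abbreviated example (real responses carry more fields), and `failed_items` is a hypothetical helper.

```python
def failed_items(response):
    """Return (action, _id, error type) for every item that carries an error."""
    failures = []
    for item in response.get("items", []):
        # Each item has a single key naming the action: index, create, update, or delete.
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result.get("_id"), result["error"].get("type")))
    return failures


sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception", "reason": "..."}}},
    ],
}
print(failed_items(sample_response))
```

Checking the top-level `errors` flag first is a cheap way to skip this scan when every item succeeded.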
{ "index": { "_index": "myindex", "_id": "1" }
{ "field": "value" }
Solution
Step 1: Check bulk request format
The bulk API expects newline-delimited JSON: each action line and each data line must be a complete JSON object on its own line, and the body must end with a newline.
Step 2: Identify the error
The given request misses a newline between the two JSON objects, causing a parsing failure.
Final Answer:
Missing newline between action and data lines -> Option C
Quick Check:
Bulk lines must be separated by newlines [OK]
- Forgetting newline between JSON objects
- Adding commas between bulk lines
- Confusing index and create actions
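A quick client-side check can catch these formatting mistakes before a request is sent. The following is a minimal sketch using only the standard library; `check_bulk_body` is a hypothetical helper and does not reproduce how Elasticsearch itself parses the body.

```python
import json


def check_bulk_body(body):
    """Return a list of format problems found in an NDJSON bulk body."""
    problems = []
    if not body.endswith("\n"):
        problems.append("body does not end with a newline")
    for number, line in enumerate(body.splitlines(), start=1):
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            parsed = None
        # Every line must hold exactly one JSON object.
        if not isinstance(parsed, dict):
            problems.append(f"line {number} is not a single valid JSON object")
    return problems


good = '{"index": {"_index": "myindex", "_id": "1"}}\n{"field": "value"}\n'
bad = '{"index": {"_index": "myindex", "_id": "1"}}{"field": "value"}\n'
print(check_bulk_body(good))
print(check_bulk_body(bad))
```

Two objects fused on one line fail to parse as a single JSON value, which is why the second body is flagged.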
Solution
Step 1: Consider bulk request size
Very large bulk requests (like 10,000 docs) can cause memory or timeout issues.
Step 2: Choose batch size and error handling
Splitting into moderate batches (e.g., 500) balances speed and resource use. Checking errors after each batch ensures reliability.
Final Answer:
Split documents into batches of 500, send each batch, and check for errors after each batch. -> Option A
Quick Check:
Batching + error check = optimal bulk indexing [OK]
- Sending too large batches causing failures
- Ignoring errors during bulk indexing
- Sending very small batches losing speed benefits
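The batching strategy from this solution can be sketched as follows. To keep the example runnable without a cluster, the actual bulk call is passed in as a function; `index_in_batches`, `chunked`, and the stand-in `fake_send` are hypothetical names for illustration only.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_in_batches(docs, send_batch, batch_size=500):
    """Send docs batch by batch and collect the errors reported for each batch."""
    total_ok, all_errors = 0, []
    for batch in chunked(docs, batch_size):
        ok_count, errors = send_batch(batch)  # check the result of every batch
        total_ok += ok_count
        all_errors.extend(errors)
    return total_ok, all_errors


def fake_send(batch):
    """Stand-in for a real bulk call: rejects documents that lack a 'field' key."""
    errors = [doc for doc in batch if "field" not in doc]
    return len(batch) - len(errors), errors


docs = [{"field": f"value{i}"} for i in range(1200)] + [{"oops": 1}]
print(index_in_batches(docs, fake_send))
```

With the official Python client, `send_batch` could wrap a call such as `helpers.bulk(es, batch, raise_on_error=False)`, which returns a comparable (success_count, errors) tuple. That helper also accepts its own `chunk_size` argument, so manual chunking is mainly useful when you want to inspect or retry each batch yourself.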
