
Scroll API for deep pagination in Elasticsearch - Deep Dive

Overview - Scroll API for deep pagination
What is it?
The Scroll API in Elasticsearch retrieves large sets of search results by returning them in successive batches instead of all at once. It preserves a snapshot of the index as it was when the first request ran, so you can step through every matching document without skipping or repeating results, even if the index changes in the meantime.
Why it matters
Without the Scroll API, fetching large amounts of data would be slow and resource-heavy, often causing timeouts or incomplete results. This would make it hard to analyze or process big datasets in Elasticsearch. The Scroll API solves this by allowing deep pagination safely and efficiently, making large data retrieval practical and reliable.
Where it fits
Before learning the Scroll API, you should understand basic Elasticsearch search queries and simple pagination using from and size parameters. After mastering the Scroll API, you can explore alternatives like the Search After API and Point In Time (PIT) for more advanced or real-time use cases.
Mental Model
Core Idea
The Scroll API creates a stable snapshot of search results and lets you fetch them in small batches to handle deep pagination efficiently.
Think of it like...
Imagine reading a very long book but only carrying a small backpack. Instead of taking the whole book at once, you take a bookmark and read a few pages at a time, then come back later to continue exactly where you left off without losing your place.
┌───────────────┐
│ Initial Query │
└──────┬────────┘
       │
       ▼
┌─────────────────────────────┐
│ Scroll Context Created       │
│ (Snapshot of results)       │
└──────┬──────────────────────┘
       │
       ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ Batch 1       │→→│ Batch 2       │→→│ Batch 3       │→→ ...
│ (First scroll)│   │ (Next scroll) │   │ (Next scroll) │
└───────────────┘   └───────────────┘   └───────────────┘
Build-Up - 7 Steps
1
Foundation: Basic Elasticsearch Search Query
Concept: Learn how to perform a simple search query in Elasticsearch to retrieve documents.
A basic search query uses the _search endpoint with a query body. For example, to find documents whose field matches a value:
    POST /my_index/_search
    {
      "query": { "match": { "field": "value" } }
    }
This returns a limited number of results (10 by default).
Result
You get the top 10 matching documents from the index, ranked by relevance score.
Understanding simple search queries is essential because the Scroll API builds on this by extending how results are retrieved.
2
Foundation: Limitations of Simple Pagination
Concept: Understand why using from and size parameters for pagination is inefficient for deep pages.
Pagination with from and size looks like this:
    GET /my_index/_search
    {
      "from": 1000,
      "size": 10,
      "query": { "match_all": {} }
    }
But as from increases, every shard must collect and rank from + size hits (1,010 here) just to return 10, so query time and memory use grow with page depth.
Result
Deep pages take longer to load and can time out. By default, requests where from + size exceeds 10,000 (the index.max_result_window setting) are rejected outright.
Knowing these limits motivates the need for a better method like the Scroll API for deep pagination.
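To put rough numbers on this cost, here is a small Python sketch. It assumes the usual behaviour in which every shard ranks its own top from + size hits before the coordinating node merges them. The function name and the shard count are illustrative and not part of any Elasticsearch API.

```python
from typing import Dict


def hits_ranked_for_page(from_: int, size: int, shards: int) -> Dict[str, int]:
    """Estimate how many hits are ranked to serve one from/size page.

    Assumption: each shard builds a queue of (from_ + size) candidates and
    the coordinating node merges one such queue per shard.
    """
    per_shard = from_ + size
    return {
        "per_shard": per_shard,             # hits ranked on every shard
        "coordinator": per_shard * shards,  # hits merged on the coordinating node
        "returned": size,                   # hits the client actually receives
    }
```

For from=1000 and size=10 on five shards, this gives 1,010 hits per shard and 5,050 at the coordinator, all to return a page of 10.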
3
Intermediate: Creating a Scroll Context
Before reading on: do you think the Scroll API fetches all data at once or keeps a snapshot to fetch in parts? Commit to your answer.
Concept: Learn how to start a scroll by creating a scroll context that holds a snapshot of the search results.
You start scrolling by adding a scroll parameter to your search request:
    POST /my_index/_search?scroll=1m
    {
      "size": 100,
      "query": { "match_all": {} }
    }
Here scroll=1m asks Elasticsearch to keep the search context alive for one minute. The response contains the first batch of up to 100 results and a _scroll_id to fetch the next batch.
Result
You receive the first 100 documents and a _scroll_id token to continue scrolling.
Understanding that the scroll context is a snapshot prevents confusion about data changing during scrolling.
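As a minimal sketch, the same initial request can be described in Python as plain data before handing it to whatever HTTP client you use. The helper names below are hypothetical. Only the path, the scroll parameter, the request body, and the _scroll_id and hits.hits response fields come from the Elasticsearch API.

```python
import json
from typing import Any, Dict, List, Tuple


def build_scroll_start(index: str, keep_alive: str = "1m", size: int = 100) -> Dict[str, Any]:
    """Describe the initial scroll request; nothing is sent over the network."""
    return {
        "method": "POST",
        "path": f"/{index}/_search?scroll={keep_alive}",
        "body": {"size": size, "query": {"match_all": {}}},
    }


def render_console(request: Dict[str, Any]) -> str:
    """Render the request in the console style used in this tutorial."""
    body = json.dumps(request["body"], indent=2)
    return f'{request["method"]} {request["path"]}\n{body}'


def read_first_page(response: Dict[str, Any]) -> Tuple[str, List[Dict[str, Any]]]:
    """Pull the scroll ID and the first batch of hits out of a parsed response."""
    return response["_scroll_id"], response["hits"]["hits"]
```

Keeping the request as data makes it easy to log or inspect before sending it.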
4
Intermediate: Fetching Next Batches with Scroll ID
Before reading on: do you think the scroll_id changes after each batch or stays the same? Commit to your answer.
Concept: Learn how to use the scroll_id to retrieve subsequent batches of results.
Use the _scroll_id from the previous response to get the next batch:
    POST /_search/scroll
    {
      "scroll": "1m",
      "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAA..."
    }
Each call returns the next batch along with a _scroll_id. The ID may or may not change between calls, so always pass the one from the most recent response.
Result
You get the next set of documents and a _scroll_id for the following request. Repeat until a response comes back with no hits.
Always carrying forward the latest _scroll_id keeps the scrolling process correct, whether or not the value changes.
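The fetch loop can be sketched in Python as a generator. This is an illustration under stated assumptions rather than a client library: send(method, path, body) is a hypothetical transport function that you supply and that returns the parsed JSON response. The loop stops when a batch comes back empty and always forwards the most recently returned _scroll_id.

```python
from typing import Any, Callable, Dict, Iterator

# Hypothetical transport signature: send(method, path, body) -> parsed JSON dict.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_hits(send: Send, index: str, keep_alive: str = "1m",
                size: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield every hit, one batch at a time, until a batch is empty."""
    page = send(
        "POST",
        f"/{index}/_search?scroll={keep_alive}",
        {"size": size, "query": {"match_all": {}}},
    )
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Always use the ID from the latest response, never an older one.
        page = send(
            "POST",
            "/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )
```

This sketch does not clear the scroll context when it finishes; the context is left to expire after the keep-alive period.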
5
Intermediate: Clearing Scroll Contexts
Concept: Learn how to clear scroll contexts to free resources when done scrolling.
After finishing scrolling, clear the scroll context:
    DELETE /_search/scroll
    {
      "scroll_id": ["DXF1ZXJ5QW5kRmV0Y2gBAAAAAAA..."]
    }
This releases memory and resources on the Elasticsearch server.
Result
Scroll context is removed, preventing resource leaks.
Understanding resource cleanup avoids performance issues in production.
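One way to make sure cleanup happens even when processing fails halfway is a try/finally block. The Python sketch below assumes a hypothetical send(method, path, body) transport that returns parsed JSON. The export_all name is illustrative too.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical transport signature: send(method, path, body) -> parsed JSON dict.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def export_all(send: Send, index: str, keep_alive: str = "1m",
               size: int = 100) -> List[Dict[str, Any]]:
    """Collect every hit, then clear the scroll context whatever happens."""
    scroll_id: Optional[str] = None
    collected: List[Dict[str, Any]] = []
    try:
        page = send(
            "POST",
            f"/{index}/_search?scroll={keep_alive}",
            {"size": size, "query": {"match_all": {}}},
        )
        scroll_id = page.get("_scroll_id")
        while page["hits"]["hits"]:
            collected.extend(page["hits"]["hits"])
            page = send(
                "POST",
                "/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Runs on success and on error, so the context is not left to time out.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": [scroll_id]})
    return collected
```

Collecting everything into a list keeps the sketch short. A real export of a large index would more likely write each batch out as it arrives.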
6
Advanced: Scroll API vs Search After for Pagination
Before reading on: do you think Scroll API is better for real-time data or for static snapshots? Commit to your answer.
Concept: Compare Scroll API with Search After to understand when to use each for pagination.
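The Scroll API keeps a snapshot and is good for exporting large static datasets. Search After paginates using the sort values of the last hit on the previous page and is better for real-time data where results may change. The Scroll API can be slower and ties up cluster resources for as long as its context stays open. Elastic's current documentation recommends search_after with a point in time (PIT), rather than scroll, for paging past 10,000 hits.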
Result
You know which pagination method fits your use case best.
Knowing the tradeoffs helps choose the right tool for performance and data freshness.
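For comparison, here is a hedged Python sketch of how a Search After request for the next page is derived from the previous one. It relies on two documented behaviours: the request must carry an explicit sort, and each hit echoes its sort values in a sort array. The helper name and the field names used in the usage example are hypothetical.

```python
from typing import Any, Dict


def next_page_body(previous_body: Dict[str, Any], last_hit: Dict[str, Any]) -> Dict[str, Any]:
    """Build the body for the next page from the last hit of the current page."""
    if "sort" not in previous_body:
        raise ValueError("search_after needs an explicit sort in the request body")
    if "sort" not in last_hit:
        raise ValueError("the last hit carries no sort values")
    body = dict(previous_body)   # shallow copy, so the caller's body is untouched
    body.pop("from", None)       # search_after replaces offset-based paging
    body["search_after"] = list(last_hit["sort"])
    return body
```

For example, with a body sorted on a hypothetical timestamp field plus an event_id tiebreaker, passing the last hit of page one yields the body for page two. No server-side context is involved, which is why this approach suits data that keeps changing.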
7
Expert: Scroll API Internals and Performance Implications
Before reading on: do you think the Scroll API re-executes the query on each scroll request? Commit to your answer.
Concept: Understand how Elasticsearch maintains the scroll context internally and its impact on cluster resources.
When you start a scroll, Elasticsearch opens a search context that pins the index segments as they existed at that moment. Background segment merges continue, but the old segments a scroll still references cannot be deleted until the context is cleared or expires. Each scroll request resumes within this frozen view from where the previous batch ended, rather than starting the search over. Keeping many scroll contexts open therefore consumes heap memory, disk space, and file handles, so they should be cleared promptly.
Result
You understand why scrolls are consistent but resource-intensive.
Understanding the snapshot mechanism explains why scrolls are stable but must be managed carefully to avoid cluster strain.
Under the Hood
The Scroll API works by creating a point-in-time view of the index data at the moment the scroll starts. This view is frozen, so even if documents change, the scroll sees a consistent set. Elasticsearch keeps the view alive by holding on to the segments it references: merges still run in the background, but the old segments are not deleted while a scroll context uses them. Each scroll request uses the scroll_id to fetch the next batch from this view, continuing where the previous batch ended, which keeps retrieval stable.
Why designed this way?
The Scroll API was designed to solve the problem of deep pagination in large datasets where normal pagination is inefficient. By using a snapshot, it avoids the cost of re-running queries and skipping documents repeatedly. Alternatives like simple from/size pagination were too slow for large offsets. The tradeoff is resource usage, so scroll contexts have expiration times and must be cleared to free resources.
┌───────────────┐
│ User Query    │
└──────┬────────┘
       │
       ▼
┌─────────────────────────────┐
│ Elasticsearch Snapshot       │
│ (Point-in-time view)         │
└──────┬──────────────────────┘
       │
       ▼
┌───────────────┐   ┌───────────────┐
│ Scroll Batch 1│→→│ Scroll Batch 2│→→ ...
└───────────────┘   └───────────────┘
       │
       ▼
┌─────────────────────────────┐
│ Scroll Context Maintained    │
│ (Keeps old segments on disk) │
└─────────────────────────────┘
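The frozen-view behaviour can be illustrated with a toy in-memory model in Python. This is only an analogy: ToyIndex is a hypothetical class, and copying a list stands in for the way a real scroll context holds on to index segments.

```python
from typing import Iterator, List


class ToyIndex:
    """Toy stand-in for an index; not how Elasticsearch stores documents."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = list(docs)

    def add(self, doc: str) -> None:
        self._docs.append(doc)

    def open_scroll(self, size: int) -> Iterator[List[str]]:
        # Capture the view now, at open time, not when the first batch is read.
        frozen = tuple(self._docs)

        def batches() -> Iterator[List[str]]:
            for start in range(0, len(frozen), size):
                yield list(frozen[start:start + size])

        return batches()
```

A document added after open_scroll is called does not appear in that scroll's batches, but it does appear in any scroll opened afterwards.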
Myth Busters - 4 Common Misconceptions
Quick: Does the Scroll API reflect real-time changes to data during scrolling? Commit yes or no.
Common Belief: The Scroll API always shows the latest data changes as you scroll.
Reality: The Scroll API shows a fixed snapshot of data from when the scroll started, ignoring changes made after.
Why it matters: Expecting live updates can cause confusion and incorrect assumptions about data freshness.
Quick: Can you keep a scroll context open indefinitely without issues? Commit yes or no.
Common Belief: Scroll contexts can be kept open as long as needed without affecting performance.
Reality: Scroll contexts consume cluster resources and should be cleared promptly to avoid memory and file handle exhaustion.
Why it matters: Leaving scrolls open too long can degrade cluster performance or cause failures.
Quick: Does the scroll_id remain the same throughout the entire scroll session? Commit yes or no.
Common Belief: The scroll_id returned at the start is used for all subsequent scroll requests unchanged.
Reality: Every scroll response includes a _scroll_id, and it may differ from the previous one. Only the most recently returned ID should be used for the next request.
Why it matters: Reusing a stale scroll_id can cause errors or break the scrolling process.
Quick: Is the Scroll API suitable for real-time user-facing pagination? Commit yes or no.
Common Belief: Scroll API is ideal for all pagination needs, including real-time user interfaces.
Reality: Scroll API is designed for batch processing and exports, not for real-time user pagination where Search After or PIT are better.
Why it matters: Using Scroll API for real-time UI can cause stale data and poor user experience.
Expert Zone
1
An open scroll context stops Elasticsearch from deleting old segments after they have been merged, which can temporarily increase disk space and file handle usage.
2
The scroll timeout resets with each scroll request, so frequent requests keep the context alive longer.
3
Sorting adds cost to every batch. If the order of results does not matter, sort by _doc, which scroll requests are optimized for.
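The keep-alive renewal described in the second point can be sketched as a toy Python model. It is not Elasticsearch's implementation: times are plain seconds supplied by the caller, and ToyScrollContext is a hypothetical name. The point it illustrates is that the keep-alive only has to cover the time needed to process one batch, because every fetch sets a new expiry.

```python
class ToyScrollContext:
    """Toy model of scroll keep-alive renewal; all times are in seconds."""

    def __init__(self, now: float, keep_alive: float) -> None:
        self.expires_at = now + keep_alive

    def fetch(self, now: float, keep_alive: float) -> bool:
        """Return False if the context has already expired, else renew it."""
        if now > self.expires_at:
            return False
        # Each successful fetch replaces the expiry instead of adding to it.
        self.expires_at = now + keep_alive
        return True
```

With a 60-second keep-alive, a context opened at t=0 and fetched at t=50 is extended to t=110, while a gap longer than 60 seconds between fetches loses the context.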
When NOT to use
Avoid Scroll API for real-time or frequently updated data views; use Search After or Point In Time (PIT) queries instead. Also, do not use Scroll API for small result sets where simple pagination suffices.
Production Patterns
In production, Scroll API is commonly used for exporting large datasets, reindexing data, or batch processing jobs. It is paired with careful scroll context management and resource monitoring to avoid cluster strain.
Connections
Cursor-based Pagination
Scroll API is a form of cursor-based pagination used in databases and APIs.
Understanding cursor-based pagination in APIs helps grasp how Scroll API maintains position without skipping data.
Snapshot Isolation in Databases
Scroll API uses a snapshot of data similar to snapshot isolation in databases to provide consistent reads.
Knowing snapshot isolation explains why Scroll API results remain stable despite concurrent data changes.
Streaming Data Processing
Scroll API enables streaming large datasets in batches, similar to streaming processing in big data systems.
Recognizing this connection helps appreciate how Scroll API supports scalable data workflows.
Common Pitfalls
#1 Using from and size for deep pagination on large datasets.
Wrong approach: GET /my_index/_search { "from": 10000, "size": 10, "query": { "match_all": {} } }
Correct approach: POST /my_index/_search?scroll=1m { "size": 100, "query": { "match_all": {} } }
Root cause: Not realizing that from/size pagination gets slower as the offset grows. This particular request would also be rejected by default, because from + size exceeds index.max_result_window (10,000).
#2 Not using the most recent scroll_id for subsequent scroll requests.
Wrong approach: POST /_search/scroll { "scroll": "1m", "scroll_id": "old_scroll_id" }
Correct approach: POST /_search/scroll { "scroll": "1m", "scroll_id": "new_scroll_id_from_last_response" }
Root cause: Assuming the first scroll_id stays valid for the whole session, when only the most recently returned one should be used.
#3 Leaving scroll contexts open indefinitely without clearing.
Wrong approach: Never calling DELETE /_search/scroll after finishing scrolling.
Correct approach: DELETE /_search/scroll { "scroll_id" : ["scroll_id_to_clear"] }
Root cause: Not understanding resource consumption and cleanup requirements of scroll contexts.
Key Takeaways
The Scroll API provides a way to retrieve large search results in batches by creating a stable snapshot of data.
It is designed to solve the inefficiency of deep pagination using from and size parameters in Elasticsearch.
Each scroll response returns a _scroll_id, which may change between requests; always use the most recent one to fetch the next batch.
Scroll contexts consume cluster resources and must be cleared promptly to avoid performance issues.
Scroll API is best suited for batch processing and exports, not for real-time user-facing pagination.