Scroll API for deep pagination in Elasticsearch - Deep Dive

Overview - Scroll API for deep pagination
What is it?
The Scroll API in Elasticsearch retrieves large sets of search results by splitting them into smaller batches. The first request creates a search context that preserves a snapshot of the index as it was at that moment, and each following request returns the next batch from that snapshot. This lets you walk through many results step-by-step without missing or repeating documents, instead of fetching everything at once.
Why it matters
Without the Scroll API, pulling a large result set out of Elasticsearch is slow and resource-heavy: a single huge request can time out, and by default any request where from + size exceeds 10,000 (the index.max_result_window setting) is rejected. That makes it hard to analyze or process big datasets. The Scroll API addresses this by letting you page through all matching documents in batches against a consistent view of the data.
Where it fits
Before learning the Scroll API, you should understand basic Elasticsearch search queries and simple pagination using from and size parameters. After mastering the Scroll API, you can explore alternatives like the Search After API and Point In Time (PIT) for more advanced or real-time use cases.
Mental Model
Core Idea
The Scroll API creates a stable snapshot of search results and lets you fetch them in small batches to handle deep pagination efficiently.
Think of it like...
Imagine reading a very long book but only carrying a small backpack. Instead of taking the whole book at once, you take a bookmark and read a few pages at a time, then come back later to continue exactly where you left off without losing your place.
┌───────────────┐
│ Initial Query │
└──────┬────────┘
       │
       ▼
┌─────────────────────────────┐
│ Scroll Context Created      │
│ (Snapshot of results)       │
└──────┬──────────────────────┘
       │
       ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ Batch 1       │ → │ Batch 2       │ → │ Batch 3       │ → ...
│ (First scroll)│   │ (Next scroll) │   │ (Next scroll) │
└───────────────┘   └───────────────┘   └───────────────┘
Build-Up - 7 Steps
1
Foundation: Basic Elasticsearch Search Query
Concept: Learn how to perform a simple search query in Elasticsearch to retrieve documents.
A basic search query uses the _search endpoint with a query body. For example, to find documents matching a term:
    POST /my_index/_search
    {
      "query": { "match": { "field": "value" } }
    }
This returns a limited number of results (default 10).
Result
You get the first 10 matching documents from the index.
Understanding simple search queries is essential because the Scroll API builds on this by extending how results are retrieved.
2
Foundation: Limitations of Simple Pagination
Concept: Understand why using from and size parameters for pagination is inefficient for deep pages.
Pagination with from and size looks like this:
    GET /my_index/_search
    {
      "from": 1000,
      "size": 10,
      "query": { "match_all": {} }
    }
But as from increases, every shard has to collect and sort from + size hits before the first from of them are thrown away, causing slow queries and high memory use.
Result
Deep pages take longer to load and can cause timeouts or errors.
Knowing these limits motivates the need for a better method like the Scroll API for deep pagination.
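To make that cost concrete, here is a rough sketch of the arithmetic. It assumes the commonly described model in which each shard ranks the top from + size hits and the coordinating node merges the results from all shards; the function name and the five-shard index are illustrative assumptions, not part of any Elasticsearch API.

```python
def entries_collected(from_: int, size: int, shards: int) -> dict:
    """Rough count of sorted entries handled to serve one from/size page."""
    per_shard = from_ + size           # each shard must rank this many hits
    coordinator = per_shard * shards   # the coordinating node merges all of them
    return {"per_shard": per_shard, "coordinator": coordinator, "returned": size}


if __name__ == "__main__":
    # Offset 1000 on a hypothetical 5-shard index: thousands of entries for 10 results.
    print(entries_collected(1000, 10, 5))
```

Under this model the work grows with the offset while the number of returned documents stays fixed. By default Elasticsearch also rejects requests where from + size exceeds 10,000 (the index.max_result_window setting).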
3
Intermediate: Creating a Scroll Context
🤔 Before reading on: do you think the Scroll API fetches all data at once or keeps a snapshot to fetch in parts? Commit to your answer.
Concept: Learn how to start a scroll by creating a scroll context that holds a snapshot of the search results.
You start scrolling by adding a scroll parameter to your search request:
    POST /my_index/_search?scroll=1m
    {
      "size": 100,
      "query": { "match_all": {} }
    }
This returns the first batch of 100 results and a scroll_id to fetch the next batch.
Result
You receive the first 100 documents and a scroll_id token to continue scrolling.
Understanding that the scroll context is a snapshot prevents confusion about data changing during scrolling.
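As a minimal sketch, the initial request can be assembled with the Python standard library alone. The helper names are hypothetical and nothing is sent over the network; the point is only that the keep-alive travels in the query string, the batch size in the body, and the scroll ID comes back in the _scroll_id field of the response.

```python
import json


def initial_scroll_request(index: str, keep_alive: str = "1m", size: int = 100) -> tuple:
    """Return (method, path, body) for the first request of a scroll."""
    path = f"/{index}/_search?scroll={keep_alive}"   # keep-alive goes in the query string
    body = {"size": size, "query": {"match_all": {}}}
    return "POST", path, json.dumps(body)


def read_first_page(response: dict) -> tuple:
    """Pull the scroll ID and the first batch of hits out of a parsed response."""
    return response["_scroll_id"], response["hits"]["hits"]
```

The tuple returned by initial_scroll_request can be handed to whichever HTTP client you already use.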
4
Intermediate: Fetching Next Batches with Scroll ID
🤔 Before reading on: do you think the scroll_id changes after each batch or stays the same? Commit to your answer.
Concept: Learn how to use the scroll_id to retrieve subsequent batches of results.
Use the scroll_id from the previous response to get the next batch:
    POST /_search/scroll
    {
      "scroll": "1m",
      "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAA..."
    }
Each call returns the next batch together with a _scroll_id. The ID may change between requests, so always send the one from the most recent response.
Result
You get the next set of documents and the scroll_id to use for the following request, until a response comes back with no hits.
Knowing that the scroll_id can change between requests helps you manage the scrolling process correctly.
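The loop described in this step might look like the sketch below. It assumes a hypothetical send(method, path, body) function that performs the HTTP call and returns the parsed JSON response; that function is a stand-in for your own client, not a real library API.

```python
from typing import Callable, Iterator

# Hypothetical transport: send(method, path, body) -> parsed JSON response.
Send = Callable[[str, str, dict], dict]


def scroll_hits(send: Send, index: str, query: dict,
                keep_alive: str = "1m", size: int = 100) -> Iterator[dict]:
    """Yield every hit, always passing on the most recent scroll ID."""
    page = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": size, "query": query})
    while page["hits"]["hits"]:            # an empty batch means the scroll is done
        yield from page["hits"]["hits"]
        page = send("POST", "/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": page["_scroll_id"]})
```

Because the function is a generator, the caller can process hits one at a time without holding the whole result set in memory.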
5
Intermediate: Clearing Scroll Contexts
Concept: Learn how to clear scroll contexts to free resources when done scrolling.
After finishing scrolling, clear the scroll context:
    DELETE /_search/scroll
    {
      "scroll_id": ["DXF1ZXJ5QW5kRmV0Y2gBAAAAAAA..."]
    }
This releases memory and resources on the Elasticsearch server.
Result
Scroll context is removed, preventing resource leaks.
Understanding resource cleanup avoids performance issues in production.
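One way to make cleanup hard to forget is to wrap the scroll in try/finally, so the context is cleared even when processing fails partway through. This is a sketch under the same assumption of a hypothetical send(method, path, body) transport that returns parsed JSON; handle is whatever callback you use to process a hit.

```python
from typing import Callable

Send = Callable[[str, str, dict], dict]  # hypothetical transport, not a real client


def export_with_cleanup(send: Send, index: str, handle: Callable[[dict], None],
                        keep_alive: str = "1m", size: int = 100) -> int:
    """Process every hit, then clear the scroll context even if `handle` raises."""
    page = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": size, "query": {"match_all": {}}})
    scroll_id = page.get("_scroll_id")
    count = 0
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                handle(hit)
                count += 1
            page = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = page.get("_scroll_id", scroll_id)  # keep the latest ID
    finally:
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": [scroll_id]})
    return count
```

The keep-alive still acts as a safety net if the process dies before the finally block runs, which is a reason to keep it short.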
6
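Advanced: Scroll API vs Search After for Pagination
🤔 Before reading on: do you think Scroll API is better for real-time data or for static snapshots? Commit to your answer.
Concept: Compare Scroll API with Search After to understand when to use each for pagination.
The Scroll API keeps a snapshot and is good for exporting large static datasets. Search After uses the sort values of the last hit on a page to fetch the next page, so it holds no server-side context and suits live data where results may change; combined with a Point In Time (PIT) it can also page over a consistent view. Current Elasticsearch documentation recommends search_after with a PIT over scroll for deep pagination. A scroll context, by contrast, consumes cluster resources for as long as it stays open.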
Result
You know which pagination method fits your use case best.
Knowing the tradeoffs helps choose the right tool for performance and data freshness.
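For comparison, here is a small sketch of how a Search After request body is built. Each hit in a sorted search response carries a sort array, and the next page is requested by passing the last hit's array as search_after. The helper name and the field names in the usage example are hypothetical.

```python
from typing import Optional


def search_after_body(sort: list, size: int, last_hit: Optional[dict] = None) -> dict:
    """Build the body for the next page; `last_hit` is the final hit of the previous page."""
    body = {"size": size, "query": {"match_all": {}}, "sort": sort}
    if last_hit is not None:
        body["search_after"] = last_hit["sort"]   # sort values returned with each hit
    return body


if __name__ == "__main__":
    # Hypothetical fields; the second sort key acts as a tiebreaker.
    sort = [{"timestamp": "asc"}, {"order_id": "asc"}]
    print(search_after_body(sort, 100))
    print(search_after_body(sort, 100, {"_id": "9", "sort": [1700000000, "A-9"]}))
```

Unlike a scroll, nothing has to be cleared afterwards, because the position lives entirely in the request.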
7
Expert: Scroll API Internals and Performance Implications
🤔 Before reading on: do you think the Scroll API re-executes the query on each scroll request? Commit to your answer.
Concept: Understand how Elasticsearch maintains the scroll context internally and its impact on cluster resources.
When you start a scroll, Elasticsearch creates a search context that preserves a point-in-time view of the index. Segment merges continue in the background, but the open context prevents the old segments it reads from being deleted, so the results stay consistent. Each scroll request fetches the next batch from this view without starting the query over. Keeping many scroll contexts open therefore consumes heap memory, disk space, and file handles, so they should be cleared promptly.
Result
You understand why scrolls are consistent but resource-intensive.
Understanding the snapshot mechanism explains why scrolls are stable but must be managed carefully to avoid cluster strain.
Under the Hood
The Scroll API works by creating a search context that holds a point-in-time view of the index at the moment the scroll starts. This view is frozen, so even if documents change, the scroll sees a consistent set. Background segment merges still run during scrolling, but the open context keeps the older segments from being deleted while they are in use, which costs extra disk space and file handles. Each scroll request uses the scroll_id to fetch the next batch from this view without starting the query over, which keeps retrieval stable and efficient.
Why designed this way?
The Scroll API was designed to solve the problem of deep pagination in large datasets where normal pagination is inefficient. By using a snapshot, it avoids the cost of re-running queries and skipping documents repeatedly. Alternatives like simple from/size pagination were too slow for large offsets. The tradeoff is resource usage, so scroll contexts have expiration times and must be cleared to free resources.
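The snapshot behavior can be illustrated with a toy model. This is not how Elasticsearch is implemented; it only mimics the idea described in the Elasticsearch documentation that merges keep running in the background while an open search context holds on to the older segments it started with. All class and function names here are made up for the illustration.

```python
class ToyIndex:
    """Toy model only: segments are immutable tuples of documents."""

    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        self.segments.append(tuple(docs))

    def merge(self):
        # Merging writes one new segment; the old ones leave the live list.
        merged = tuple(doc for seg in self.segments for doc in seg)
        self.segments = [merged]

    def open_scroll(self):
        # The scroll context keeps its own references to the current segments.
        return list(self.segments)


def read_all(segments):
    """Flatten a list of segments into a list of documents."""
    return [doc for seg in segments for doc in seg]


if __name__ == "__main__":
    index = ToyIndex()
    index.add_segment(["a", "b"])
    index.add_segment(["c"])
    context = index.open_scroll()
    index.add_segment(["d"])       # written after the scroll started
    index.merge()                  # merges still happen
    print(read_all(context))       # the scroll's view
    print(read_all(index.segments))  # the live view
```

In the toy model the old segments survive only because the context still references them, which mirrors why open scroll contexts cost disk space and file handles.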
┌───────────────┐
│ User Query    │
└──────┬────────┘
       │
       ▼
┌─────────────────────────────┐
│ Elasticsearch Snapshot      │
│ (Point-in-time view)        │
└──────┬──────────────────────┘
       │
       ▼
┌───────────────┐   ┌───────────────┐
│ Scroll Batch 1│ → │ Scroll Batch 2│ → ...
└───────────────┘   └───────────────┘
       │
       ▼
┌─────────────────────────────┐
│ Scroll Context Maintained   │
│ (Keeps old segments alive)  │
└─────────────────────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does the Scroll API reflect real-time changes to data during scrolling? Commit yes or no.
Common Belief: The Scroll API always shows the latest data changes as you scroll.
Reality: The Scroll API shows a fixed snapshot of data from when the scroll started, ignoring changes made after.
Why it matters: Expecting live updates can cause confusion and incorrect assumptions about data freshness.
Quick: Can you keep a scroll context open indefinitely without issues? Commit yes or no.
Common Belief: Scroll contexts can be kept open as long as needed without affecting performance.
Reality: Scroll contexts consume cluster resources and should be cleared promptly to avoid memory and file handle exhaustion.
Why it matters: Leaving scrolls open too long can degrade cluster performance or cause failures.
Quick: Does the scroll_id remain the same throughout the entire scroll session? Commit yes or no.
Common Belief: The scroll_id returned at the start is used for all subsequent scroll requests unchanged.
Reality: The scroll_id may change between requests, so always pass the _scroll_id from the most recent response.
Why it matters: A stale scroll_id is not guaranteed to keep working, and relying on it can cause failed requests or a broken scroll.
Quick: Is the Scroll API suitable for real-time user-facing pagination? Commit yes or no.
Common Belief: Scroll API is ideal for all pagination needs, including real-time user interfaces.
Reality: Scroll API is designed for batch processing and exports, not for real-time user pagination where Search After or PIT are better.
Why it matters: Using Scroll API for real-time UI can cause stale data and poor user experience.
Expert Zone
1
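An open scroll context does not stop segment merges, but it keeps the old, merged-away segments from being deleted while it still reads them, which temporarily increases disk space and file handle usage.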
2
Each scroll request sets a new expiry time from the scroll value it passes, so the keep-alive only needs to cover the processing of one batch, not the whole scroll.
3
Scroll requests are optimized for the _doc sort order; sorting on other fields adds cost, so sort by _doc when the order of results does not matter.
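The keep-alive behavior in the second point can be sketched as simple arithmetic. This is a simplified model that assumes the context expires exactly one keep-alive after the most recent request; in a real cluster expired contexts are reaped periodically, so the timing is not this exact. The function names are illustrative.

```python
def context_expiry(request_times: list, keep_alive_seconds: float) -> float:
    """Simplified model: expiry is the latest request time plus the keep-alive."""
    return max(request_times) + keep_alive_seconds


def is_alive(request_times: list, keep_alive_seconds: float, now: float) -> bool:
    """True if the context would still be open at time `now` under this model."""
    return now < context_expiry(request_times, keep_alive_seconds)


if __name__ == "__main__":
    # Requests at 0s, 40s and 80s with a 60s keep-alive.
    print(context_expiry([0, 40, 80], 60))
    print(is_alive([0], 60, now=130))
```

The practical consequence is that a short keep-alive is enough as long as each batch is processed and the next request is sent before it runs out.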
When NOT to use
Avoid Scroll API for real-time or frequently updated data views; use Search After or Point In Time (PIT) queries instead. Also, do not use Scroll API for small result sets where simple pagination suffices.
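As a rough sketch of the PIT alternative, the helpers below build the three requests involved: opening a point in time, searching with it, and closing it. The endpoints and body fields follow the Elasticsearch PIT and search_after APIs; the helper names, the sort field, and the PIT ID in the example are hypothetical, and nothing is sent over the network.

```python
def open_pit_request(index: str, keep_alive: str = "1m") -> tuple:
    """Request that opens a point in time; the response carries its `id`."""
    return "POST", f"/{index}/_pit?keep_alive={keep_alive}", None


def pit_page_request(pit_id: str, sort: list, size: int = 100,
                     keep_alive: str = "1m", after: list = None) -> tuple:
    """Request for one page; `after` is the `sort` array of the previous page's last hit."""
    body = {"size": size, "query": {"match_all": {}},
            "pit": {"id": pit_id, "keep_alive": keep_alive}, "sort": sort}
    if after is not None:
        body["search_after"] = after
    return "POST", "/_search", body   # no index in the path: the PIT already names it


def close_pit_request(pit_id: str) -> tuple:
    """Request that releases the point in time when paging is finished."""
    return "DELETE", "/_pit", {"id": pit_id}


if __name__ == "__main__":
    print(open_pit_request("my_index"))
    print(pit_page_request("example-pit-id", [{"timestamp": "asc"}], after=[1700000000, 42]))
    print(close_pit_request("example-pit-id"))
```

Like a scroll context, a PIT holds resources while open, so it should be closed once the last page has been read.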
Production Patterns
In production, Scroll API is commonly used for exporting large datasets, reindexing data, or batch processing jobs. It is paired with careful scroll context management and resource monitoring to avoid cluster strain.
Connections
Cursor-based Pagination
Scroll API is a form of cursor-based pagination used in databases and APIs.
Understanding cursor-based pagination in APIs helps grasp how Scroll API maintains position without skipping data.
Snapshot Isolation in Databases
Scroll API uses a snapshot of data similar to snapshot isolation in databases to provide consistent reads.
Knowing snapshot isolation explains why Scroll API results remain stable despite concurrent data changes.
Streaming Data Processing
Scroll API enables streaming large datasets in batches, similar to streaming processing in big data systems.
Recognizing this connection helps appreciate how Scroll API supports scalable data workflows.
Common Pitfalls
#1 Using from and size for deep pagination on large datasets.
Wrong approach: GET /my_index/_search { "from": 10000, "size": 10, "query": { "match_all": {} } }
Correct approach: POST /my_index/_search?scroll=1m { "size": 100, "query": { "match_all": {} } }
Root cause: Not realizing that from/size pagination becomes slow for large offsets and, by default, is rejected once from + size exceeds 10,000 (index.max_result_window).
#2 Not using the most recent scroll_id for subsequent scroll requests.
Wrong approach: POST /_search/scroll { "scroll": "1m", "scroll_id": "old_scroll_id" }
Correct approach: POST /_search/scroll { "scroll": "1m", "scroll_id": "new_scroll_id_from_last_response" }
Root cause: Assuming the scroll_id never changes, when each response may return a different one.
#3 Leaving scroll contexts open indefinitely without clearing.
Wrong approach: Never calling DELETE /_search/scroll after finishing scrolling.
Correct approach: DELETE /_search/scroll { "scroll_id": ["scroll_id_to_clear"] }
Root cause: Not understanding resource consumption and cleanup requirements of scroll contexts.
Key Takeaways
The Scroll API provides a way to retrieve large search results in batches by creating a stable snapshot of data.
It is designed to solve the inefficiency of deep pagination using from and size parameters in Elasticsearch.
Each scroll response includes a scroll_id that may differ from the previous one; always use the most recent ID for the next batch.
Scroll contexts consume cluster resources and must be cleared promptly to avoid performance issues.
Scroll API is best suited for batch processing and exports, not for real-time user-facing pagination.

Practice

1. What is the main purpose of the Scroll API in Elasticsearch?
easy
A. To retrieve large sets of search results in small, manageable batches.
B. To update documents in bulk efficiently.
C. To delete old indices automatically.
D. To create new indices with custom mappings.

Solution

  1. Step 1: Understand Scroll API usage

    The Scroll API is designed to handle large result sets by breaking them into smaller parts.
  2. Step 2: Compare options with Scroll API purpose

    Options B, C, and D relate to other Elasticsearch features, not scrolling.
  3. Final Answer:

    To retrieve large sets of search results in small, manageable batches. -> Option A
  4. Quick Check:

    Scroll API = batch retrieval [OK]
Hint: Scroll API = fetch big results in small parts [OK]
Common Mistakes:
  • Confusing Scroll API with bulk update operations
  • Thinking Scroll API deletes or creates indices
  • Assuming Scroll API returns all results at once
2. Which of the following is the correct way to start a scroll search request in Elasticsearch using JSON?
easy
A. {"scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAA", "size": 100}
B. {"query": {"match_all": {}}, "scroll": "1m", "size": 100}
C. {"query": {"match": {"field": "value"}}, "timeout": "1m"}
D. {"scroll": "1m", "update": true}

Solution

  1. Step 1: Identify scroll search syntax

    Starting a scroll requires a query, a scroll keep-alive time, and a size for the batch. In a raw REST request the keep-alive is passed as the scroll query-string parameter (for example ?scroll=1m); client libraries typically accept it alongside the query and size.
  2. Step 2: Analyze options

    Option B includes the query, scroll time, and size. Option A uses scroll_id, which continues an existing scroll rather than starting one. Option C lacks the scroll parameter. Option D has an invalid update field.
  3. Final Answer:

    {"query": {"match_all": {}}, "scroll": "1m", "size": 100} -> Option B
  4. Quick Check:

    Start scroll = query + scroll + size [OK]
Hint: Start scroll with query + scroll + size keys [OK]
Common Mistakes:
  • Using scroll_id to start scroll instead of continue
  • Omitting the scroll parameter
  • Confusing scroll with timeout or update
3. Given the following scroll response snippet, what is the correct next step to fetch more results?
{
  "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAA",
  "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}
}
medium
A. Send a new search request without scroll_id.
B. Delete the scroll_id to reset the scroll context.
C. Use the scroll_id in a subsequent scroll request with the scroll parameter.
D. Use the hits array to manually fetch documents by ID.

Solution

  1. Step 1: Understand scroll continuation

    To get next batch, use the scroll_id from previous response with scroll parameter.
  2. Step 2: Evaluate options

    Option C correctly describes continuing with the scroll_id and the scroll parameter. Option A restarts the search and loses the scroll context. Option B is incorrect because clearing the scroll context ends the scroll instead of returning more results. Option D is manual and inefficient.
  3. Final Answer:

    Use the scroll_id in a subsequent scroll request with the scroll parameter. -> Option C
  4. Quick Check:

    Next scroll = scroll_id + scroll [OK]
Hint: Use scroll_id + scroll param to get next batch [OK]
Common Mistakes:
  • Restarting search instead of continuing scroll
  • Ignoring scroll parameter in next request
  • Trying to fetch documents manually by ID
4. You wrote this scroll request but get an error: {"scroll_id": "abc123"}. What is the likely cause?
medium
A. Missing the scroll parameter to keep the scroll context alive.
B. The scroll_id is invalid and must be a number.
C. You cannot use scroll_id in a scroll request.
D. The size parameter is required with scroll_id.

Solution

  1. Step 1: Check scroll request requirements

    When continuing a scroll, the scroll parameter (time) must be included to keep context alive.
  2. Step 2: Analyze error cause

    Option A correctly identifies the missing scroll parameter. Option B is wrong because the scroll_id is a string. Option C is false because the scroll_id is required. Option D is incorrect because size is set in the initial request, not in scroll continuation.
  3. Final Answer:

    Missing the scroll parameter to keep the scroll context alive. -> Option A
  4. Quick Check:

    Scroll continuation needs scroll param [OK]
Hint: Always include scroll param with scroll_id [OK]
Common Mistakes:
  • Omitting scroll parameter in scroll continuation
  • Assuming scroll_id must be numeric
  • Thinking size is needed every scroll request
5. You want to retrieve 10,000 documents using the Scroll API. Which approach is best to avoid memory issues and ensure all documents are retrieved?
hard
A. Use the Scroll API but do not specify the scroll parameter to speed up retrieval.
B. Set size to 10,000 in a single search request without scrolling.
C. Fetch documents by IDs one by one using separate queries.
D. Use a scroll time of 1 minute and fetch batches of 100 documents repeatedly until no hits remain.

Solution

  1. Step 1: Understand deep pagination with Scroll API

    Scroll API is designed to fetch large results in small batches with a scroll timeout to keep context alive.
  2. Step 2: Evaluate options for best practice

    Option D correctly uses a scroll time and a batch size to retrieve all documents safely. Option B risks memory overload. Option A is invalid because the scroll parameter is required to start a scroll. Option C is inefficient and slow.
  3. Final Answer:

    Use a scroll time of 1 minute and fetch batches of 100 documents repeatedly until no hits remain. -> Option D
  4. Quick Check:

    Scroll API + batch + scroll time = safe deep pagination [OK]
Hint: Fetch in batches with scroll time to avoid overload [OK]
Common Mistakes:
  • Requesting all documents at once causing memory errors
  • Omitting scroll parameter to speed up
  • Fetching documents individually instead of batches