Why documents are the unit of data in Elasticsearch - Why It Works This Way

Overview - Why documents are the unit of data
What is it?
In Elasticsearch, data is stored and managed as documents. A document is the basic unit of information, expressed as a JSON object. Each document represents a single entity or record, such as a user profile or a product listing. This approach makes it easy to store, search, and retrieve data efficiently.
Why it matters
Using documents as the unit of data lets Elasticsearch handle complex and varied information flexibly. A model built on fixed tables and rows would force every record into the same columns and often split one entity across several tables, which makes it harder to index and search that entity as a whole. Documents enable fast full-text search and easy scaling, which is crucial for applications like search engines and analytics platforms.
Where it fits
Before learning why documents are the unit of data, you should understand basic data storage concepts and JSON format. After this, you can explore how Elasticsearch indexes documents and performs searches, and then learn about mapping and querying documents in detail.
Mental Model
Core Idea
A document is a self-contained package of data that Elasticsearch stores, indexes, and searches as a single unit.
Think of it like...
Think of a document like a single page in a filing cabinet, where each page holds all the information about one person or item. Instead of searching through the whole cabinet, you quickly find the right page with the details you need.
┌───────────────┐
│   Document    │
│ ┌───────────┐ │
│ │ Field 1   │ │
│ │ Field 2   │ │
│ │ Field 3   │ │
│ └───────────┘ │
└───────────────┘
Each document contains multiple fields with data.
Build-Up - 6 Steps
1
Foundation: Understanding the Document Concept
Concept: Introduce what a document is in Elasticsearch and why it is the basic data unit.
A document in Elasticsearch is a JSON object that holds data about one entity. For example, a document could represent a book with fields like title, author, and year. Each document is stored in an index, is identified by an ID that is unique within that index, and can be retrieved or matched by a search individually.
Result
You understand that documents are the smallest pieces of data Elasticsearch works with.
Knowing that documents are self-contained helps you see why Elasticsearch can quickly find and retrieve data without scanning unrelated information.
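As a small illustration of the book example above, the sketch below builds one document as a Python dictionary and serializes it to the JSON text that would travel in a request body. The field values are made up for the example, and nothing here talks to a real cluster.

```python
import json

# Hypothetical book document; the field names follow the example in the text,
# the values are invented for illustration.
book = {"title": "Dune", "author": "Frank Herbert", "year": 1965}

# A document is sent to Elasticsearch as JSON text in the request body.
body = json.dumps(book)
print(body)

# Parsing the text back yields the same entity: the document is self-contained,
# so nothing outside it is needed to reconstruct the record.
restored = json.loads(body)
```

Because every field of the entity travels together, the document can be stored, indexed, and returned as one unit.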
2
Foundation: JSON Format as Document Structure
Concept: Explain why JSON is used to structure documents.
Documents use JSON because it is easy to read and write, supports nested data, and works well with web technologies. JSON fields can be simple values or complex objects, allowing flexible data representation.
Result
You can recognize how data is organized inside a document and why JSON fits this role.
Understanding JSON's flexibility clarifies how documents can represent diverse data types and structures.
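To make the nesting concrete, here is a minimal sketch with a hypothetical product document that mixes simple values, an array, and an inner object. The helper `field_paths` is an illustrative name, not an Elasticsearch API; it lists every leaf value under a dotted path, which is also how inner object fields such as `manufacturer.name` are addressed in queries.

```python
import json

# Hypothetical product document with a simple value, an array, and an object.
product = {
    "name": "Trail Backpack",
    "price": 89.5,
    "tags": ["outdoor", "hiking"],
    "manufacturer": {"name": "Acme", "country": "SE"},
}


def field_paths(obj, prefix=""):
    """Yield (dotted_path, value) for every leaf value in a JSON-like object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from field_paths(value, f"{prefix}{key}.")
    elif isinstance(obj, list):
        # Array items share the path of the field that holds the array.
        for item in obj:
            yield from field_paths(item, prefix)
    else:
        yield prefix.rstrip("."), obj


paths = list(field_paths(product))
for path, value in paths:
    print(path, "=", json.dumps(value))
```

The array under `tags` contributes two values to the same path, which is why a field can hold one value or many without any change to the document's structure.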
3
Intermediate: Documents vs. Traditional Rows
🤔 Before reading on: do you think documents are just like rows in a table or something different? Commit to your answer.
Concept: Compare documents with traditional database rows to highlight differences.
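Unlike rows in a table, documents can have different fields and nested data. Each document can be unique, and Elasticsearch does not require you to declare a fixed schema up front: by default, dynamic mapping adds new fields as they first appear. A given field must still keep one consistent data type within an index. This flexibility supports varied and evolving data.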
Result
You see that documents allow more adaptable data storage than rigid tables.
Knowing this difference helps you appreciate why Elasticsearch is suited for dynamic and complex data.
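The contrast with rows can be sketched in a few lines. The two documents below are hypothetical and describe different kinds of entities; forcing them into one table requires a column for every field that appears anywhere, padded with empty cells.

```python
# Two hypothetical documents of different shapes in the same collection.
docs = [
    {"type": "book", "title": "Dune", "author": "Frank Herbert"},
    {"type": "laptop", "title": "UltraBook 14", "ram_gb": 16,
     "ports": ["usb-c", "hdmi"]},
]

# A single table needs one column per field that appears in any record.
columns = sorted({key for doc in docs for key in doc})

# Each row must fill every column, using None where the entity has no value.
rows = [[doc.get(col) for col in columns] for doc in docs]
null_cells = sum(cell is None for row in rows for cell in row)

print(columns)
print(null_cells, "empty cells needed to fit both documents into one table")
```

As documents, each record carries only the fields it actually has, so adding a new kind of entity does not require altering the existing ones.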
4
Intermediate: Indexing Documents for Fast Search
🤔 Before reading on: do you think Elasticsearch searches documents by scanning all data or using an index? Commit to your answer.
Concept: Introduce how documents are indexed to enable quick searching.
When a document is added, Elasticsearch analyzes its fields and records the resulting terms in an inverted index, a structure that maps each term to the documents containing it. This works like the index at the back of a book: a search looks up the term and jumps directly to the matching documents without scanning everything.
Result
You understand that indexing makes document search fast and efficient.
Understanding indexing reveals why documents are practical units for search engines.
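A toy inverted index shows the idea. This is a sketch under simplifying assumptions: it splits text on whitespace and lowercases it, whereas real analyzers do considerably more, and the function names `build_inverted_index` and `search` are made up for the example.

```python
from collections import defaultdict

# Hypothetical documents keyed by ID; each has a single text field.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog tricks",
}


def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(docs)
print(search(index, "quick dog"))
```

The lookup touches only the entries for `quick` and `dog` and intersects their ID sets; the text of documents that contain neither term is never read.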
5
Advanced: Document Immutability and Updates
🤔 Before reading on: do you think Elasticsearch changes documents in place or replaces them? Commit to your answer.
Concept: Explain how documents are updated and why they are treated as immutable.
Elasticsearch treats stored documents as immutable, meaning it never changes them in place. Instead, an update writes a new version of the whole document and marks the old one as deleted; the old copy is physically removed later, when segments are merged. This approach simplifies concurrency and indexing but means updates are actually replacements.
Result
You learn how document immutability affects data updates and performance.
Knowing this prevents confusion about how Elasticsearch handles changes and why frequent updates cost more than a single changed field would suggest.
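The replace-and-mark behavior can be modeled with a toy store. This is not how Elasticsearch is implemented; `AppendOnlyStore` and its methods are hypothetical names for a sketch in which written copies are never modified, old copies are only flagged as deleted, and a separate merge step reclaims the space.

```python
class AppendOnlyStore:
    """Toy model of write-once storage with delete markers."""

    def __init__(self):
        self.entries = []     # written copies, never modified after append
        self.deleted = set()  # positions of copies marked as deleted
        self.live = {}        # doc_id -> position of the current copy

    def index(self, doc_id, source):
        """Write a new copy of the document and mark any older copy deleted."""
        version = 1
        if doc_id in self.live:
            old_pos = self.live[doc_id]
            self.deleted.add(old_pos)
            version = self.entries[old_pos][1] + 1
        self.entries.append((doc_id, version, dict(source)))
        self.live[doc_id] = len(self.entries) - 1
        return version

    def get(self, doc_id):
        """Return the source of the current copy."""
        return self.entries[self.live[doc_id]][2]

    def merge(self):
        """Rewrite storage without the copies marked deleted."""
        kept = [e for pos, e in enumerate(self.entries)
                if pos not in self.deleted]
        self.entries = kept
        self.deleted = set()
        self.live = {e[0]: pos for pos, e in enumerate(kept)}


store = AppendOnlyStore()
v1 = store.index("1", {"title": "Old Title"})
v2 = store.index("1", {"title": "New Title"})
copies_before_merge = len(store.entries)
store.merge()
copies_after_merge = len(store.entries)
```

Until the merge runs, both copies occupy space even though only the newest one is visible, which is the trade-off the step describes.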
6
Expert: Document Storage and Sharding Internals
🤔 Before reading on: do you think documents are stored all together or split across shards? Commit to your answer.
Concept: Dive into how documents are physically stored and distributed in Elasticsearch clusters.
Documents are stored in shards, which are partitions of an index spread across nodes. Each shard holds a subset of the index's documents, and a given document belongs to exactly one primary shard. This distribution allows Elasticsearch to scale horizontally and handle large data volumes efficiently.
Result
You understand the internal storage and distribution of documents in Elasticsearch.
Understanding sharding clarifies how Elasticsearch balances load and maintains performance at scale.
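How a document ends up on a particular shard can be sketched as a hash of its routing value, which defaults to the document ID, taken modulo the number of primary shards. The sketch below is a simplification: Elasticsearch uses its own hash function and formula, `zlib.crc32` is only a stand-in, and `pick_shard` is a hypothetical helper.

```python
import zlib
from collections import Counter


def pick_shard(routing_value, num_primary_shards):
    """Map a routing value (by default the document ID) to a shard number."""
    # Stand-in hash for illustration; not the function Elasticsearch uses.
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards


# Spread 1000 hypothetical document IDs over three primary shards.
doc_ids = [f"doc-{n}" for n in range(1000)]
placement = Counter(pick_shard(doc_id, 3) for doc_id in doc_ids)
print(placement)
```

The same ID always maps to the same shard, so a lookup by ID can go straight to the shard that holds the document. The result depends on the shard count, which is one reason the number of primary shards is chosen when the index is created.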
Under the Hood
Elasticsearch keeps each document's original JSON (the _source field) and indexes its fields separately. When a document is added, text fields are analyzed into tokens, and each resulting term in the inverted index points to the documents that contain it. Stored documents are never modified in place; updates write new versions. The index is split into shards, each hosted on a node, enabling distributed storage and search.
Why designed this way?
Documents as units allow flexible, schema-flexible data storage suited for varied real-world data. Immutability simplifies concurrency and indexing. Sharding supports horizontal scaling and fault tolerance. Alternatives like fixed tables or mutable records would limit flexibility and scalability.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│   Document 1  │──────▶│   Shard 1     │──────▶│   Node A      │
│   Document 2  │──────▶│   Shard 2     │──────▶│   Node B      │
│   Document 3  │──────▶│   Shard 3     │──────▶│   Node C      │
└───────────────┘       └───────────────┘       └───────────────┘
Each document is indexed and stored in a shard distributed across nodes.
Myth Busters - 4 Common Misconceptions
Quick: Do you think all documents in an index must have the same fields? Commit to yes or no.
Common Belief: All documents in an Elasticsearch index must have the same fields and structure.
Reality: Documents in the same index can have different fields and structures because Elasticsearch is schema-flexible.
Why it matters: Assuming uniform fields limits how you design your data and can cause confusion when indexing diverse data.
Quick: Do you think updating a document changes it instantly in place? Commit to yes or no.
Common Belief: When you update a document, Elasticsearch modifies it directly in storage.
Reality: Elasticsearch treats documents as immutable and replaces the old document with a new version during updates.
Why it matters: Misunderstanding this can lead to wrong expectations about update speed and data consistency.
Quick: Do you think Elasticsearch searches all documents linearly? Commit to yes or no.
Common Belief: Elasticsearch searches by scanning every document one by one.
Reality: Elasticsearch uses inverted indexes to jump directly to matching documents without scanning all data.
Why it matters: Believing in linear search underestimates Elasticsearch's speed and can misguide optimization efforts.
Quick: Do you think documents are stored on a single server only? Commit to yes or no.
Common Belief: All documents in an index are stored on one server or node.
Reality: Documents are split across multiple shards and nodes to distribute load and increase reliability.
Why it matters: Ignoring this can cause misunderstandings about scaling and fault tolerance.
Expert Zone
1
Documents can have nested objects and arrays, but querying nested fields requires special handling to avoid incorrect matches.
2
Elasticsearch merges segments of indexed documents in the background to optimize search speed and storage, which affects how quickly deleted documents free space.
3
Mapping conflicts can occur if documents with the same field name have different data types, requiring careful index design.
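The first point, incorrect matches on nested data, comes from how an array of plain objects is indexed: the values of each inner field are pooled together, so the link between values that belonged to the same object is lost. The sketch below imitates that effect with hypothetical helper functions and a made-up roster document; it does not call Elasticsearch.

```python
# Hypothetical document with an array of inner objects.
doc = {
    "title": "Team roster",
    "members": [
        {"first": "Alice", "last": "Smith"},
        {"first": "Bob", "last": "Jones"},
    ],
}


def flatten_object_array(items):
    """Pool the values of each inner field, as plain object arrays are indexed."""
    flat = {}
    for item in items:
        for key, value in item.items():
            flat.setdefault(key, []).append(value)
    return flat


def matches_flattened(flat, first, last):
    """Match when both values occur anywhere in the pooled fields."""
    return first in flat.get("first", []) and last in flat.get("last", [])


def matches_nested(items, first, last):
    """Match only when both values occur within the same inner object."""
    return any(i.get("first") == first and i.get("last") == last
               for i in items)


flat = flatten_object_array(doc["members"])
print(matches_flattened(flat, "Alice", "Jones"))     # cross-object match
print(matches_nested(doc["members"], "Alice", "Jones"))
```

No member is named Alice Jones, yet the flattened form matches. Mapping the field with the `nested` type and querying it with a nested query keeps each inner object separate, which corresponds to the second function.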
When NOT to use
Using documents as the unit is less suitable when strict relational integrity or complex multi-table joins are needed; traditional relational databases with normalized tables are better in those cases.
Production Patterns
In production, documents are designed to be denormalized, containing all needed data to avoid joins. Index templates and mappings enforce field types. Shard count and replication are tuned for performance and reliability. Bulk APIs are used for efficient document ingestion.
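For bulk ingestion, the request body is newline-delimited JSON: one action line followed by one document line per item, with a trailing newline at the end. The sketch below only builds that body as a string; the index name `products`, the sample documents, and the helper `build_bulk_body` are assumptions for illustration, and sending the body to the `_bulk` endpoint is left out.

```python
import json


def build_bulk_body(index_name, docs):
    """Build a newline-delimited body of index actions for the given documents."""
    lines = []
    for doc_id, source in docs:
        # Action line: which operation, which index, which document ID.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical denormalized documents: each carries everything it needs.
docs = [
    ("1", {"name": "Trail Backpack", "brand": "Acme", "price": 89.5}),
    ("2", {"name": "Camp Stove", "brand": "Acme", "price": 42.0}),
]

bulk_body = build_bulk_body("products", docs)
print(bulk_body)
```

Batching many documents into one request avoids a network round trip per document, which is the efficiency the pattern above refers to.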
Connections
JSON Data Format
Documents are structured as JSON objects, building directly on JSON syntax and semantics.
Understanding JSON helps grasp how documents store complex, nested data flexibly.
Inverted Index
Documents are indexed using inverted indexes to enable fast full-text search.
Knowing inverted indexes explains why documents can be searched quickly despite large data volumes.
Library Cataloging Systems
Like documents in Elasticsearch, library cards represent individual books with metadata for quick lookup.
Seeing documents as catalog cards helps understand how indexing and retrieval work in search systems.
Common Pitfalls
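#1 Expecting a partial update to change a field in place, and so treating updates as cheap.
Wrong approach: POST /index/_update/1 { "doc": { "title": "New Title" } } // sent expecting an in-place change to one field
Correct approach: The same _update request is valid, but plan for what it does: Elasticsearch fetches the current document, applies the change, and reindexes the whole document as a new version. A full reindex with POST /index/_doc/1 { "title": "New Title", "other_fields": "..." } has the same cost and replaces the entire document, so any field you omit is lost.
Root cause: Overlooking that Elasticsearch treats stored documents as immutable, so every update, partial or full, rewrites the whole document.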
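The difference between sending a partial change and sending a whole document can be sketched as two small functions. Both produce a new copy rather than editing the stored one; they differ in what happens to fields that are not mentioned. The function names and the sample document are hypothetical, and the merge here is shallow, which simplifies how inner objects would be combined.

```python
def full_replace(stored, new_source):
    """Whole-document write: the new source becomes the entire document."""
    return dict(new_source)


def partial_update(stored, partial_doc):
    """Partial write: read the current document, merge, write a new copy."""
    merged = dict(stored)
    merged.update(partial_doc)  # shallow merge, a simplification
    return merged


stored = {"title": "Old Title", "author": "A. Writer", "year": 2020}

after_partial = partial_update(stored, {"title": "New Title"})
after_replace = full_replace(stored, {"title": "New Title"})

print(after_partial)
print(after_replace)
```

In both cases `stored` is left untouched and a complete new document is produced; only the partial form carries the unmentioned `author` and `year` fields forward.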
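#2 Indexing documents with new or differently typed fields without checking how the index's mapping will treat them.
Wrong approach: Sending varied documents to an index whose mapping sets dynamic to strict or false, or reusing a field name with a different data type, which leads to rejected documents or fields that are kept in the source but not searchable.
Correct approach: Rely on dynamic mapping where it is enabled (the default), or update the mapping to add new fields before indexing, and keep each field name to a single data type.
Root cause: Confusing Elasticsearch's flexible schema with either a fixed relational schema or no schema at all.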
#3 Writing queries that bypass the inverted index and force a scan of every document.
Wrong approach: Using script queries that evaluate every document for a simple keyword search.
Correct approach: Use full-text queries, such as match, that leverage inverted indexes for efficient search.
Root cause: Not leveraging Elasticsearch's indexing capabilities properly.
Key Takeaways
Documents are the basic units of data in Elasticsearch, storing all information about one entity in JSON format.
Using documents allows flexible, schema-flexible data storage that can handle varied and nested data easily.
Documents are indexed for fast search using inverted indexes, enabling quick retrieval without scanning all data.
Elasticsearch treats documents as immutable, replacing them on updates to simplify concurrency and indexing.
Documents are distributed across shards and nodes, allowing Elasticsearch to scale and remain reliable.