Overview - Hash-based sharding

What is it?

Hash-based sharding splits a large database into smaller parts called shards and uses a hash function to decide which shard stores each piece of data. Because a good hash function scatters even very similar keys, data spreads evenly across many servers. This lets the database handle more data and more users than a single server could.

Why it matters

Without sharding, a database can become slow and overloaded as data grows, and with an uneven sharding scheme one server can still end up with most of the load. Hash-based sharding addresses this by spreading data evenly, so no single server gets too busy. It helps websites and apps stay fast and reliable even with lots of users and data.

Where it fits

Before learning hash-based sharding, you should understand basic database concepts and what sharding means. After this, you can learn about other sharding methods like range-based sharding and how to manage distributed databases.

Mental Model

Core Idea

Hash-based sharding uses a hash function to evenly distribute data across multiple database servers to balance load and improve performance.

Think of it like...

Imagine sorting mail by putting each letter into one of many mailboxes based on a secret code on the envelope. The code (hash) decides which mailbox (shard) gets the letter, so no mailbox gets too full.

┌───────────────┐
│   Data Item   │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│  Hash Function│
└──────┬────────┘
       │
       ├──────────────────┬──────────────────┐
       ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Shard 1    │   │   Shard 2    │   │   Shard 3    │
└──────────────┘   └──────────────┘   └──────────────┘
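
The flow in the diagram can be sketched in a few lines of JavaScript. This is a minimal illustration, not MongoDB's implementation: `hash32` is a stand-in hash function, `shardFor` is a hypothetical helper, and choosing the shard with a modulo is an assumption made to keep the example small.

```javascript
// Stand-in 32-bit string hash: FNV-1a style loop plus a murmur3-style mixing step.
// (Assumption: any well-mixing hash would do here.)
function hash32(key) {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h = Math.imul(h ^ key.charCodeAt(i), 0x01000193);
  }
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b);
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35);
  return (h ^ (h >>> 16)) >>> 0; // force an unsigned result
}

// Hypothetical helper: map a key to a shard index in [0, shardCount).
function shardFor(key, shardCount) {
  return hash32(String(key)) % shardCount;
}

// Count how many of 9,000 made-up user IDs land on each of 3 shards.
const counts = [0, 0, 0];
for (let id = 1; id <= 9000; id++) {
  counts[shardFor("user" + id, 3)] += 1;
}
console.log(counts); // three counts, each expected to be close to 3000
```

The same key always produces the same shard index, which is what lets a router find the data again later.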

Build-Up - 7 Steps

1

Foundation: What is Sharding in Databases

Concept: Introduces the basic idea of splitting data across multiple servers.

Sharding means breaking a big database into smaller parts called shards. Each shard holds a portion of the data. This helps the database handle more data and users by sharing the work across servers.

Result

You understand that sharding divides data to improve database scalability and performance.

Knowing what sharding is sets the stage for understanding how different methods decide where data goes.

2

Foundation: Understanding Hash Functions

3

Intermediate: How Hash-based Sharding Works

4

Intermediate: Choosing the Shard Key for Hashing

5

Intermediate: Handling Data Growth and Rebalancing

6

Advanced: MongoDB’s Hash-based Sharding Implementation

7

Expert: Trade-offs and Limitations of Hash-based Sharding

Under the Hood

Hash-based sharding works by applying a hash function to the shard key of each data item. In the simplest scheme, the shard is the remainder of the hash value divided by the number of shards (a modulo operation). MongoDB does not apply a modulo directly. Instead, it divides the range of possible hash values into chunks, each covering a contiguous range of hash values. These chunks are assigned to shards and can be split or migrated to balance load. The system tracks chunk metadata so that each query is routed to the shard that owns the matching chunk.

Why designed this way?

Hash-based sharding was designed to solve the uneven data distribution seen in range-based sharding. With a range-based scheme, a steadily increasing shard key such as a timestamp sends every new write to the shard that owns the highest range. By using a hash function, data is spread evenly regardless of the shard key's natural order. This avoids hotspots where one shard gets too much data or traffic. The trade-off is losing data locality; the gain is predictable, balanced load across shards.
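
The hotspot problem can be made concrete with a small simulation. The sketch below is illustrative only: `mix32` is a stand-in hash, and the ID space, shard count, and both placement functions are assumptions invented for the example.

```javascript
// Integer mixer (murmur3 finalizer) used as a stand-in hash function.
function mix32(x) {
  x = Math.imul(x ^ (x >>> 16), 0x85ebca6b);
  x = Math.imul(x ^ (x >>> 13), 0xc2b2ae35);
  return (x ^ (x >>> 16)) >>> 0;
}

const SHARDS = 4;
const MAX_ID = 100000; // assumed upper bound of the ID space

// Range-based placement: each shard owns one contiguous quarter of the ID space.
const rangeShard = (id) => Math.min(SHARDS - 1, Math.floor((id * SHARDS) / MAX_ID));
// Hash-based placement: the shard depends on the hash, not on the ID's order.
const hashShard = (id) => mix32(id) % SHARDS;

// Simulate the 1,000 most recent inserts of an ever-increasing ID.
const rangeLoad = new Array(SHARDS).fill(0);
const hashLoad = new Array(SHARDS).fill(0);
for (let id = 99000; id < 100000; id++) {
  rangeLoad[rangeShard(id)] += 1;
  hashLoad[hashShard(id)] += 1;
}
console.log("range:", rangeLoad); // every insert hits the last shard
console.log("hash: ", hashLoad);  // inserts are spread over all shards
```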

┌───────────────┐
│  Data Item    │
│ (shard key)   │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│  Hash Function│
│  (e.g. MD5)   │
└──────┬────────┘
       │ hash value
       ▼
┌───────────────┐
│   Chunk Map   │
│ (hash ranges) │
└──────┬────────┘
       │ shard ID
       ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Shard 1    │   │   Shard 2    │   │   Shard 3    │
└──────────────┘   └──────────────┘   └──────────────┘
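
Routing through a chunk map can be sketched as a lookup over hash ranges. This is a simplified model under stated assumptions: the 32-bit hash space, the `chunks` table, and the `route` and `moveChunk` helpers are all hypothetical, and `mix32` is a stand-in for whatever hash the database really uses.

```javascript
// Stand-in hash function (murmur3 finalizer) over integer keys.
function mix32(x) {
  x = Math.imul(x ^ (x >>> 16), 0x85ebca6b);
  x = Math.imul(x ^ (x >>> 13), 0xc2b2ae35);
  return (x ^ (x >>> 16)) >>> 0;
}

// Hypothetical chunk map: each chunk covers [min, max) of the 32-bit hash
// space and is currently assigned to one shard.
const chunks = [
  { min: 0x00000000, max: 0x40000000, shard: "shard1" },
  { min: 0x40000000, max: 0x80000000, shard: "shard2" },
  { min: 0x80000000, max: 0xc0000000, shard: "shard3" },
  { min: 0xc0000000, max: 0x100000000, shard: "shard1" },
];

// Route a shard-key value: hash it, then find the chunk whose range holds the hash.
function route(userId) {
  const h = mix32(userId);
  const chunk = chunks.find((c) => h >= c.min && h < c.max);
  return { hash: h, shard: chunk.shard };
}

// Migrating a chunk only changes its owner; keys and hash values stay the same.
function moveChunk(index, toShard) {
  chunks[index].shard = toShard;
}

const before = route(12345).shard;
moveChunk(3, "shard2");
console.log(before, "->", route(12345).shard);
```

Because only the chunk's owner changes, rebalancing in this model never requires rehashing any key.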

Myth Busters - 4 Common Misconceptions

Quick: Does hash-based sharding keep related data together on the same shard? Commit to yes or no.

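Common Belief: Hash-based sharding groups related data on the same shard for faster queries.

Reality: Hash-based sharding scatters related records across shards, because the hash ignores the shard key's natural order. Queries for ranges or related data often have to contact several shards.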

Quick: When adding a new shard, does hash-based sharding automatically keep all data in place? Commit to yes or no.

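Common Belief: Adding shards does not require moving existing data because the hash function keeps data stable.

Reality: Adding a shard changes where data belongs, so existing data has to be rebalanced. In MongoDB, the balancer migrates chunks onto the new shard.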

Quick: Can any field be used as a shard key in hash-based sharding? Commit to yes or no.

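Common Belief: Any field can be used as a shard key and will distribute data evenly.

Reality: The shard key needs high cardinality. A field with few unique values, such as country, produces only a few distinct hash values and leaves the shards unbalanced.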

Quick: Is hash-based sharding always better than range-based sharding? Commit to yes or no.

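Common Belief: Hash-based sharding is always the best because it balances data perfectly.

Reality: Hash-based sharding trades data locality for even distribution. Workloads that frequently query ranges or related data are often better served by range-based or zone sharding.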

Expert Zone

1

Hash-based sharding requires careful chunk management to avoid excessive chunk splitting and migration overhead.

2

The choice of hash function affects collision rates and distribution uniformity, impacting performance subtly.

3

MongoDB’s balancer process runs in the background to move chunks, but its timing and thresholds can affect system responsiveness.
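
To get a feel for how a migration threshold drives the amount of background work, here is a toy balancer. It is an illustrative model only: the `balance` function and its rule (move one chunk at a time from the fullest shard to the emptiest) are assumptions for this sketch, not MongoDB's actual balancing algorithm or thresholds.

```javascript
// Toy balancer: repeatedly move one chunk from the fullest shard to the
// emptiest until their chunk counts differ by less than `threshold`.
function balance(chunkCounts, threshold) {
  if (threshold < 2) throw new RangeError("threshold must be at least 2");
  const counts = { ...chunkCounts };
  let migrations = 0;
  for (;;) {
    const shards = Object.keys(counts);
    const fullest = shards.reduce((a, b) => (counts[b] > counts[a] ? b : a));
    const emptiest = shards.reduce((a, b) => (counts[b] < counts[a] ? b : a));
    if (counts[fullest] - counts[emptiest] < threshold) break;
    counts[fullest] -= 1;
    counts[emptiest] += 1;
    migrations += 1;
  }
  return { counts, migrations };
}

// A new, empty shard3 joins a cluster whose 24 chunks sit on two shards.
const result = balance({ shard1: 12, shard2: 12, shard3: 0 }, 2);
console.log(result); // 8 migrations, ending with 8 chunks on every shard
```

A larger threshold tolerates more imbalance but triggers fewer migrations, which is the kind of trade-off the points above describe.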

When NOT to use

Avoid hash-based sharding when your application needs to query ranges or related data frequently, as it scatters related records. Instead, use range-based sharding or zone sharding to keep related data together for faster queries.

Production Patterns

In production, hash-based sharding is often combined with careful shard key selection and monitoring of chunk sizes. Teams use automated balancer tuning and sometimes hybrid sharding strategies to optimize performance and scalability.

Connections

Consistent Hashing

Builds-on

Understanding consistent hashing helps grasp advanced sharding that minimizes data movement when scaling shards.
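
A minimal consistent-hashing ring shows why it limits data movement. This is a sketch under assumptions: `hash32` is a stand-in hash, `buildRing` and `lookup` are hypothetical helpers, and the choice of 100 virtual nodes per shard is arbitrary.

```javascript
// Stand-in 32-bit string hash: FNV-1a style loop plus a murmur3-style mixing step.
function hash32(key) {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h = Math.imul(h ^ key.charCodeAt(i), 0x01000193);
  }
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b);
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35);
  return (h ^ (h >>> 16)) >>> 0;
}

// Build a ring: each shard is placed at several points ("virtual nodes").
function buildRing(shards, vnodes = 100) {
  const ring = [];
  for (const shard of shards) {
    for (let v = 0; v < vnodes; v++) {
      ring.push({ point: hash32(shard + "#" + v), shard });
    }
  }
  return ring.sort((a, b) => a.point - b.point);
}

// A key belongs to the first ring point at or after its hash, wrapping around.
function lookup(ring, key) {
  const h = hash32(key);
  const entry = ring.find((e) => e.point >= h) || ring[0];
  return entry.shard;
}

const ring3 = buildRing(["shard1", "shard2", "shard3"]);
const ring4 = buildRing(["shard1", "shard2", "shard3", "shard4"]);

// Count how many made-up keys change owner when shard4 joins.
let moved = 0;
const total = 5000;
for (let i = 0; i < total; i++) {
  const key = "user" + i;
  if (lookup(ring3, key) !== lookup(ring4, key)) moved += 1;
}
console.log(`moved ${moved} of ${total} keys`); // expected to be roughly a quarter
```

Every key that moves ends up on the new shard; keys owned by the existing shards never shuffle among themselves.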

Load Balancing in Networks

Same pattern

Both use hashing to evenly distribute requests or data to avoid overload on any single server.

Modular Arithmetic in Mathematics

Underlying principle

Knowing modular arithmetic clarifies how hash values map to shard numbers, making the distribution predictable.

Common Pitfalls

#1 Choosing a shard key with few unique values, which causes uneven data distribution.

Wrong approach: sh.shardCollection('mydb.users', { country: 'hashed' })  // 'mydb' is a placeholder database name

Correct approach: sh.shardCollection('mydb.users', { userId: 'hashed' })

Root cause: Not realizing that the shard key must have high cardinality to distribute data evenly.
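
The effect of cardinality is easy to see in a simulation. The sketch below uses made-up data and a stand-in hash (`hash32`); the country mix, the four-shard cluster, and the modulo placement are assumptions chosen for illustration.

```javascript
// Stand-in 32-bit string hash: FNV-1a style loop plus a murmur3-style mixing step.
function hash32(key) {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h = Math.imul(h ^ key.charCodeAt(i), 0x01000193);
  }
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b);
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35);
  return (h ^ (h >>> 16)) >>> 0;
}

const SHARDS = 4;
const shardOf = (value) => hash32(String(value)) % SHARDS;

// Made-up data set: 10,000 users, 70% of them in one country.
const users = [];
for (let id = 0; id < 10000; id++) {
  const country = id % 10 < 7 ? "US" : id % 10 < 9 ? "DE" : "JP";
  users.push({ userId: "u" + id, country });
}

// Count documents per shard when hashing the given field.
function distribution(field) {
  const counts = new Array(SHARDS).fill(0);
  for (const u of users) counts[shardOf(u[field])] += 1;
  return counts;
}

const byCountry = distribution("country");
const byUserId = distribution("userId");
console.log("by country:", byCountry); // at most 3 shards receive any data
console.log("by userId: ", byUserId);  // roughly even across all 4 shards
```

With only three distinct countries there are only three distinct hash values, so at least one of the four shards stays empty no matter how good the hash function is.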

#2 Assuming data stays on the same shard after adding new shards.

Wrong approach: Adding shards without planning for data migration, expecting no impact.

Correct approach: Planning chunk migrations and rebalancing when adding shards to maintain balance.

Root cause: Not realizing that a new shard count changes where data belongs. With a plain hash-modulo scheme most keys map to a different shard, and in MongoDB the balancer has to migrate chunks onto the new shard.
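
For a plain hash-modulo scheme, the amount of data that moves can be measured directly. This sketch assumes placement by `hash % N` with a stand-in hash (`mix32`); it does not model MongoDB, which moves whole chunks instead of re-deriving a shard for each key.

```javascript
// Integer mixer (murmur3 finalizer) used as a stand-in hash function.
function mix32(x) {
  x = Math.imul(x ^ (x >>> 16), 0x85ebca6b);
  x = Math.imul(x ^ (x >>> 13), 0xc2b2ae35);
  return (x ^ (x >>> 16)) >>> 0;
}

// Count the keys whose shard changes when the cluster grows from 3 to 4 shards.
let moved = 0;
const total = 12000;
for (let id = 0; id < total; id++) {
  const h = mix32(id);
  if (h % 3 !== h % 4) moved += 1;
}
console.log(`${moved} of ${total} keys change shards`); // expected around three quarters
```

A key keeps its shard only when `h % 3` equals `h % 4`, which holds for 3 of every 12 consecutive hash values. So about 75% of the data would have to move, far more than the quarter the new shard actually needs.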

#3 Using hash-based sharding for queries that need data locality.

Wrong approach: Designing queries that join related data scattered across shards without considering performance.

Correct approach: Using range-based sharding or embedding related data to optimize locality.

Root cause: Ignoring the trade-off between even distribution and data locality in hash-based sharding.
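
The locality cost shows up as query fan-out. The sketch below counts how many shards a small range query has to contact under each placement; the ID space, the placement functions, and the `shardsTouched` helper are hypothetical, and `mix32` is a stand-in hash.

```javascript
// Integer mixer (murmur3 finalizer) used as a stand-in hash function.
function mix32(x) {
  x = Math.imul(x ^ (x >>> 16), 0x85ebca6b);
  x = Math.imul(x ^ (x >>> 13), 0xc2b2ae35);
  return (x ^ (x >>> 16)) >>> 0;
}

const SHARDS = 4;
const MAX_ID = 100000; // assumed upper bound of the ID space

const rangeShard = (id) => Math.min(SHARDS - 1, Math.floor((id * SHARDS) / MAX_ID));
const hashShard = (id) => mix32(id) % SHARDS;

// Number of distinct shards that hold at least one ID in [fromId, toId].
function shardsTouched(placement, fromId, toId) {
  const touched = new Set();
  for (let id = fromId; id <= toId; id++) touched.add(placement(id));
  return touched.size;
}

const rangeFanOut = shardsTouched(rangeShard, 1000, 1019);
const hashFanOut = shardsTouched(hashShard, 1000, 1019);
console.log("range-based:", rangeFanOut); // 1: neighbouring IDs live together
console.log("hash-based: ", hashFanOut);  // typically several shards
```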

Key Takeaways

Hash-based sharding uses a hash function on a shard key to evenly spread data across multiple servers.

Choosing a high-cardinality shard key is critical to avoid unbalanced shards and performance issues.

Hash-based sharding sacrifices data locality, which can slow queries needing related data from multiple shards.

Scaling with hash-based sharding requires rebalancing data, because adding shards changes which shard each piece of data belongs to.

Understanding the trade-offs helps choose the right sharding method for your database workload.