
Load factor and rehashing in Data Structures Theory - Full Explanation

Introduction
Imagine you have a big box with many small compartments to store your keys. But what happens when the box gets too full? Finding a key becomes slow and frustrating. This problem is similar to what happens in data structures called hash tables when they get crowded.
Explanation
Load Factor
Load factor is a number that tells us how full a hash table is. It is calculated by dividing the number of stored items by the total number of compartments (called buckets). A higher load factor means the table is more crowded, which can slow down searching and adding items.
Load factor measures how full a hash table is and affects its speed.
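As a quick sanity check of the formula, here is the calculation with made-up numbers (5 items in 8 buckets, purely for illustration):

```python
# Load factor = items stored / total buckets (illustrative numbers).
items_stored = 5
total_buckets = 8
load_factor = items_stored / total_buckets
print(load_factor)  # 0.625: the table is a bit over half full
```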
Why High Load Factor Causes Problems
When the load factor is high, many items end up in the same bucket, causing collisions. Collisions make it harder and slower to find or insert items because the hash table has to check multiple places. This reduces the efficiency of the hash table.
High load factor leads to collisions, slowing down operations.
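To see collisions happen, the sketch below maps five keys into only four buckets using a toy string-hashing function (illustrative only, not Python's built-in algorithm). With more keys than buckets, the pigeonhole principle guarantees at least two keys share a bucket no matter how good the hash function is:

```python
def bucket_index(key: str, num_buckets: int) -> int:
    """Toy polynomial string hash (illustrative; not Python's built-in hash)."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h % num_buckets

keys = ["apple", "banana", "cherry", "date", "elderberry"]
# 5 keys into 4 buckets: at least one collision is unavoidable (pigeonhole).
indices = [bucket_index(k, 4) for k in keys]
print("bucket indices:", indices)
print("collisions:", len(indices) - len(set(indices)))
```

Doubling the number of buckets spreads the same keys over more slots, which is exactly why rehashing into a larger table helps.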
Rehashing
Rehashing is the process of creating a bigger hash table and moving all existing items into it. This reduces the load factor by increasing the number of buckets. Rehashing helps keep the hash table fast by spreading items out more evenly.
Rehashing expands the table to reduce load factor and improve speed.
When Rehashing Happens
Rehashing usually happens automatically when the load factor crosses a certain limit, often around 0.7 or 70%. This threshold is chosen to balance memory use and speed. When triggered, the table size is increased, often doubled, and all items are reinserted.
Rehashing is triggered when load factor exceeds a set threshold to maintain performance.
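The sketch below simulates this trigger: starting from 8 buckets with a 0.7 threshold and doubling on each resize (the starting size and threshold are illustrative choices, not values from any particular library). Resizes fire after items 6, 12, and 23:

```python
# Simulate when resizes fire under a 0.7 load-factor threshold with doubling.
capacity = 8      # illustrative starting bucket count
threshold = 0.7   # common choice of rehashing threshold

for n_items in range(1, 25):
    if n_items / capacity > threshold:
        capacity *= 2
        print(f"after item {n_items}: load factor exceeded "
              f"{threshold}, grew to {capacity} buckets")
```

Note how resizes become less frequent as the table grows: doubling means each rehash buys roughly twice as many cheap insertions as the last one.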
Real World Analogy

Imagine a parking lot with a fixed number of spaces. When the lot is almost full, cars have to park farther away or in less convenient spots, making it harder to find your car. To fix this, the parking lot is expanded with more spaces, so cars can park comfortably and be found easily.

Load Factor → How full the parking lot is compared to its total spaces
High Load Factor Problems → Cars parked too close or in inconvenient spots, making it hard to find your car
Rehashing → Expanding the parking lot to add more spaces
Rehashing Trigger → Deciding to expand the lot when it reaches about 70% full
Diagram
┌────────────────┐      ┌────────────────┐      ┌────────────────┐
│ Small Hash     │      │ Load Factor    │      │ Rehashing:     │
│ Table (Buckets)│─────▶│ Increases as   │─────▶│ Create Larger  │
│  [ ] [ ] [ ]   │      │ Items Fill Up  │      │ Table & Move   │
│  [x] [x] [ ]   │      │                │      │ Items          │
└────────────────┘      └────────────────┘      └────────────────┘
        │                                               │
        ▼                                               ▼
┌────────────────────────┐                  ┌────────────────────────┐
│ Collisions Increase,   │                  │ Larger Table Has More  │
│ Searching Slows Down   │                  │ Buckets, Reducing Load │
└────────────────────────┘                  │ Factor and Collisions  │
                                            └────────────────────────┘
This diagram shows how a hash table fills up, causing load factor to increase, which leads to collisions and slower searches, triggering rehashing to create a larger table.
Key Facts
Load Factor: The ratio of stored items to total buckets in a hash table.
Collision: When two items hash to the same bucket in a hash table.
Rehashing: The process of resizing a hash table and redistributing its items.
Rehashing Threshold: The load factor value at which rehashing is triggered, often around 0.7.
Code Example
class HashTable:
    """Separate-chaining hash table that rehashes when load factor exceeds 0.7."""

    def __init__(self, size=4):
        self.size = size
        self.buckets = [[] for _ in range(size)]  # each bucket holds (key, value) pairs
        self.count = 0  # number of stored items

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        # Grow the table first if it is already too full.
        if self.load_factor() > 0.7:
            self.rehash()
        index = self._hash(key)
        for i, (k, v) in enumerate(self.buckets[index]):
            if k == key:  # key already present: update value in place
                self.buckets[index][i] = (key, value)
                return
        self.buckets[index].append((key, value))
        self.count += 1

    def load_factor(self):
        return self.count / self.size

    def rehash(self):
        # Double the bucket count, then reinsert every existing item
        # (indices change because _hash depends on the new size).
        old_buckets = self.buckets
        self.size *= 2
        self.buckets = [[] for _ in range(self.size)]
        self.count = 0
        for bucket in old_buckets:
            for key, value in bucket:
                self.insert(key, value)

    def __str__(self):
        return str(self.buckets)

ht = HashTable()
for i in range(10):
    ht.insert(f'key{i}', i)
print(f'Load factor after inserts: {ht.load_factor():.2f}')
print('Hash table buckets:', ht)
Common Confusions
Believing load factor is the number of collisions directly. Load factor measures fullness, not collisions; collisions depend on load factor but also on hash function quality.
Thinking rehashing happens every time an item is added. Rehashing only occurs when the load factor crosses a set threshold, not on every insertion.
Summary
Load factor shows how full a hash table is and affects its speed.
High load factor causes collisions, making searches slower.
Rehashing creates a bigger table to reduce load factor and improve performance.