What Is a Heartbeat in a Distributed System?


Heartbeat in a Distributed System: Definition and Usage

In a distributed system, a heartbeat is a regular signal sent between nodes to indicate they are alive and functioning. It helps detect failures quickly by monitoring these periodic messages.

⚙️

How It Works

Imagine a group of friends on a hiking trip who agree to check in with each other every 10 minutes to say "I'm okay." This check-in is like a heartbeat in a distributed system. Each computer or node sends a small message at regular intervals to a central monitor or to each other.

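If a node's heartbeat does not arrive within the expected time window, the system assumes something is wrong: the node may have crashed, lost its connection, or become too slow to respond. A single missed heartbeat cannot tell these cases apart, so many systems wait for several missed heartbeats before declaring a failure. The system can then react quickly, for example by restarting the node or shifting its work to other nodes, keeping the whole system healthy.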
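The timeout check described above can be sketched in a few lines. This is a minimal illustration, not a production failure detector: the `FailureDetector` class, the node names, and the 5-second timeout are all assumptions made for the example, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class FailureDetector:
    """Tracks the last heartbeat time per node and flags nodes that go silent."""

    timeout: float  # seconds of silence before a node is suspected (assumed value)
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node_id: str, now: float) -> None:
        # Remember when this node last reported in.
        self.last_seen[node_id] = now

    def suspected_nodes(self, now: float) -> list[str]:
        # A node is suspected when its silence exceeds the timeout.
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = FailureDetector(timeout=5.0)
detector.record_heartbeat("node-a", now=0.0)
detector.record_heartbeat("node-b", now=0.0)
detector.record_heartbeat("node-a", now=4.0)  # node-b stays silent

print(detector.suspected_nodes(now=6.0))  # ['node-b']
```

At time 6.0, `node-a` was last heard from 2 seconds ago, while `node-b` has been silent for 6 seconds, which exceeds the timeout.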

💻

Example

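This example shows a simple Python script in which a node sends a heartbeat to a monitor every 2 seconds. The monitor prints a line each time a heartbeat arrives and raises an alert if more than 5 seconds pass without one. The node and the monitor run as threads in a single process, so a method call stands in for the network message.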

python

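import threading
import time

HEARTBEAT_INTERVAL = 2  # seconds between heartbeats
TIMEOUT = 5             # seconds of silence before the monitor alerts


class Monitor:
    def __init__(self):
        # time.monotonic() is unaffected by system clock adjustments.
        self.last_heartbeat = time.monotonic()
        self.lock = threading.Lock()

    def receive_heartbeat(self):
        # Record the arrival time of the latest heartbeat.
        with self.lock:
            self.last_heartbeat = time.monotonic()
        print(f"Heartbeat received at {time.strftime('%X')}")

    def check_heartbeat(self):
        # Check once per second whether the node has been silent too long.
        while True:
            time.sleep(1)
            with self.lock:
                silence = time.monotonic() - self.last_heartbeat
            if silence > TIMEOUT:
                print("ALERT: Missed heartbeat! Node might be down.")


class Node:
    def __init__(self, monitor):
        self.monitor = monitor

    def send_heartbeat(self):
        # A direct method call stands in for a network message.
        while True:
            time.sleep(HEARTBEAT_INTERVAL)
            self.monitor.receive_heartbeat()


monitor = Monitor()
node = Node(monitor)

# Daemon threads stop automatically when the main thread exits.
threading.Thread(target=monitor.check_heartbeat, daemon=True).start()
threading.Thread(target=node.send_heartbeat, daemon=True).start()

# Run for 10 seconds to demonstrate
time.sleep(10)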

Output

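Heartbeat received at 12:00:02
Heartbeat received at 12:00:04
Heartbeat received at 12:00:06
Heartbeat received at 12:00:08
Heartbeat received at 12:00:10

Timestamps will differ on each run, and the last line may be missing if the program exits just before the fifth heartbeat is sent. No alert appears because the node never stops sending.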
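To see the alert path, the node has to go quiet. The sketch below replays the same 2-second interval and 5-second timeout on a simulated clock, so it finishes instantly and always produces the same result. The `CRASH_AT` value and the `simulate` function are hypothetical additions for illustration and are not part of the script above.

```python
HEARTBEAT_INTERVAL = 2  # seconds between heartbeats, as in the example above
TIMEOUT = 5             # seconds of silence before alerting, as in the example above
CRASH_AT = 5            # hypothetical: the node stops sending from this second on


def simulate(duration: int) -> list[str]:
    """Step a simulated clock one second at a time and collect events."""
    events = []
    last_heartbeat = 0
    for now in range(1, duration + 1):
        node_alive = now < CRASH_AT
        if node_alive and now % HEARTBEAT_INTERVAL == 0:
            # The node is up and a heartbeat is due this second.
            last_heartbeat = now
            events.append(f"t={now}: heartbeat received")
        elif now - last_heartbeat > TIMEOUT:
            # The monitor has heard nothing for longer than the timeout.
            events.append(f"t={now}: ALERT missed heartbeat")
    return events


for line in simulate(12):
    print(line)
```

The last heartbeat arrives at t=4, so the first alert fires at t=10, the first second at which the silence is strictly longer than 5 seconds. The gap between the crash and the alert is the detection delay.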

🎯

When to Use

Heartbeats are used in distributed systems to monitor the health of nodes or services. They are essential when you want to detect failures quickly and maintain system reliability.

Common use cases include:

Cluster management to detect failed servers
Leader election algorithms where nodes need to know if the leader is alive
Load balancers checking backend server availability
Distributed databases tracking which replicas are reachable, so that reads and writes go only to live nodes
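As one illustration of the leader election use case, a follower can treat a long silence from the leader as a signal to start an election. The sketch below is a simplified assumption of how that might look: the `Follower` class, its role names, and the 3-second election timeout are invented for the example and do not reproduce any specific protocol.

```python
class Follower:
    """Watches leader heartbeats and becomes a candidate when they stop."""

    def __init__(self, election_timeout: float, now: float) -> None:
        self.election_timeout = election_timeout  # assumed value, in seconds
        self.last_leader_heartbeat = now
        self.role = "follower"

    def on_leader_heartbeat(self, now: float) -> None:
        # Any heartbeat from a live leader resets the timer and the role.
        self.last_leader_heartbeat = now
        self.role = "follower"

    def tick(self, now: float) -> str:
        # Called periodically; promotes this node if the leader has gone silent.
        silence = now - self.last_leader_heartbeat
        if self.role == "follower" and silence > self.election_timeout:
            self.role = "candidate"
        return self.role


follower = Follower(election_timeout=3.0, now=0.0)
follower.on_leader_heartbeat(now=1.0)
print(follower.tick(now=3.0))  # follower (only 2 seconds of silence)
print(follower.tick(now=4.5))  # candidate (3.5 seconds of silence)
follower.on_leader_heartbeat(now=5.0)
print(follower.tick(now=5.5))  # follower (a leader is alive again)
```

Real election protocols add more on top of this, such as randomized timeouts and voting, but the trigger is the same missing-heartbeat check.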

✅

Key Points

A heartbeat is a periodic signal sent to show a node is alive.
Missing heartbeats indicate possible node failure.
They help systems react quickly to failures and maintain uptime.
Heartbeat intervals and timeout thresholds must be chosen carefully to balance detection speed against network load and false alarms.
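The trade-off in the last point can be put into rough numbers. The sketch below assumes a deliberately simple model: every node sends one heartbeat per interval to a single monitor, and a node is declared failed after a fixed number of consecutive missed heartbeats. The function names and the figures (100 nodes, 3 missed heartbeats) are illustrative assumptions.

```python
def detection_time(interval_s: float, missed_allowed: int) -> float:
    """Approximate seconds of silence before a node is declared failed."""
    return interval_s * missed_allowed


def messages_per_second(node_count: int, interval_s: float) -> float:
    """Heartbeat messages the monitor receives per second from all nodes."""
    return node_count / interval_s


NODE_COUNT = 100     # assumed cluster size
MISSED_ALLOWED = 3   # assumed number of missed heartbeats tolerated

for interval_s in (1.0, 5.0):
    delay = detection_time(interval_s, MISSED_ALLOWED)
    load = messages_per_second(NODE_COUNT, interval_s)
    print(f"interval={interval_s}s  detection={delay}s  load={load} msg/s")
```

Under these assumptions, a 1-second interval detects a failure in about 3 seconds at a cost of 100 messages per second, while a 5-second interval cuts the load to 20 messages per second but takes about 15 seconds to notice.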

✅

Key Takeaways

A heartbeat is a regular signal nodes send to confirm they are alive in a distributed system.

Missed heartbeats reveal node failures quickly, which enables fast recovery.

Heartbeats are crucial for system health monitoring, leader election, and load balancing.

Heartbeat frequency and timeout trade detection speed against network overhead and false alarms, so they should be tuned for each system.

Implementing heartbeats improves reliability and fault tolerance in distributed systems.