Heartbeat in Distributed Systems: Definition and Usage
A heartbeat is a regular signal sent between nodes to indicate they are alive and functioning. It helps detect failures quickly by monitoring these periodic messages.
How It Works
Imagine a group of friends on a hiking trip who agree to check in with each other every 10 minutes to say "I'm okay." This check-in is like a heartbeat in a distributed system. Each computer, or node, sends a small message at regular intervals to a central monitor or to its peers.
If a node's heartbeat does not arrive within the expected time, the system assumes something is wrong, such as a crash or a lost connection. The system can then react quickly, for example by restarting the node or shifting its work to other nodes, which keeps the overall system healthy.
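The detect-and-react step can be sketched as two small functions. This is a minimal illustration, not a production design: the names `find_failed_nodes` and `reassign_tasks`, the node and task identifiers, and the round-robin reassignment policy are all assumptions made for the example.

```python
from typing import Dict, List


def find_failed_nodes(last_seen: Dict[str, float], now: float, timeout: float) -> List[str]:
    """Return nodes whose most recent heartbeat is older than `timeout` seconds."""
    return sorted(node for node, ts in last_seen.items() if now - ts > timeout)


def reassign_tasks(assignments: Dict[str, List[str]], failed: List[str]) -> Dict[str, List[str]]:
    """Move tasks from failed nodes to the surviving nodes, round-robin (hypothetical policy)."""
    survivors = sorted(n for n in assignments if n not in failed)
    if not survivors:
        raise RuntimeError("no healthy nodes left to take over work")
    result = {n: list(assignments[n]) for n in survivors}
    orphaned = [task for n in failed for task in assignments.get(n, [])]
    for i, task in enumerate(orphaned):
        result[survivors[i % len(survivors)]].append(task)
    return result


# Timestamps are in seconds; node-b last reported 8 seconds before `now`.
last_seen = {"node-a": 99.0, "node-b": 92.0, "node-c": 98.5}
assignments = {"node-a": ["t1"], "node-b": ["t2", "t3"], "node-c": []}

failed = find_failed_nodes(last_seen, now=100.0, timeout=5.0)
new_assignments = reassign_tasks(assignments, failed)
print(failed)           # ['node-b']
print(new_assignments)  # {'node-a': ['t1', 't2'], 'node-c': ['t3']}
```

Passing `now` in explicitly, rather than calling the clock inside the function, keeps the check easy to test with fixed timestamps.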
Example
This example shows a simple Python script in which a node sends a heartbeat message to a monitor every 2 seconds. The monitor prints a line each time it receives a heartbeat and raises an alert if more than 5 seconds pass without one. Because the node in this run never stops, the alert does not fire.
    import time
    import threading


    class Monitor:
        def __init__(self):
            self.last_heartbeat = time.time()
            self.lock = threading.Lock()

        def receive_heartbeat(self):
            # Record the arrival time of the latest heartbeat.
            with self.lock:
                self.last_heartbeat = time.time()
                print(f"Heartbeat received at {time.strftime('%X')}")

        def check_heartbeat(self):
            # Poll once per second; alert if the last heartbeat is over 5 seconds old.
            while True:
                time.sleep(1)
                with self.lock:
                    if time.time() - self.last_heartbeat > 5:
                        print("ALERT: Missed heartbeat! Node might be down.")


    class Node:
        def __init__(self, monitor):
            self.monitor = monitor

        def send_heartbeat(self):
            # Send a heartbeat to the monitor every 2 seconds.
            while True:
                time.sleep(2)
                self.monitor.receive_heartbeat()


    monitor = Monitor()
    node = Node(monitor)

    # Daemon threads stop automatically when the main thread exits.
    threading.Thread(target=monitor.check_heartbeat, daemon=True).start()
    threading.Thread(target=node.send_heartbeat, daemon=True).start()

    # Run for 10 seconds to demonstrate
    time.sleep(10)
When to Use
Heartbeats are used in distributed systems to monitor the health of nodes or services. They are essential when you want to detect failures quickly and maintain system reliability.
Common use cases include:
- Cluster management to detect failed servers
- Leader election algorithms where nodes need to know if the leader is alive
- Load balancers checking backend server availability
- Distributed databases monitoring node status to help keep data consistent
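As an illustration of the leader election case, the sketch below treats the lowest-numbered node with a recent heartbeat as the leader. The function name `current_leader`, the integer node IDs, and the lowest-ID rule are assumptions chosen for brevity. Production election protocols add safeguards, such as voting, that this sketch omits.

```python
from typing import Dict, Optional


def current_leader(last_seen: Dict[int, float], now: float, timeout: float) -> Optional[int]:
    """Pick the lowest-ID node with a recent heartbeat, or None if no node is alive."""
    alive = [node_id for node_id, ts in last_seen.items() if now - ts <= timeout]
    return min(alive) if alive else None


# Last heartbeat time (in seconds) per node ID.
last_seen = {1: 10.0, 2: 10.5, 3: 11.0}
leader_before = current_leader(last_seen, now=12.0, timeout=5.0)

# Nodes 2 and 3 keep sending heartbeats; node 1 goes silent after t=10.0.
last_seen[2] = 16.0
last_seen[3] = 16.5
leader_after = current_leader(last_seen, now=17.0, timeout=5.0)

print(leader_before, leader_after)  # 1 2
```

Once node 1's last heartbeat is more than 5 seconds old, it drops out of the live set and node 2 takes over.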
Key Points
- A heartbeat is a periodic signal sent to show a node is alive.
- Missing heartbeats indicate possible node failure.
- They help systems react quickly to failures and maintain uptime.
- Heartbeat intervals and timeout thresholds must be chosen carefully to balance detection speed and network load.
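The trade-off in the last point can be made concrete with some rough arithmetic. The sketch below assumes a single central monitor that polls every `check_period` seconds, and it ignores network latency. The function name `heartbeat_tradeoff` and the sample numbers are illustrative assumptions.

```python
def heartbeat_tradeoff(num_nodes: int, interval: float, timeout: float, check_period: float) -> dict:
    """Estimate message load and failure-detection delay for a central monitor."""
    if timeout <= interval:
        raise ValueError("timeout should exceed the interval, or healthy nodes may be flagged")
    return {
        # Each node sends one message per interval.
        "messages_per_second": num_nodes / interval,
        # Roughly how many consecutive heartbeats must be missed before an alert.
        "missed_beats_before_alert": int(timeout // interval),
        # Best case: the node fails just before its next heartbeat was due.
        "min_detection_delay": timeout - interval,
        # Worst case: the node fails right after sending, and the poll just missed the deadline.
        "max_detection_delay": timeout + check_period,
    }


fast = heartbeat_tradeoff(num_nodes=100, interval=2.0, timeout=5.0, check_period=1.0)
slow = heartbeat_tradeoff(num_nodes=100, interval=10.0, timeout=25.0, check_period=1.0)

print(fast)  # 50 messages/s, failures detected within 3 to 6 seconds
print(slow)  # 10 messages/s, failures detected within 15 to 26 seconds
```

With these numbers, a five times longer interval cuts the message load to one fifth, but a failure can go unnoticed for up to 26 seconds instead of 6.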