
Rack Awareness in HDFS (Hadoop)

Introduction

Rack awareness lets Hadoop spread copies of data across different racks. This keeps data safe from rack failures and fast to access.

Use rack awareness in these situations:

When you want to protect data from rack-level failures such as power loss or network outages.
When you want to improve read speed by placing copies on different racks.
When you want to balance network traffic during data processing.
When you want to reduce the chance of losing all copies of a block at once.
When setting up a Hadoop cluster that spans multiple racks.
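Rack-aware placement works together with the replication factor. The `dfs.replication` property in hdfs-site.xml controls how many copies of each block HDFS keeps (3 is the Hadoop default):

```xml
<!-- hdfs-site.xml fragment: number of replicas per block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```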
Syntax
Java
class RackAwarePlacementPolicy {
    // Map of each DataNode to the rack it lives on (e.g. "/rack1")
    Map<DataNode, String> nodeToRackMap = new HashMap<>();

    // Look up the rack for a DataNode
    String getRack(DataNode node) {
        return nodeToRackMap.get(node);
    }

    // Choose DataNodes for block placement, preferring distinct racks
    List<DataNode> chooseDataNodes(int numReplicas) {
        // Logic to pick nodes from different racks
        // (full implementation in the sample program below)
    }
}

This is a simplified view of how rack awareness is implemented in Hadoop.
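For reference, HDFS's actual default policy (for a replication factor of 3) places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in the same rack as the second. A hypothetical sketch of that pattern, using plain strings for node names rather than this article's classes:

```java
import java.util.*;

// Sketch of HDFS's default 3-replica pattern (not the real Hadoop code):
// replica 1 on the writer's node, replica 2 on a different rack,
// replica 3 on another node in the same rack as replica 2.
class DefaultPlacementSketch {
    static List<String> place(String writer, Map<String, String> nodeToRack) {
        List<String> chosen = new ArrayList<>();
        chosen.add(writer);                                // replica 1: writer's node
        String writerRack = nodeToRack.get(writer);

        String remote = null;
        for (String node : nodeToRack.keySet()) {          // replica 2: a different rack
            if (!nodeToRack.get(node).equals(writerRack)) {
                remote = node;
                break;
            }
        }
        if (remote != null) {
            chosen.add(remote);
            String remoteRack = nodeToRack.get(remote);
            for (String node : nodeToRack.keySet()) {      // replica 3: same rack as replica 2
                if (!node.equals(remote) && nodeToRack.get(node).equals(remoteRack)) {
                    chosen.add(node);
                    break;
                }
            }
        }
        return chosen;
    }
}
```

With only one rack available, the sketch degrades to a single replica; real HDFS instead falls back to other nodes on the same rack.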

Hadoop uses a network topology script to map nodes to racks automatically.
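Hadoop calls the script named by the `net.topology.script.file.name` property in core-site.xml with one or more hostnames or IP addresses as arguments, and reads one rack path per argument from its standard output. A sketch of such a script (the hostnames and subnets here are made up):

```shell
#!/usr/bin/env bash
# Illustrative topology script; real mappings come from your network layout.
# Prints one rack path for each hostname/IP argument.
resolve_rack() {
  for host in "$@"; do
    case "$host" in
      node1|node3|10.1.1.*) echo "/rack1" ;;
      node2|10.1.2.*)       echo "/rack2" ;;
      *)                    echo "/default-rack" ;;
    esac
  done
}
resolve_rack "$@"
```

Nodes the script cannot identify fall back to /default-rack, which is also where Hadoop places every node when no script is configured at all.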

Examples
Example showing nodes assigned to two racks and choosing two nodes from different racks.
Java
RackAwarePlacementPolicy policy = new RackAwarePlacementPolicy();
policy.nodeToRackMap.put(node1, "/rack1");
policy.nodeToRackMap.put(node2, "/rack2");

List<DataNode> chosenNodes = policy.chooseDataNodes(2);
If the cluster contains no DataNodes, none can be chosen and the result is an empty list.
Java
// Edge case: Empty cluster
RackAwarePlacementPolicy emptyPolicy = new RackAwarePlacementPolicy();
List<DataNode> chosenNodes = emptyPolicy.chooseDataNodes(3); // returns empty list
When only one rack exists, all replicas are placed on that rack.
Java
// Edge case: Only one rack available
policy.nodeToRackMap.clear();
policy.nodeToRackMap.put(node1, "/rack1");
policy.nodeToRackMap.put(node2, "/rack1");

List<DataNode> chosenNodes = policy.chooseDataNodes(2); // both nodes from same rack
Sample Program

This program creates three data nodes on two racks. It then chooses two nodes for storing replicas, preferring different racks.

Java
import java.util.*;

class DataNode {
    String name;
    DataNode(String name) {
        this.name = name;
    }
    public String toString() {
        return name;
    }
}

class RackAwarePlacementPolicy {
    // LinkedHashMap keeps insertion order, so the demo output is deterministic
    Map<DataNode, String> nodeToRackMap = new LinkedHashMap<>();

    String getRack(DataNode node) {
        return nodeToRackMap.get(node);
    }

    List<DataNode> chooseDataNodes(int numReplicas) {
        List<DataNode> chosenNodes = new ArrayList<>();
        Set<String> racksUsed = new HashSet<>();

        for (DataNode node : nodeToRackMap.keySet()) {
            String rack = getRack(node);
            if (!racksUsed.contains(rack)) {
                chosenNodes.add(node);
                racksUsed.add(rack);
                if (chosenNodes.size() == numReplicas) {
                    break;
                }
            }
        }

        // If not enough racks, fill with nodes from racks already used
        if (chosenNodes.size() < numReplicas) {
            for (DataNode node : nodeToRackMap.keySet()) {
                if (!chosenNodes.contains(node)) {
                    chosenNodes.add(node);
                    if (chosenNodes.size() == numReplicas) {
                        break;
                    }
                }
            }
        }

        return chosenNodes;
    }
}

public class RackAwarenessDemo {
    public static void main(String[] args) {
        DataNode node1 = new DataNode("Node1");
        DataNode node2 = new DataNode("Node2");
        DataNode node3 = new DataNode("Node3");

        RackAwarePlacementPolicy policy = new RackAwarePlacementPolicy();
        policy.nodeToRackMap.put(node1, "/rack1");
        policy.nodeToRackMap.put(node2, "/rack2");
        policy.nodeToRackMap.put(node3, "/rack1");

        System.out.println("Before choosing nodes:");
        for (var entry : policy.nodeToRackMap.entrySet()) {
            System.out.println(entry.getKey() + " on " + entry.getValue());
        }

        List<DataNode> chosenNodes = policy.chooseDataNodes(2);

        System.out.println("\nChosen DataNodes for replicas:");
        for (DataNode node : chosenNodes) {
            System.out.println(node + " on " + policy.getRack(node));
        }
    }
}
Output

The program first lists every node with its rack, then prints the two chosen DataNodes: one from /rack1 and one from /rack2.
Important Notes

Time complexity: O(n), where n is the number of DataNodes, because the policy scans the node list to pick racks.

Space complexity: O(n) for the node-to-rack mapping.

Common mistake: forgetting to handle the case where there are fewer racks than replicas, which silently places all replicas on the same rack.

Use rack awareness to improve fault tolerance and network efficiency compared to random placement.

Summary

Rack awareness places data copies on different racks to protect against rack failures.

It improves data availability and network usage in Hadoop clusters.

When racks are limited, replicas may share racks but still try to spread out.