How to Configure Hadoop Cluster: Step-by-Step Guide
To configure a Hadoop cluster, install Hadoop on all nodes, edit the core-site.xml, hdfs-site.xml, and yarn-site.xml configuration files with your cluster details, and start the HDFS and YARN services. This setup enables distributed storage and processing across multiple machines.

Syntax
Configuring a Hadoop cluster involves editing key XML files on all nodes:
- core-site.xml: Defines the NameNode address and file system settings.
- hdfs-site.xml: Configures replication and storage directories for HDFS.
- yarn-site.xml: Sets ResourceManager and NodeManager details for job scheduling.
- mapred-site.xml: Specifies the MapReduce framework settings.
After configuration, use shell commands to format the NameNode and start Hadoop services.
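The mapred-site.xml file listed above typically needs only one property to run MapReduce jobs on YARN. A minimal sketch (this is the standard framework setting; adjust if you run a different execution engine):

```xml
<!-- mapred-site.xml: run MapReduce jobs on the YARN framework -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```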
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-hostname:9000</value>
  </property>
</configuration>
```

Example
This example shows a minimal setup for a single NameNode and two DataNodes. It includes configuration snippets and commands to start the cluster.
```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```

```xml
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/var/hadoop/dfs/data</value>
  </property>
</configuration>
```

```xml
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
```

```bash
# Commands to format the NameNode and start the cluster
hdfs namenode -format
start-dfs.sh
start-yarn.sh
```

Output

```text
Formatting using cluster id: CID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Starting namenode on master
Starting datanode on datanode1
Starting datanode on datanode2
Starting ResourceManager on master
Starting NodeManager on datanode1
Starting NodeManager on datanode2
```
Common Pitfalls
Common mistakes when configuring a Hadoop cluster include:
- An incorrect `fs.defaultFS` URL, causing clients to fail to connect to the NameNode.
- Missing or improper directory permissions on the NameNode and DataNode storage paths.
- Mismatched configuration files across nodes, leading to inconsistent cluster behavior.
- Forgetting to format the NameNode before starting the cluster.
- Firewall or network issues blocking communication between nodes.

Always verify configuration consistency and network connectivity before starting services.
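One way to catch the first and third pitfalls before starting services is to parse the config XML and compare values across copies of the files collected from each node. A minimal sketch using only the Python standard library (the two inline XML strings stand in for hypothetical per-node copies of core-site.xml):

```python
import xml.etree.ElementTree as ET

def read_property(xml_text: str, wanted: str):
    """Return the <value> of the property named `wanted`, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == wanted:
            return prop.findtext("value")
    return None

# Hypothetical core-site.xml contents gathered from two nodes.
master_copy = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://master:9000</value></property>
</configuration>"""
worker_copy = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://wrong-hostname:9000</value></property>
</configuration>"""

a = read_property(master_copy, "fs.defaultFS")
b = read_property(worker_copy, "fs.defaultFS")
if a != b:
    # Flags the inconsistent node before any daemon is started.
    print(f"fs.defaultFS mismatch: {a} vs {b}")
```

In practice you would read the files fetched from each node (e.g. via scp) instead of inline strings; the comparison logic stays the same.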
```xml
<!-- Wrong fs.defaultFS: hostname does not point at the NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://wrong-hostname:9000</value>
  </property>
</configuration>

<!-- Correct fs.defaultFS: points at the NameNode host -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```

Quick Reference
Summary tips for Hadoop cluster configuration:
- Set `fs.defaultFS` in `core-site.xml` to your NameNode address.
- Configure the replication factor and storage directories in `hdfs-site.xml`.
- Define the ResourceManager hostname in `yarn-site.xml`.
- Format the NameNode once before starting services.
- Ensure all nodes have consistent configuration files.
Key Takeaways
- Configure core-site.xml, hdfs-site.xml, and yarn-site.xml with correct cluster details on all nodes.
- Format the NameNode before starting HDFS services to initialize the file system.
- Ensure consistent configuration files and proper directory permissions across all cluster nodes.
- Start HDFS and YARN services using the start-dfs.sh and start-yarn.sh scripts.
- Verify network connectivity and firewall settings to allow node communication.