0
0
HadoopHow-ToBeginner ยท 4 min read

How HDFS Works in Hadoop: Overview and Example

The Hadoop Distributed File System (HDFS) stores large data by splitting it into blocks and distributing them across multiple machines called DataNodes. A central NameNode manages metadata and controls access, ensuring fault tolerance by replicating blocks across nodes.
๐Ÿ“

Syntax

HDFS commands are used to interact with the file system. The basic syntax for common operations is:

  • hdfs dfs -put <local_path> <hdfs_path>: Upload file to HDFS
  • hdfs dfs -get <hdfs_path> <local_path>: Download file from HDFS
  • hdfs dfs -ls <hdfs_path>: List files in HDFS directory
  • hdfs dfs -rm <hdfs_path>: Remove file from HDFS

These commands communicate with the NameNode to perform file operations on the distributed storage.

bash
hdfs dfs -put /local/file.txt /user/hadoop/file.txt
hdfs dfs -ls /user/hadoop
hdfs dfs -get /user/hadoop/file.txt /local/downloaded_file.txt
hdfs dfs -rm /user/hadoop/file.txt
Output
/user/hadoop/file.txt
๐Ÿ’ป

Example

This example shows how to upload a file to HDFS, list it, and then download it back to local storage.

bash
echo "Hello Hadoop HDFS" > localfile.txt
hdfs dfs -put localfile.txt /user/hadoop/
hdfs dfs -ls /user/hadoop
hdfs dfs -cat /user/hadoop/localfile.txt
hdfs dfs -get /user/hadoop/localfile.txt downloadedfile.txt
cat downloadedfile.txt
Output
/user/hadoop/localfile.txt Hello Hadoop HDFS Hello Hadoop HDFS
โš ๏ธ

Common Pitfalls

Common mistakes when working with HDFS include:

  • Not checking if the NameNode is running, causing commands to fail.
  • Uploading files without proper permissions, leading to access errors.
  • Ignoring replication factor settings, which can cause data loss if nodes fail.
  • Confusing local filesystem commands with HDFS commands.

Always verify the cluster status and use hdfs dfs commands for HDFS operations.

bash
hdfs dfs -put localfile.txt /user/hadoop/
# Wrong: Using 'hadoop fs -put' is legacy but still works
# Right: Use 'hdfs dfs -put' for modern Hadoop versions
๐Ÿ“Š

Quick Reference

CommandDescription
hdfs dfs -put Upload file to HDFS
hdfs dfs -get Download file from HDFS
hdfs dfs -ls List files in HDFS directory
hdfs dfs -rm Remove file from HDFS
hdfs dfs -cat Display file contents
โœ…

Key Takeaways

HDFS splits large files into blocks and stores them across DataNodes for distributed storage.
The NameNode manages metadata and controls access to files in HDFS.
Use 'hdfs dfs' commands to interact with HDFS safely and correctly.
Replication of blocks ensures data reliability and fault tolerance.
Always check cluster health and permissions before performing file operations.