HDFS read and write operations in Hadoop - Time & Space Complexity
When working with HDFS, it is important to understand how reading and writing scale as data grows: how does the time to read or write a file change as the file size increases?
Analyze the time complexity of the following simplified HDFS write and read operations.
```
// Write operation
for each block in file:
    write block to DataNode
    replicate block to other DataNodes

// Read operation
for each block in file:
    read block from closest DataNode
```
This code writes a file by splitting it into blocks, writing each block, and replicating it. Reading fetches each block from the nearest node.
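The pseudocode above can be sketched as a small Python simulation that just counts operations. The function names, the operation counters, and the replication factor of 3 (HDFS's default) are illustrative assumptions for this lesson, not the real HDFS API.

```python
import math

BLOCK_SIZE_MB = 1   # simplified block size used throughout this lesson
REPLICATION = 3     # HDFS's default replication factor

def write_file(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Simulate an HDFS write: one primary write plus (replication - 1)
    replica copies per block. Returns the total operation count."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    ops = 0
    for _ in range(blocks):
        ops += 1                 # write block to a DataNode
        ops += replication - 1   # replicate block to other DataNodes
    return ops

def read_file(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Simulate an HDFS read: one fetch per block from the closest DataNode."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    ops = 0
    for _ in range(blocks):
        ops += 1                 # read block from closest DataNode
    return ops
```

With a 10 MB file and 1 MB blocks, `write_file(10)` counts 30 operations (10 writes plus 20 replica copies) and `read_file(10)` counts 10, both scaling linearly with file size.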
Look at what repeats as the file size grows.
- Primary operation: Writing or reading each block of the file.
- How many times: Number of blocks equals file size divided by block size, so it grows linearly with file size.
As the file size increases, the number of blocks increases proportionally.
| Input Size (MB) | Number of Blocks (block size = 1 MB) |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |
Pattern observation: Doubling the file size roughly doubles the number of blocks to read or write, so time grows linearly.
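The doubling pattern can be verified with the block-count formula itself; `block_count` is a hypothetical helper, and the 1 MB block size is this lesson's simplification (real HDFS defaults to much larger blocks, 128 MB in recent versions).

```python
import math

def block_count(file_size_mb, block_size_mb=1):
    """Number of HDFS blocks needed for a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

for size in (10, 100, 1000):
    print(size, "MB ->", block_count(size), "blocks")

# Doubling the file size doubles the block count, hence linear time:
assert block_count(200) == 2 * block_count(100)
```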
Time Complexity: O(n)
This means the time to read or write grows directly in proportion to the file size.
[X] Wrong: "Reading or writing a file in HDFS takes the same time no matter the file size."
[OK] Correct: Because HDFS splits files into blocks, more data means more blocks, so more operations and more time.
Understanding how HDFS handles data helps you explain distributed storage performance clearly, a useful skill in data engineering roles.
What if the block size was doubled? How would that affect the time complexity of reading and writing?
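One way to explore that question numerically, as a sketch under the same simplified model (the helper function here is illustrative, not part of HDFS):

```python
import math

def block_count(file_size_mb, block_size_mb):
    """Number of blocks for a file at a given block size."""
    return math.ceil(file_size_mb / block_size_mb)

# Compare 1 MB blocks against doubled 2 MB blocks for several file sizes.
for size in (100, 200, 400):
    print(size, "MB:", block_count(size, 1), "vs", block_count(size, 2), "blocks")
```

Doubling the block size halves the number of blocks, a smaller constant factor, but the block count still grows in proportion to the file size, so reading and writing remain O(n).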