Node decommissioning and scaling in Hadoop - Time & Space Complexity
When a Hadoop cluster scales or a node is removed, the system re-replicates data blocks so they stay available. We want to know how the time spent moving data changes as the cluster grows.
How much more work is created when we add or remove nodes?
Analyze the time complexity of the following code snippet.
// Pseudo Hadoop code for node decommissioning
for each block in clusterDataBlocks:
    if block is on decommissionedNode:
        find newNode to replicate block
        replicate block to newNode
        update metadata
This code moves data blocks from a node being removed to other nodes, updating the system metadata.
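The pseudocode above can be sketched as runnable Python. This is a minimal illustration, not a real Hadoop API: the block dictionaries, node names, and `decommission` function are all hypothetical stand-ins for HDFS internals.

```python
# Hypothetical sketch of the decommissioning loop; names are illustrative.
def decommission(blocks, decommissioned_node, live_nodes, metadata):
    """Move every replica hosted on the leaving node to a live node."""
    moves = 0
    for block in blocks:                          # loop over all cluster blocks
        if decommissioned_node in block["nodes"]:
            # find a live node that does not already hold this block
            target = next(n for n in live_nodes if n not in block["nodes"])
            block["nodes"].remove(decommissioned_node)
            block["nodes"].append(target)         # "replicate" to the new node
            metadata[block["id"]] = list(block["nodes"])  # update metadata
            moves += 1
    return moves

# Usage: three blocks, two of which live on the node being removed ("n1").
blocks = [
    {"id": "b1", "nodes": ["n1", "n2"]},
    {"id": "b2", "nodes": ["n3", "n2"]},
    {"id": "b3", "nodes": ["n1", "n3"]},
]
meta = {}
print(decommission(blocks, "n1", ["n2", "n3", "n4"], meta))  # prints 2
```

Note that even blocks not on the decommissioned node are still examined, which is why the loop itself, not just the moves, drives the cost.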
Identify the repeated work: loops, recursion, or traversals over the data.
- Primary operation: Loop over all data blocks in the cluster.
- How many times: Once for each block, which depends on total data size.
As the number of data blocks grows, the time to move blocks off the decommissioned node grows in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 block moves |
| 100 | 100 block moves |
| 1000 | 1000 block moves |
Pattern observation: The work grows directly with the number of blocks to move.
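The table above can be reproduced with a tiny counting simulation (illustrative only, not Hadoop code): one move plus one metadata update per block means operations track input size one-for-one.

```python
# Count the operations performed for a given number of blocks to move.
def moves_required(num_blocks):
    ops = 0
    for _ in range(num_blocks):  # one block move + metadata update each
        ops += 1
    return ops

for n in (10, 100, 1000):
    print(n, moves_required(n))  # 10 10 / 100 100 / 1000 1000
```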
Time Complexity: O(n)
This means the time to decommission scales linearly with the number of data blocks involved.
[X] Wrong: "Decommissioning takes constant time no matter how many blocks the node holds."
[OK] Correct: Each block must be moved and updated, so more blocks mean more work and more time.
Understanding how data moves when nodes change helps you explain system scaling and reliability in real jobs.
"What if the cluster used data replication to multiple nodes at once? How would that affect the time complexity?"