Cluster planning and sizing in Hadoop - Time & Space Complexity
When planning and sizing a Hadoop cluster, we want to understand how the time to process data grows as we add more data or nodes.
We ask: How does the cluster's performance change when the workload or cluster size changes?
Analyze the time complexity of this simplified Hadoop job scheduling snippet.
```
// Pseudocode for job scheduling in a Hadoop cluster
for each job in jobQueue:
    for each node in clusterNodes:
        if node has free slots:
            assign job task to node
            break
        end if
    end for
end for
```
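The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not Hadoop's actual scheduler API; the function name, the `free_slots` dict, and the job/node identifiers are all assumptions made for the example.

```python
def assign_jobs(job_queue, free_slots):
    """First-fit assignment: each job goes to the first node with a free slot.

    job_queue:  list of job ids (illustrative; not a real Hadoop type)
    free_slots: dict mapping node id -> number of free task slots
    Returns a dict mapping job id -> assigned node id (unassigned jobs omitted).
    """
    assignments = {}
    for job in job_queue:                        # outer loop: n jobs
        for node, slots in free_slots.items():   # inner loop: up to m nodes
            if slots > 0:                        # node has a free slot
                assignments[job] = node
                free_slots[node] -= 1
                break                            # first fit: stop scanning nodes
    return assignments

# Example: two nodes, three jobs
print(assign_jobs(["j1", "j2", "j3"], {"n1": 1, "n2": 2}))
# → {'j1': 'n1', 'j2': 'n2', 'j3': 'n2'}
```

Note the `break`: a job stops scanning as soon as it finds a free slot, so n * m is a worst-case bound, not the cost of every run.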
This code assigns each job task to the first available node with free slots in the cluster.
Look at the loops that repeat work:
- Primary operation: the free-slot check inside the nested loops.
- How many times: the outer loop runs once per job (n jobs); the inner loop runs up to once per node (m nodes) for each job.
As the number of jobs or nodes changes, the work changes like this:
| Input Size (jobs n) | Nodes (m) | Approx. Operations |
|---|---|---|
| 10 | 5 | Up to 50 checks |
| 100 | 10 | Up to 1,000 checks |
| 1000 | 20 | Up to 20,000 checks |
Pattern observation: the total number of checks grows roughly as the product of the number of jobs and the number of nodes.
Time Complexity: O(n * m)
This means the time to assign tasks grows proportionally to the number of jobs times the number of nodes.
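To see where the table's numbers come from, here is a small sketch that counts slot checks in the worst case, where no node ever has a free slot, so every job scans all m nodes. The function name and loop structure are illustrative assumptions for this example.

```python
def worst_case_checks(n_jobs, n_nodes):
    """Count slot checks when no node ever has a free slot (worst case)."""
    checks = 0
    for _ in range(n_jobs):       # n jobs in the queue
        for _ in range(n_nodes):  # every node is scanned; no early break
            checks += 1           # one free-slot check per (job, node) pair
    return checks

# Reproduces the table: 10*5 = 50, 100*10 = 1,000, 1000*20 = 20,000 checks.
for n, m in [(10, 5), (100, 10), (1000, 20)]:
    print(n, m, worst_case_checks(n, m))
```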
[X] Wrong: "Adding more nodes always makes scheduling faster, and linearly so."
[OK] Correct: The scheduler still scans nodes for each job, so adding nodes can actually increase scheduling time unless the node search itself is managed efficiently.
Understanding how cluster size and job count affect scheduling time helps you design better systems and explain your reasoning clearly in interviews.
"What if the scheduler used a queue to track free nodes instead of checking all nodes each time? How would the time complexity change?"
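One possible answer to that question can be sketched as follows: keep free slots in a queue, so each assignment pops a slot in O(1) instead of scanning all m nodes. Building the queue costs time proportional to the total slot count, so scheduling drops from O(n * m) to roughly O(n + total slots). The function name and data shapes are assumptions for this sketch, not Hadoop's real scheduler.

```python
from collections import deque

def assign_jobs_with_queue(job_queue, free_slots):
    """Queue-based first-fit: O(1) per assignment instead of an O(m) node scan.

    free_slots: dict mapping node id -> number of free task slots (illustrative)
    """
    # Build the queue once: one entry per free slot on each node
    slot_queue = deque()
    for node, slots in free_slots.items():
        slot_queue.extend([node] * slots)

    assignments = {}
    for job in job_queue:
        if not slot_queue:       # no free slots left anywhere
            break
        assignments[job] = slot_queue.popleft()  # O(1) pop, no node scan
    return assignments

print(assign_jobs_with_queue(["j1", "j2", "j3"], {"n1": 1, "n2": 2}))
# → {'j1': 'n1', 'j2': 'n2', 'j3': 'n2'}
```

The trade-off: the queue must be kept in sync as tasks finish and slots free up, which is why real schedulers maintain such state incrementally rather than rescanning the cluster per job.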