What is the most important factor when deciding the size of a Hadoop cluster?
Think about what directly impacts storage and processing needs.
The total volume of data determines how much storage and processing power is needed, which directly affects cluster size.
You have 10 TB of raw data. Hadoop uses replication factor 3 by default. What is the total storage needed in the cluster?
Remember Hadoop stores multiple copies of data for fault tolerance.
With replication factor 3, each data block is stored 3 times, so total storage is 10 TB * 3 = 30 TB.
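The replication arithmetic above can be sketched in a few lines of Python (the helper name `total_storage_tb` is illustrative, not part of any Hadoop API):

```python
def total_storage_tb(raw_data_tb, replication_factor=3):
    """Total cluster storage required when every block is stored
    replication_factor times (HDFS default is 3)."""
    return raw_data_tb * replication_factor

print(total_storage_tb(10))  # 30
```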
Given a Hadoop job requires 200 CPU cores and each node has 16 cores, how many nodes are needed?
Consider only whole nodes.
Divide total cores needed by cores per node and round up.
200 cores / 16 cores per node = 12.5 nodes, so 13 nodes are needed to meet the requirement.
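A minimal sketch of the round-up division, using Python's `math.ceil` (the function name `nodes_needed` is a hypothetical helper for this exercise):

```python
import math

def nodes_needed(total_cores, cores_per_node):
    """Round up, since only whole nodes can be provisioned."""
    return math.ceil(total_cores / cores_per_node)

print(nodes_needed(200, 16))  # 13
```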
You have a Hadoop cluster with 5 nodes. Each node has 4 TB storage. The replication factor is 3. How much usable storage is available in the cluster?
Calculate total raw storage then divide by replication factor.
Total raw storage is 5 nodes * 4 TB = 20 TB. Usable storage = 20 TB / 3 ≈ 6.67 TB.
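The usable-storage calculation can be expressed as a small sketch (the helper name `usable_storage_tb` is an assumption for illustration):

```python
def usable_storage_tb(nodes, tb_per_node, replication_factor=3):
    """Usable capacity = total raw capacity divided by the replication factor."""
    return nodes * tb_per_node / replication_factor

print(round(usable_storage_tb(5, 4), 2))  # 6.67
```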
You manage a Hadoop cluster running both batch and real-time jobs. Batch jobs need high storage capacity, while real-time jobs need low latency and high CPU. Which cluster sizing strategy best balances these needs?
Think about workload isolation and resource specialization.
Separating the workloads onto node groups optimized for each (storage-dense nodes for batch jobs, high-CPU low-latency nodes for real-time jobs) improves performance and resource utilization.