You want to run a Spark job on AWS EMR that processes large datasets with high fault tolerance. Which cluster configuration best fits this need?
Think about fault tolerance and workload distribution in EMR clusters.
Option C provides a master node to manage the cluster, core nodes that store data in HDFS and run tasks, and task nodes for additional processing capacity. HDFS replication across multiple core nodes provides fault tolerance, and task nodes can be added or removed for scalability.
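As a sketch, this master/core/task layout can be requested with the AWS CLI; the cluster name, release label, instance types, and counts below are illustrative, not required values:

```shell
# Launch an EMR cluster running Spark: one master node, two core nodes
# (HDFS storage + task execution), and two task nodes for extra compute.
aws emr create-cluster \
  --name "spark-fault-tolerant" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
    InstanceGroupType=TASK,InstanceCount=2,InstanceType=m5.xlarge
```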
You want to restrict SSH access to your EMR cluster only from your office IP address. Which AWS feature should you configure?
Think about network-level access control.
Security groups act as virtual firewalls controlling inbound and outbound traffic. To restrict SSH access, add an inbound rule to the cluster's security group that allows TCP port 22 only from your office IP address (typically expressed as a /32 CIDR).
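A minimal example of such a rule with the AWS CLI; the security group ID and office IP below are placeholders:

```shell
# Allow SSH (TCP port 22) inbound only from a single office IP address.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.10/32
```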
Given an EMR cluster with auto scaling enabled, what happens when the workload decreases significantly?
Consider which nodes store data and which nodes are ephemeral.
Core nodes store data in HDFS and are essential for cluster stability, so they are not terminated automatically. Task nodes are ephemeral and can be scaled down to save costs.
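One way to express this behavior is with an EMR managed scaling policy; the cluster ID and capacity limits below are illustrative. Setting `MaximumCoreCapacityUnits` equal to the core node count keeps scale-in confined to task nodes:

```shell
# Managed scaling: the cluster can grow to 10 instances, but core capacity
# is capped at 2, so automatic scale-down only removes task nodes.
aws emr put-managed-scaling-policy \
  --cluster-id j-EXAMPLE12345 \
  --managed-scaling-policy '{
    "ComputeLimits": {
      "UnitType": "Instances",
      "MinimumCapacityUnits": 2,
      "MaximumCapacityUnits": 10,
      "MaximumCoreCapacityUnits": 2
    }
  }'
```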
You want to reduce EMR cluster costs by using Spot Instances for worker nodes. Which configuration is correct?
Think about which nodes are critical for cluster stability.
The master and core nodes are critical for cluster stability and HDFS data storage, so they should run On-Demand Instances. Task nodes can run on Spot Instances to save costs, since they hold no HDFS data and are replaceable if interrupted.
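This mix can be sketched with an instance-groups JSON file passed to `aws emr create-cluster`; the file name, cluster name, and instance sizes are examples:

```shell
# instance-groups.json: On-Demand master and core nodes, Spot task nodes.
cat > instance-groups.json <<'EOF'
[
  {"InstanceGroupType": "MASTER", "InstanceCount": 1, "InstanceType": "m5.xlarge", "Market": "ON_DEMAND"},
  {"InstanceGroupType": "CORE",   "InstanceCount": 2, "InstanceType": "m5.xlarge", "Market": "ON_DEMAND"},
  {"InstanceGroupType": "TASK",   "InstanceCount": 4, "InstanceType": "m5.xlarge", "Market": "SPOT"}
]
EOF

aws emr create-cluster \
  --name "spot-task-nodes" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-groups file://instance-groups.json
```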
What happens to the data stored on HDFS of EMR core nodes when the cluster is terminated?
Consider the nature of ephemeral storage on EMR core nodes.
HDFS storage on EMR core nodes is ephemeral and tied to the cluster lifecycle. Data must be saved to durable storage like S3 before cluster termination to avoid loss.
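A common way to persist HDFS data before termination is an `s3-dist-cp` step; the cluster ID, HDFS path, and bucket name below are placeholders:

```shell
# Copy job output from cluster HDFS to durable S3 storage before terminating.
aws emr add-steps \
  --cluster-id j-EXAMPLE12345 \
  --steps 'Type=CUSTOM_JAR,Name=CopyToS3,Jar=command-runner.jar,Args=[s3-dist-cp,--src=hdfs:///user/output,--dest=s3://example-bucket/output]'
```

Alternatively, writing Spark output directly to an `s3://` path avoids depending on cluster-local HDFS at all.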