You need to run a batch Spark job on Google Cloud Dataproc. The job runs once a day and processes large data sets. Which cluster configuration is the best choice to optimize cost and performance?
Think about cost efficiency for jobs that run once a day.
Transient (job-scoped) clusters are created when the job starts and deleted after it completes, so you pay only for the daily run. A permanent cluster running 24/7 incurs cost even when idle. Autoscaling helps control cost but is better suited to workloads with variable, unpredictable load than to a predictable once-a-day batch.
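As a hedged sketch of this pattern (the cluster name, region, bucket, and job file are placeholders), a daily batch run with a transient cluster is simply create, submit, delete:

```shell
#!/usr/bin/env bash
# Sketch: transient Dataproc cluster for a once-a-day Spark batch job.
# Names (daily-batch, us-central1, my-bucket, my_job.py) are placeholders.
set -euo pipefail

gcloud dataproc clusters create daily-batch \
  --region=us-central1 \
  --num-workers=4 \
  --max-idle=30m   # safety net: auto-delete the cluster if it sits idle

gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/my_job.py \
  --cluster=daily-batch \
  --region=us-central1

# Delete immediately after the job finishes; no idle cost until tomorrow.
gcloud dataproc clusters delete daily-batch --region=us-central1 --quiet
```

Dataproc workflow templates (`gcloud dataproc workflow-templates`) offer a managed version of the same create-run-delete lifecycle.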
You want to configure an autoscaling policy for a Dataproc cluster to scale workers between 2 and 10 nodes based on YARN memory usage. Which configuration snippet correctly sets this policy?
Scale up when usage is high, scale down when usage is low.
The minInstances and maxInstances fields define the scaling range. The scaleUpMemoryThresholdPercent should be higher than the scaleDownMemoryThresholdPercent so the cluster does not flap (rapidly alternate between adding and removing workers). Option C correctly sets min 2, max 10, scale up at 80% YARN memory usage, and scale down at 50%.
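For reference, a sketch of how such a policy is defined and attached in practice. Note that the live Dataproc `AutoscalingPolicy` schema expresses scaling sensitivity as YARN scale factors (`scaleUpFactor`, `scaleDownFactor`) rather than the memory-threshold-percent fields named in this question; the 2-10 worker range carries over directly. Policy name, region, and timing values are illustrative placeholders:

```shell
# Sketch: define and attach a Dataproc autoscaling policy (2-10 workers).
cat > policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0           # add capacity for all pending YARN memory
    scaleDownFactor: 1.0         # remove all idle capacity
    gracefulDecommissionTimeout: 1h
EOF

gcloud dataproc autoscaling-policies import batch-policy \
  --source=policy.yaml --region=us-central1

gcloud dataproc clusters create autoscaled-cluster \
  --region=us-central1 \
  --autoscaling-policy=batch-policy
```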
You want to ensure that only specific users can submit jobs to a Dataproc cluster, and no one else can access the cluster's VMs. Which approach best enforces this principle?
Least privilege means giving only the permissions needed to submit jobs, not full editing or VM control.
The 'roles/dataproc.jobUser' role allows users to submit jobs without full cluster editing rights. Restricting SSH access via firewall rules limits VM access to admins only, enforcing least privilege.
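A hedged sketch of both halves of this setup (the project ID, user, network, and IP range are placeholders): grant only job submission via IAM, and restrict SSH to an admin IP range via a firewall rule:

```shell
# Grant job submission only -- no cluster create/edit/delete rights.
gcloud projects add-iam-policy-binding my-project \
  --member="user:analyst@example.com" \
  --role="roles/dataproc.jobUser"

# Allow SSH to the cluster VMs only from the admin IP range.
gcloud compute firewall-rules create allow-admin-ssh \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=203.0.113.0/24 \
  --target-tags=dataproc-cluster
```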
You run a Spark job on a Dataproc cluster with preemptible worker nodes. During execution, some preemptible workers are terminated by Google Cloud. What is the expected behavior of the Spark job?
Think about how Spark handles worker node failures.
Spark is designed to tolerate worker failures: tasks that were running on terminated preemptible workers are retried on the remaining (or newly added) nodes. The job therefore completes as long as enough resources remain, though runtime may increase while tasks are rescheduled.
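As a sketch (cluster and job names are placeholders), preemptible capacity is added as secondary workers, and Spark's per-task retry budget (`spark.task.maxFailures`, default 4) can be raised so preemptions are less likely to exhaust it:

```shell
# Sketch: mix stable primary workers with cheap preemptible secondaries.
gcloud dataproc clusters create mixed-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=8 \
  --secondary-worker-type=preemptible

# Raise Spark's per-task retry budget so preemptions rarely fail the job.
gcloud dataproc jobs submit spark \
  --cluster=mixed-cluster \
  --region=us-central1 \
  --class=com.example.BatchJob \
  --jars=gs://my-bucket/batch-job.jar \
  --properties=spark.task.maxFailures=8
```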
You create a Dataproc cluster with network tags 'dataproc-cluster' and want to allow SSH access only from your office IP range. You create a firewall rule allowing TCP port 22 for source IPs in your office range and target tags 'dataproc-cluster'. However, you cannot SSH into the cluster VMs. What is the most likely cause?
Firewall rules with target tags only apply to VMs with matching network tags.
If the cluster VMs lack the specified network tag, the firewall rule targeting that tag does not apply, blocking SSH access. Assigning the correct network tag to the VMs fixes this.
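A hedged sketch of the fix (cluster name, region, and zone are placeholders): the cluster must be created with the same network tag the firewall rule targets:

```shell
# The firewall rule only matches VMs carrying the 'dataproc-cluster' tag,
# so the tag must be applied when the cluster is created.
gcloud dataproc clusters create tagged-cluster \
  --region=us-central1 \
  --tags=dataproc-cluster

# Verify the tag landed on the master VM ('<cluster>-m' is the typical
# default naming pattern for Dataproc master nodes).
gcloud compute instances describe tagged-cluster-m \
  --zone=us-central1-a \
  --format="value(tags.items)"
```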