You need to run a batch Spark job on Google Cloud Dataproc. The job runs once a day and processes large data sets. Which cluster configuration is the best choice to optimize cost and performance?
Think about cost efficiency for jobs that run once a day.
Transient (job-scoped) clusters are created when the job starts and deleted after it completes, so you pay only for the daily run. A permanent cluster running 24/7 incurs cost even when idle. Autoscaling helps control cost but is better suited to workloads with variable, unpredictable load than to a predictable once-a-day batch.
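As a hedged sketch of this pattern (the cluster name, region, bucket, and job file are placeholders), a daily batch run with a transient cluster is simply create, submit, delete:

```shell
#!/usr/bin/env bash
# Sketch: transient Dataproc cluster for a once-a-day Spark batch job.
# Names (daily-batch, us-central1, my-bucket, my_job.py) are placeholders.
set -euo pipefail

gcloud dataproc clusters create daily-batch \
  --region=us-central1 \
  --num-workers=4 \
  --max-idle=30m   # safety net: auto-delete the cluster if it sits idle

gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/my_job.py \
  --cluster=daily-batch \
  --region=us-central1

# Delete immediately after the job finishes; no idle cost until tomorrow.
gcloud dataproc clusters delete daily-batch --region=us-central1 --quiet
```

Dataproc workflow templates (`gcloud dataproc workflow-templates`) offer a managed version of the same create-run-delete lifecycle.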
You want to configure an autoscaling policy for a Dataproc cluster to scale workers between 2 and 10 nodes based on YARN memory usage. Which configuration snippet correctly sets this policy?
Scale up when usage is high, scale down when usage is low.
The minInstances and maxInstances fields define the scaling range. The scaleUpMemoryThresholdPercent should be higher than the scaleDownMemoryThresholdPercent so the cluster does not flap (rapidly alternate between adding and removing workers). Option C correctly sets min 2, max 10, scale up at 80% YARN memory usage, and scale down at 50%.
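For reference, a sketch of how such a policy is defined and attached in practice. Note that the live Dataproc `AutoscalingPolicy` schema expresses scaling sensitivity as YARN scale factors (`scaleUpFactor`, `scaleDownFactor`) rather than the memory-threshold-percent fields named in this question; the 2-10 worker range carries over directly. Policy name, region, and timing values are illustrative placeholders:

```shell
# Sketch: define and attach a Dataproc autoscaling policy (2-10 workers).
cat > policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0           # add capacity for all pending YARN memory
    scaleDownFactor: 1.0         # remove all idle capacity
    gracefulDecommissionTimeout: 1h
EOF

gcloud dataproc autoscaling-policies import batch-policy \
  --source=policy.yaml --region=us-central1

gcloud dataproc clusters create autoscaled-cluster \
  --region=us-central1 \
  --autoscaling-policy=batch-policy
```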
You want to ensure that only specific users can submit jobs to a Dataproc cluster, and no one else can access the cluster's VMs. Which approach best enforces this principle?
Least privilege means giving only the permissions needed to submit jobs, not full editing or VM control.
The 'roles/dataproc.jobUser' role allows users to submit jobs without full cluster editing rights. Restricting SSH access via firewall rules limits VM access to admins only, enforcing least privilege.
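A hedged sketch of both halves of this setup (the project ID, user, network, and IP range are placeholders): grant only job submission via IAM, and restrict SSH to an admin IP range via a firewall rule:

```shell
# Grant job submission only -- no cluster create/edit/delete rights.
gcloud projects add-iam-policy-binding my-project \
  --member="user:analyst@example.com" \
  --role="roles/dataproc.jobUser"

# Allow SSH to the cluster VMs only from the admin IP range.
gcloud compute firewall-rules create allow-admin-ssh \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=203.0.113.0/24 \
  --target-tags=dataproc-cluster
```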
You run a Spark job on a Dataproc cluster with preemptible worker nodes. During execution, some preemptible workers are terminated by Google Cloud. What is the expected behavior of the Spark job?
Think about how Spark handles worker node failures.
Spark is designed to tolerate worker failures: tasks that were running on terminated preemptible workers are retried on the remaining (or newly added) nodes. The job therefore completes as long as enough resources remain, though runtime may increase while tasks are rescheduled.
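As a sketch (cluster and job names are placeholders), preemptible capacity is added as secondary workers, and Spark's per-task retry budget (`spark.task.maxFailures`, default 4) can be raised so preemptions are less likely to exhaust it:

```shell
# Sketch: mix stable primary workers with cheap preemptible secondaries.
gcloud dataproc clusters create mixed-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=8 \
  --secondary-worker-type=preemptible

# Raise Spark's per-task retry budget so preemptions rarely fail the job.
gcloud dataproc jobs submit spark \
  --cluster=mixed-cluster \
  --region=us-central1 \
  --class=com.example.BatchJob \
  --jars=gs://my-bucket/batch-job.jar \
  --properties=spark.task.maxFailures=8
```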
You create a Dataproc cluster with network tags 'dataproc-cluster' and want to allow SSH access only from your office IP range. You create a firewall rule allowing TCP port 22 for source IPs in your office range and target tags 'dataproc-cluster'. However, you cannot SSH into the cluster VMs. What is the most likely cause?
Firewall rules with target tags only apply to VMs with matching network tags.
If the cluster VMs lack the specified network tag, the firewall rule targeting that tag does not apply, blocking SSH access. Assigning the correct network tag to the VMs fixes this.
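A hedged sketch of the fix (cluster name, region, and zone are placeholders): the cluster must be created with the same network tag the firewall rule targets:

```shell
# The firewall rule only matches VMs carrying the 'dataproc-cluster' tag,
# so the tag must be applied when the cluster is created.
gcloud dataproc clusters create tagged-cluster \
  --region=us-central1 \
  --tags=dataproc-cluster

# Verify the tag landed on the master VM ('<cluster>-m' is the typical
# default naming pattern for Dataproc master nodes).
gcloud compute instances describe tagged-cluster-m \
  --zone=us-central1-a \
  --format="value(tags.items)"
```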