GCP · Cloud · ~20 mins

Dataproc for Spark/Hadoop in GCP - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1 · Architecture · Intermediate
Choosing the right cluster type for a batch Spark job

You need to run a batch Spark job on Google Cloud Dataproc. The job runs once a day and processes large datasets. Which cluster configuration is the best choice to optimize cost and performance?

A. Create a permanent Dataproc cluster with autoscaling enabled and submit the job to it.
B. Create a transient Dataproc cluster that is created before the job and deleted after the job finishes.
C. Use a permanent Dataproc cluster without autoscaling and keep it running 24/7.
D. Use a single-node Dataproc cluster with preemptible workers for cost savings.
💡 Hint

Think about cost efficiency for jobs that run once a day.
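The transient pattern can be sketched with the gcloud CLI. This is a minimal sketch, not a complete recipe: the cluster name, region, job class, and bucket path are placeholders, and `--max-idle` (Dataproc scheduled deletion) is one of several ways to tear the cluster down automatically.

```shell
# Create a short-lived cluster that deletes itself after 10 minutes of
# inactivity (Dataproc scheduled deletion via --max-idle).
gcloud dataproc clusters create daily-batch-cluster \
    --region=us-central1 \
    --max-idle=10m

# Submit the daily Spark job (class and jar path are placeholders).
gcloud dataproc jobs submit spark \
    --cluster=daily-batch-cluster \
    --region=us-central1 \
    --class=com.example.DailyBatchJob \
    --jars=gs://my-bucket/daily-batch.jar

# Or delete the cluster explicitly once the job finishes.
gcloud dataproc clusters delete daily-batch-cluster \
    --region=us-central1 --quiet
```

For fully managed create-run-delete pipelines, Dataproc workflow templates achieve the same effect declaratively.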

Problem 2 · Configuration · Intermediate
Configuring autoscaling policy for Dataproc cluster

You want to configure an autoscaling policy for a Dataproc cluster to scale workers between 2 and 10 nodes based on YARN memory usage. Which configuration snippet correctly sets this policy?

A. { "workerConfig": { "minInstances": 2, "maxInstances": 10 }, "basicAlgorithm": { "yarnConfig": { "scaleUpMemoryThresholdPercent": 50, "scaleDownMemoryThresholdPercent": 80 } } }
B. { "workerConfig": { "minInstances": 10, "maxInstances": 2 }, "basicAlgorithm": { "yarnConfig": { "scaleUpMemoryThresholdPercent": 50, "scaleDownMemoryThresholdPercent": 80 } } }
C. { "workerConfig": { "minInstances": 2, "maxInstances": 10 }, "basicAlgorithm": { "yarnConfig": { "scaleUpMemoryThresholdPercent": 80, "scaleDownMemoryThresholdPercent": 50 } } }
D. { "workerConfig": { "minInstances": 2, "maxInstances": 10 }, "basicAlgorithm": { "yarnConfig": { "scaleUpMemoryThresholdPercent": 90, "scaleDownMemoryThresholdPercent": 90 } } }
💡 Hint

Scale up when usage is high, scale down when usage is low.
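For reference, an actual Dataproc autoscaling policy is typically written as a YAML file and imported with gcloud. Note that the v1 API expresses YARN-based scaling through `scaleUpFactor`/`scaleDownFactor` (fractions of pending/available memory) rather than the memory-threshold-percent fields shown in the options above; the sketch below uses those real fields, with the policy name and region as placeholders.

```shell
# Write a policy that scales workers between 2 and 10 nodes.
cat > autoscaling-policy.yml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
basicAlgorithm:
  yarnConfig:
    gracefulDecommissionTimeout: 3600s
    scaleUpFactor: 0.5
    scaleDownFactor: 0.5
EOF

# Import the policy, then attach it at cluster creation time.
gcloud dataproc autoscaling-policies import scale-2-to-10 \
    --region=us-central1 \
    --source=autoscaling-policy.yml

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --autoscaling-policy=scale-2-to-10
```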

Problem 3 · Security · Advanced
Securing access to Dataproc clusters with least privilege

You want to ensure that only specific users can submit jobs to a Dataproc cluster, and no one else can access the cluster's VMs. Which approach best enforces this principle?

A. Grant the users the 'roles/dataproc.jobUser' IAM role on the cluster and configure firewall rules to restrict SSH access to only admin IPs.
B. Grant the users the 'roles/dataproc.editor' IAM role on the cluster and disable SSH access to the cluster VMs.
C. Grant the users the 'roles/compute.instanceAdmin' role and allow SSH access to all cluster VMs.
D. Grant the users the 'roles/dataproc.viewer' role and enable OS Login for all users.
💡 Hint

Least privilege means giving only the permissions needed to submit jobs, not full editing or VM control.
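The least-privilege setup from option A can be sketched in two gcloud commands. `PROJECT_ID`, the user email, the network, and the admin IP range (a documentation range here) are placeholders.

```shell
# Grant only job-submission rights: roles/dataproc.jobUser allows
# submitting and listing jobs, but not editing clusters or SSHing to VMs.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:analyst@example.com" \
    --role="roles/dataproc.jobUser"

# Separately restrict SSH (tcp:22) to the admin IP range at the
# network layer with an ingress firewall rule.
gcloud compute firewall-rules create allow-ssh-admin-only \
    --network=default \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:22 \
    --source-ranges=203.0.113.0/24
```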

Problem 4 · Service Behavior · Advanced
Understanding job failure behavior on preemptible workers

You run a Spark job on a Dataproc cluster with preemptible worker nodes. During execution, some preemptible workers are terminated by Google Cloud. What is the expected behavior of the Spark job?

A. The Spark job automatically retries tasks lost due to preemptible worker termination and completes successfully if enough workers remain.
B. The Spark job immediately fails when any preemptible worker is terminated.
C. The Spark job continues without retrying lost tasks, resulting in incomplete output.
D. The Spark job pauses until the preemptible workers are replaced by new nodes.
💡 Hint

Think about how Spark handles worker node failures.
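Spark's fault tolerance re-schedules tasks lost to node failure, and the retry budget is tunable through standard Spark properties. A hedged sketch of a job submission that widens that budget for a cluster with preemptible secondary workers; cluster, class, and jar names are placeholders.

```shell
# Create a cluster whose secondary workers are preemptible VMs.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-secondary-workers=4 \
    --secondary-worker-type=preemptible

# Submit the job with a larger task-retry budget so tasks lost to
# preemption are re-attempted before the stage (and job) is failed.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=com.example.DailyBatchJob \
    --jars=gs://my-bucket/daily-batch.jar \
    --properties=spark.task.maxFailures=8
```

Because preemptible workers hold no HDFS data on Dataproc, their loss costs recomputation of in-flight tasks rather than data.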

Problem 5 · 🧠 Conceptual · Expert
Impact of network tags on Dataproc cluster firewall rules

You create a Dataproc cluster with network tags 'dataproc-cluster' and want to allow SSH access only from your office IP range. You create a firewall rule allowing TCP port 22 for source IPs in your office range and target tags 'dataproc-cluster'. However, you cannot SSH into the cluster VMs. What is the most likely cause?

A. The firewall rule source IP range is incorrect and does not include your office IPs.
B. The firewall rule priority is too low, so it is overridden by a deny-all rule.
C. Dataproc clusters ignore network tags for firewall rules and require IP-based rules only.
D. The cluster VMs do not have the 'dataproc-cluster' network tag assigned, so the firewall rule does not apply.
💡 Hint

Firewall rules with target tags only apply to VMs with matching network tags.
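The working configuration can be sketched as follows: the tag must be attached to the cluster VMs at creation time (Dataproc's `--tags` flag) for a `--target-tags` firewall rule to match. Cluster name, network, office IP range (a documentation range here), and zone are placeholders, as is the `-m` master-node naming convention.

```shell
# Create the cluster with the network tag applied to all its VMs.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --tags=dataproc-cluster

# Allow SSH from the office range only to VMs carrying that tag.
gcloud compute firewall-rules create allow-office-ssh \
    --network=default \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:22 \
    --source-ranges=198.51.100.0/24 \
    --target-tags=dataproc-cluster

# Verify the tag actually landed on a cluster VM.
gcloud compute instances describe my-cluster-m \
    --zone=us-central1-a \
    --format="value(tags.items)"
```

If the `describe` output does not list `dataproc-cluster`, the rule cannot match, which is the failure mode in option D.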