Apache Spark · ~20 mins

Google Dataproc overview in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️
Dataproc Mastery Badge
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual · intermediate · 1:30
What is the primary purpose of Google Dataproc?

Google Dataproc is a managed service for running which type of workloads?

A. Running big data processing jobs using Apache Spark and Hadoop clusters.
B. Hosting web applications with automatic scaling.
C. Storing large amounts of unstructured data in a NoSQL database.
D. Providing real-time analytics dashboards for business intelligence.
💡 Hint

Think about what Apache Spark and Hadoop are used for.
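To make option A concrete: a Spark job submitted to Dataproc boils down to a job spec sent to the Dataproc Jobs API. The sketch below mirrors the REST API's SparkJob field names, but the project, cluster, and jar values are placeholders taken from this page's own example, not real resources.

```python
# Minimal sketch of the request body the Dataproc Jobs API expects for a
# Spark job. Project/cluster values are hypothetical placeholders.
job_request = {
    "projectId": "my-project",  # hypothetical project
    "region": "us-central1",
    "job": {
        "placement": {"clusterName": "my-cluster"},
        "sparkJob": {
            "mainClass": "org.apache.spark.examples.SparkPi",
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "args": ["1000"],
        },
    },
}

# `gcloud dataproc jobs submit spark` builds roughly this structure from
# its flags before calling the Jobs API.
print(job_request["job"]["sparkJob"]["mainClass"])
```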

🖥️ Data Output · intermediate · 2:00
Output of a Spark job on Dataproc cluster

Given the following PySpark code running on a Dataproc cluster, what is the output?

data = ["apple", "banana", "apple", "orange", "banana", "apple"]
rdd = spark.sparkContext.parallelize(data)
counts = rdd.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).collect()
print(sorted(counts))
A. [('apple', 3), ('banana', 2), ('orange', 1)]
B. [('apple', 1), ('banana', 1), ('orange', 1)]
C. [('apple', 3), ('banana', 3), ('orange', 1)]
D. [('apple', 2), ('banana', 2), ('orange', 2)]
💡 Hint

Count how many times each fruit appears in the list.
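The map/reduceByKey pipeline above can be traced with plain Python, no cluster required: this sketch uses collections.Counter to mirror the per-key sum that reduceByKey performs, so you can verify the answer locally.

```python
from collections import Counter

data = ["apple", "banana", "apple", "orange", "banana", "apple"]

# rdd.map(lambda x: (x, 1)) emits one (word, 1) pair per element;
# reduceByKey(lambda a, b: a + b) then sums the 1s per key.
counts = Counter(data)

print(sorted(counts.items()))
# [('apple', 3), ('banana', 2), ('orange', 1)]
```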

🔧 Debug · advanced · 2:00
Identify the error in Dataproc Spark job submission

Which option describes the error when submitting a Spark job to Dataproc with this command?

gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 --class=org.apache.spark.examples.SparkPi --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
A. The argument '1000' should be passed before the --jars option.
B. The --class argument is missing the main class name.
C. The --jars argument uses a local file path that is not accessible to the cluster nodes.
D. The --region flag is not supported in gcloud dataproc jobs submit spark.
💡 Hint

Consider where the cluster nodes can access files specified in --jars.
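One way to catch this class of mistake before submitting is to inspect the URI scheme of each jar: `file://` paths resolve on each node's local filesystem, so the jar must already exist on every node, while `gs://` (Cloud Storage) URIs are readable from anywhere in the project. The helper below is purely illustrative, not part of gcloud.

```python
from urllib.parse import urlparse

def check_jar_uris(jar_uris):
    """Flag jar URIs that may not be reachable from all cluster nodes.

    Illustrative helper (not part of gcloud): file:// paths resolve on
    each node's local filesystem; gs:// URIs are fetched from Cloud
    Storage and work from every node.
    """
    warnings = []
    for uri in jar_uris:
        scheme = urlparse(uri).scheme
        if scheme in ("", "file"):
            warnings.append(
                f"{uri}: local path; only valid if the file exists on every node"
            )
    return warnings

print(check_jar_uris(["file:///usr/lib/spark/examples/jars/spark-examples.jar"]))
print(check_jar_uris(["gs://my-bucket/spark-examples.jar"]))  # hypothetical bucket
```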

🚀 Application · advanced · 2:00
Choosing the right Dataproc cluster configuration

You need to process a large batch of data with Apache Spark on Dataproc. The job requires high memory and CPU resources but runs only for a short time. Which cluster configuration is best?

A. Use a small cluster with preemptible worker nodes to save cost.
B. Use a large cluster with high-memory machine types and disable autoscaling.
C. Use a single-node cluster with high-CPU machine type.
D. Use a large cluster with standard worker nodes and enable autoscaling.
💡 Hint

Think about balancing cost, performance, and job duration.
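The cost side of this trade-off is back-of-the-envelope arithmetic. The hourly rates below are hypothetical placeholders, not real GCP pricing; the point is only that for a short-lived job, cheaper preemptible capacity limits both cost and exposure to preemption.

```python
# Back-of-the-envelope cost comparison for a short, resource-heavy batch job.
# HOURLY RATES ARE HYPOTHETICAL PLACEHOLDERS, not real GCP pricing.
STANDARD_RATE = 0.40     # $/hour per standard worker (assumed)
PREEMPTIBLE_RATE = 0.12  # $/hour per preemptible worker (assumed)

workers = 20
job_hours = 0.5  # short-lived job

standard_cost = workers * job_hours * STANDARD_RATE
preemptible_cost = workers * job_hours * PREEMPTIBLE_RATE

print(f"standard: ${standard_cost:.2f}, preemptible: ${preemptible_cost:.2f}")
# A short run limits the window in which preemption can hurt the job,
# which is why the cheaper nodes usually win for this workload shape.
```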

📊 Visualization · expert · 2:30
Interpreting Dataproc job monitoring dashboard

In the Dataproc job monitoring dashboard, you see a Spark job with a long GC (Garbage Collection) time and low CPU utilization. What does this indicate?

A. The job is CPU-bound and needs more CPU resources.
B. The job is spending too much time cleaning memory, indicating possible memory leaks or inefficient memory use.
C. The job is I/O-bound and waiting on disk operations.
D. The job is network-bound due to data shuffling between nodes.
💡 Hint

High GC time usually relates to memory management issues.
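High GC time relative to run time is easy to spot programmatically: Spark's monitoring REST API exposes per-task `jvmGcTime` and `executorRunTime` (both in milliseconds). The sketch below computes the GC fraction from those two numbers; the 10% threshold is a common rule of thumb, not a Spark default.

```python
def gc_pressure(executor_run_time_ms, jvm_gc_time_ms, threshold=0.10):
    """Return (gc_fraction, flagged) for one task or executor.

    Metric names mirror Spark's task metrics (jvmGcTime, executorRunTime);
    the 10% threshold is a rule of thumb, not a Spark default.
    """
    if executor_run_time_ms == 0:
        return 0.0, False
    frac = jvm_gc_time_ms / executor_run_time_ms
    return frac, frac > threshold

# An executor spending 30s of a 100s run in GC is under memory pressure:
frac, flagged = gc_pressure(100_000, 30_000)
print(f"GC fraction: {frac:.0%}, flagged: {flagged}")
# GC fraction: 30%, flagged: True
```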