Apache Spark · ~20 mins

Google Dataproc overview in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️
Dataproc Mastery Badge
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual · intermediate · 1:30
What is the primary purpose of Google Dataproc?

Google Dataproc is a managed service for running which type of workloads?

A. Running big data processing jobs using Apache Spark and Hadoop clusters.
B. Hosting web applications with automatic scaling.
C. Storing large amounts of unstructured data in a NoSQL database.
D. Providing real-time analytics dashboards for business intelligence.
💡 Hint

Think about what Apache Spark and Hadoop are used for.
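To make option A concrete: a Spark job submitted to Dataproc boils down to a job spec sent to the Dataproc Jobs API. The sketch below mirrors the REST API's SparkJob field names, but the project, cluster, and jar values are placeholders taken from this page's own example, not real resources.

```python
# Minimal sketch of the request body the Dataproc Jobs API expects for a
# Spark job. Project/cluster values are hypothetical placeholders.
job_request = {
    "projectId": "my-project",  # hypothetical project
    "region": "us-central1",
    "job": {
        "placement": {"clusterName": "my-cluster"},
        "sparkJob": {
            "mainClass": "org.apache.spark.examples.SparkPi",
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "args": ["1000"],
        },
    },
}

# `gcloud dataproc jobs submit spark` builds roughly this structure from
# its flags before calling the Jobs API.
print(job_request["job"]["sparkJob"]["mainClass"])
```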

🖥️ Data Output · intermediate · 2:00
Output of a Spark job on Dataproc cluster

Given the following PySpark code running on a Dataproc cluster, what is the output?

data = ["apple", "banana", "apple", "orange", "banana", "apple"]
rdd = spark.sparkContext.parallelize(data)
counts = rdd.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).collect()
print(sorted(counts))
A. [('apple', 3), ('banana', 2), ('orange', 1)]
B. [('apple', 1), ('banana', 1), ('orange', 1)]
C. [('apple', 3), ('banana', 3), ('orange', 1)]
D. [('apple', 2), ('banana', 2), ('orange', 2)]
💡 Hint

Count how many times each fruit appears in the list.
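The map/reduceByKey pipeline above can be traced with plain Python, no cluster required: this sketch uses collections.Counter to mirror the per-key sum that reduceByKey performs, so you can verify the answer locally.

```python
from collections import Counter

data = ["apple", "banana", "apple", "orange", "banana", "apple"]

# rdd.map(lambda x: (x, 1)) emits one (word, 1) pair per element;
# reduceByKey(lambda a, b: a + b) then sums the 1s per key.
counts = Counter(data)

print(sorted(counts.items()))
# [('apple', 3), ('banana', 2), ('orange', 1)]
```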

🔧 Debug · advanced · 2:00
Identify the error in Dataproc Spark job submission

Which option describes the error when submitting a Spark job to Dataproc with this command?

gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 --class=org.apache.spark.examples.SparkPi --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
A. The argument '1000' should be passed before the --jars option.
B. The --class argument is missing the main class name.
C. The --jars argument uses a local file path that is not accessible to the cluster nodes.
D. The --region flag is not supported in gcloud dataproc jobs submit spark.
💡 Hint

Consider where the cluster nodes can access files specified in --jars.
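One way to catch this class of mistake before submitting is to inspect the URI scheme of each jar: `file://` paths resolve on each node's local filesystem, so the jar must already exist on every node, while `gs://` (Cloud Storage) URIs are readable from anywhere in the project. The helper below is purely illustrative, not part of gcloud.

```python
from urllib.parse import urlparse

def check_jar_uris(jar_uris):
    """Flag jar URIs that may not be reachable from all cluster nodes.

    Illustrative helper (not part of gcloud): file:// paths resolve on
    each node's local filesystem; gs:// URIs are fetched from Cloud
    Storage and work from every node.
    """
    warnings = []
    for uri in jar_uris:
        scheme = urlparse(uri).scheme
        if scheme in ("", "file"):
            warnings.append(
                f"{uri}: local path; only valid if the file exists on every node"
            )
    return warnings

print(check_jar_uris(["file:///usr/lib/spark/examples/jars/spark-examples.jar"]))
print(check_jar_uris(["gs://my-bucket/spark-examples.jar"]))  # hypothetical bucket
```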

🚀 Application · advanced · 2:00
Choosing the right Dataproc cluster configuration

You need to process a large batch of data with Apache Spark on Dataproc. The job requires high memory and CPU resources but runs only for a short time. Which cluster configuration is best?

A. Use a small cluster with preemptible worker nodes to save cost.
B. Use a large cluster with high-memory machine types and disable autoscaling.
C. Use a single-node cluster with high-CPU machine type.
D. Use a large cluster with standard worker nodes and enable autoscaling.
💡 Hint

Think about balancing cost, performance, and job duration.
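The cost side of this trade-off is back-of-the-envelope arithmetic. The hourly rates below are hypothetical placeholders, not real GCP pricing; the point is only that for a short-lived job, cheaper preemptible capacity limits both cost and exposure to preemption.

```python
# Back-of-the-envelope cost comparison for a short, resource-heavy batch job.
# HOURLY RATES ARE HYPOTHETICAL PLACEHOLDERS, not real GCP pricing.
STANDARD_RATE = 0.40     # $/hour per standard worker (assumed)
PREEMPTIBLE_RATE = 0.12  # $/hour per preemptible worker (assumed)

workers = 20
job_hours = 0.5  # short-lived job

standard_cost = workers * job_hours * STANDARD_RATE
preemptible_cost = workers * job_hours * PREEMPTIBLE_RATE

print(f"standard: ${standard_cost:.2f}, preemptible: ${preemptible_cost:.2f}")
# A short run limits the window in which preemption can hurt the job,
# which is why the cheaper nodes usually win for this workload shape.
```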

📊 Visualization · expert · 2:30
Interpreting Dataproc job monitoring dashboard

In the Dataproc job monitoring dashboard, you see a Spark job with a long GC (Garbage Collection) time and low CPU utilization. What does this indicate?

A. The job is CPU-bound and needs more CPU resources.
B. The job is spending too much time cleaning memory, indicating possible memory leaks or inefficient memory use.
C. The job is I/O-bound and waiting on disk operations.
D. The job is network-bound due to data shuffling between nodes.
💡 Hint

High GC time usually relates to memory management issues.
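High GC time relative to run time is easy to spot programmatically: Spark's monitoring REST API exposes per-task `jvmGcTime` and `executorRunTime` (both in milliseconds). The sketch below computes the GC fraction from those two numbers; the 10% threshold is a common rule of thumb, not a Spark default.

```python
def gc_pressure(executor_run_time_ms, jvm_gc_time_ms, threshold=0.10):
    """Return (gc_fraction, flagged) for one task or executor.

    Metric names mirror Spark's task metrics (jvmGcTime, executorRunTime);
    the 10% threshold is a rule of thumb, not a Spark default.
    """
    if executor_run_time_ms == 0:
        return 0.0, False
    frac = jvm_gc_time_ms / executor_run_time_ms
    return frac, frac > threshold

# An executor spending 30s of a 100s run in GC is under memory pressure:
frac, flagged = gc_pressure(100_000, 30_000)
print(f"GC fraction: {frac:.0%}, flagged: {flagged}")
# GC fraction: 30%, flagged: True
```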