Google Dataproc is a managed service for running which type of workloads?
Think about what Apache Spark and Hadoop are used for.
Google Dataproc is designed to run big data processing workloads using Apache Spark and Hadoop clusters in a managed environment.
Given the following PySpark code running on a Dataproc cluster, what is the output?
data = ["apple", "banana", "apple", "orange", "banana", "apple"]
rdd = spark.sparkContext.parallelize(data)
counts = rdd.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b).collect()
print(sorted(counts))
Count how many times each fruit appears in the list.
The code counts occurrences of each fruit: 'apple' appears 3 times, 'banana' twice, and 'orange' once. Sorting the collected pairs gives the output [('apple', 3), ('banana', 2), ('orange', 1)].
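The map/reduceByKey pattern above is just a distributed word count; its result can be checked in plain Python with collections.Counter, which performs the same per-key summation locally:

```python
from collections import Counter

# Same input as the PySpark snippet.
data = ["apple", "banana", "apple", "orange", "banana", "apple"]

# Counter sums occurrences per key, equivalent to mapping each item to
# (item, 1) and reducing by key with addition.
counts = sorted(Counter(data).items())
print(counts)  # [('apple', 3), ('banana', 2), ('orange', 1)]
```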
Which option describes the error when submitting a Spark job to Dataproc with this command?
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000
Consider where the cluster nodes can access files specified in --jars.
The --jars option must point to a path that every cluster node can read. A file:/// URI refers to each node's local filesystem, so unless the jar is already present at that exact path on every node, the job fails with a file-not-found error. The usual fix is to stage the jar in Cloud Storage and reference it with a gs:// URI.
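A hedged sketch of the fix (the bucket name gs://my-bucket is hypothetical): copy the jar to Cloud Storage, then reference it with a gs:// URI so all cluster nodes can fetch it.

```shell
# Stage the jar in a Cloud Storage bucket (hypothetical bucket name).
gsutil cp /usr/lib/spark/examples/jars/spark-examples.jar gs://my-bucket/jars/

# Resubmit the job referencing the GCS path instead of a local file URI.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=gs://my-bucket/jars/spark-examples.jar \
  -- 1000
```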
You need to process a large batch of data with Apache Spark on Dataproc. The job requires high memory and CPU resources but runs only for a short time. Which cluster configuration is best?
Think about balancing cost, performance, and job duration.
A cluster of standard nodes sized for the job's memory and CPU needs, with autoscaling enabled, provides enough resources during the run and scales back down afterward, keeping costs low for a short-lived job.
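As a hedged illustration (cluster name, machine types, and policy name are all placeholders), such a cluster could be created with an autoscaling policy attached and an idle-deletion timer so it does not accrue cost after the short job finishes:

```shell
# Create a cluster with high-memory workers, an autoscaling policy
# (must already exist in the region), and automatic deletion when idle.
gcloud dataproc clusters create my-batch-cluster \
  --region=us-central1 \
  --master-machine-type=n2-highmem-4 \
  --worker-machine-type=n2-highmem-8 \
  --num-workers=2 \
  --autoscaling-policy=my-autoscaling-policy \
  --max-idle=30m
```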
In the Dataproc job monitoring dashboard, you see a Spark job with a long GC (Garbage Collection) time and low CPU utilization. What does this indicate?
High GC time usually relates to memory management issues.
Long garbage collection times combined with low CPU utilization indicate that executors are spending their time managing memory rather than doing useful work, typically because executor heaps are undersized for the data being processed, the job creates many large short-lived objects, or memory is leaking.
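When tuning such a job, a few standard Spark properties are the usual levers. The sketch below lists them as a plain dictionary suitable for passing via Dataproc's --properties flag; the values are illustrative assumptions, not recommendations:

```python
# Hedged sketch: Spark properties commonly adjusted when a job shows long
# GC pauses with low CPU utilization. Values here are examples only.
gc_tuning = {
    # More heap per executor reduces how often GC must run.
    "spark.executor.memory": "8g",
    # Fraction of heap reserved for execution and storage (Spark default: 0.6).
    "spark.memory.fraction": "0.6",
    # G1GC often shortens pause times on large heaps.
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
}

# Render as gcloud-style --properties arguments.
for key, value in gc_tuning.items():
    print(f"--properties spark:{key}={value}")
```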