Which statement correctly describes the difference between Spark's local mode and cluster mode?
Think about how many machines and JVMs are involved in each mode.
Local mode runs Spark on one machine using a single JVM, suitable for testing or small data. Cluster mode distributes tasks across multiple machines, enabling large-scale data processing.
What will be the output of the following Spark configuration code snippet?
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[4]').appName('TestApp').getOrCreate()
print(spark.sparkContext.master)
Check the master URL string passed to the builder.
The master URL 'local[4]' tells Spark to run locally with 4 worker threads. spark.sparkContext.master returns the URL exactly as it was passed to the builder, so the print statement outputs local[4].
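As an aside, the thread count can be read back out of a local[...] master URL. The function below is a pure-Python sketch covering the documented forms (local, local[N], local[*]); it is an illustration, not Spark's own parser.

```python
import os
import re

def local_thread_count(master):
    """Sketch: worker threads implied by a local master URL.

    Handles 'local', 'local[N]', and 'local[*]'; returns None for
    anything else. Not Spark's actual parsing logic.
    """
    m = re.fullmatch(r"local(?:\[(\*|\d+)\])?", master)
    if m is None:
        return None            # not a local master URL
    spec = m.group(1)
    if spec is None:
        return 1               # bare 'local' means one worker thread
    if spec == "*":
        return os.cpu_count()  # 'local[*]' uses one thread per core
    return int(spec)

print(local_thread_count("local[4]"))  # 4
```

So for the snippet above, 'local[4]' is both the string that gets printed and a request for 4 worker threads.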
Consider this Spark code that creates an RDD and counts partitions:
rdd = spark.sparkContext.parallelize(range(10), 3)
print(rdd.getNumPartitions())
What will be the output when running in local mode and cluster mode respectively?
The number of partitions is set explicitly in the parallelize call.
The number of partitions is controlled by the second argument to parallelize. It is 3 regardless of execution mode.
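To make this concrete, the split can be sketched in pure Python: when parallelizing a range of n elements into num_slices partitions, partition i covers roughly the slice from i*n//num_slices to (i+1)*n//num_slices. This mirrors Spark's slicing rule for illustration only; it is not Spark's code.

```python
def slice_range(n, num_slices):
    # Sketch: split a range of n elements into num_slices partitions,
    # where partition i covers [i*n//num_slices, (i+1)*n//num_slices).
    return [list(range(i * n // num_slices, (i + 1) * n // num_slices))
            for i in range(num_slices)]

parts = slice_range(10, 3)
print(len(parts))               # 3 -- the requested count, independent of mode
print([len(p) for p in parts])  # [3, 3, 4]
```

Whether the job runs in local mode or on a cluster only changes where those 3 partitions are processed, not how many there are.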
A user submits a Spark job with master URL 'local' but expects it to run on a cluster. The job fails with connection errors. What is the most likely cause?
Check what the 'local' master URL means for Spark execution.
The 'local' master URL tells Spark to run locally on one machine. It does not connect to any cluster, so connection errors occur if cluster resources are expected.
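One way to see the distinction is a small classifier over master URL schemes. This is a hedged pure-Python sketch, with an abridged list of the cluster schemes Spark documents (standalone, YARN, Kubernetes, Mesos); it is not Spark's own validation logic.

```python
def targets_cluster(master):
    # Sketch: does this master URL ask Spark to contact a cluster manager?
    # 'local' and 'local[...]' run everything in one JVM on this machine,
    # so no connection to external resources is ever attempted.
    if master == "local" or master.startswith("local["):
        return False
    # Abridged cluster schemes from the Spark docs.
    return master.startswith(("spark://", "yarn", "k8s://", "mesos://"))

print(targets_cluster("local"))              # False -> nothing to connect to
print(targets_cluster("spark://host:7077"))  # True
```

With master 'local', Spark never contacts a cluster manager, so the fix is to pass a cluster master URL (for example a standalone spark:// URL or yarn) when submitting the job.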
You have a dataset of 5 TB stored on HDFS and want to run a Spark job to analyze it. Which execution mode should you choose and why?
Think about the size of data and how Spark handles distributed processing.
Cluster mode is designed to process large datasets by distributing tasks across many machines. Local mode runs on one machine and is not suitable for very large data.