What will be the output of the following Spark code snippet?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TestApp').getOrCreate()
sc = spark.sparkContext
print(sc.version)
Check the version of SparkContext obtained from SparkSession.
The SparkContext version reflects the Spark version installed. The code prints the version string, e.g., "3.4.0".
Given the following code, what is the number of partitions in the RDD?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PartitionTest').getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(10), 4)
print(rdd.getNumPartitions())
Look at the second argument of parallelize, which sets the number of partitions.
The parallelize method's second argument sets the number of partitions. Here it is 4.
What error will the following code produce?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('App').getOrCreate
print(spark.sparkContext.appName)
Check if getOrCreate is called correctly.
The code is missing the parentheses after getOrCreate, so spark is bound to the method itself rather than to a SparkSession. Accessing spark.sparkContext on that method object raises an AttributeError.
Which statement correctly describes the difference between SparkSession and SparkContext?
Think about the APIs each object supports.
SparkSession (introduced in Spark 2.0) is the unified entry point for the DataFrame and SQL APIs. SparkContext is the older, lower-level entry point for RDDs, still reachable via spark.sparkContext.
Consider the following code. What will be the output?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('CacheTest').getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')], ['id', 'value'])
df_cached = df.cache()
count = df_cached.count()
partitions = df_cached.rdd.getNumPartitions()
print(count, partitions)
Check default number of partitions for DataFrame created from local data.
Counting the four rows returns 4, and cache() does not change partitioning. The partition count, however, is not fixed at 1: a DataFrame built from a local collection is parallelized using spark.sparkContext.defaultParallelism (e.g. the number of cores in local mode), so the second number printed depends on the environment.