Apache Spark · ~20 mins

SparkSession and SparkContext in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of SparkSession and SparkContext version check

What will be the output of the following Spark code snippet?

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TestApp').getOrCreate()
sc = spark.sparkContext
print(sc.version)
A. "2.4.0"
B. SyntaxError
C. None
D. "3.4.0"
💡 Hint

Check the version of SparkContext obtained from SparkSession.

Data Output
intermediate
Number of partitions in RDD created from SparkContext

Given the following code, what is the number of partitions in the RDD?

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PartitionTest').getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(10), 4)
print(rdd.getNumPartitions())
A. 1
B. 10
C. 4
D. 0
💡 Hint

Look at the second argument of parallelize, which sets the number of partitions.
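The slicing behind parallelize can be sketched in plain Python. This is a simplified illustration of how a range is divided into contiguous partitions, not Spark's actual implementation:

```python
def chunk(data, num_partitions):
    """Split a list into num_partitions contiguous slices.

    Illustrative only: mimics how parallelize distributes a local
    collection, where each partition i gets the slice from
    n*i//num_partitions to n*(i+1)//num_partitions.
    """
    n = len(data)
    return [
        data[n * i // num_partitions : n * (i + 1) // num_partitions]
        for i in range(num_partitions)
    ]

parts = chunk(list(range(10)), 4)
print(len(parts))  # 4 -- the requested partition count
print(parts)       # [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```

Note that the partition count is exactly what was requested, even though the slices are uneven: 10 elements cannot divide evenly into 4 partitions.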

🔧 Debug
advanced
Identify the error in SparkSession creation

What error will the following code produce?

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('App').getOrCreate
print(spark.sparkContext.appName)
A. AttributeError: 'function' object has no attribute 'sparkContext'
B. TypeError: getOrCreate() missing 1 required positional argument
C. NameError: name 'SparkSession' is not defined
D. No error, prints 'App'
💡 Hint

Check if getOrCreate is called correctly.
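The pitfall the hint points at can be reproduced without Spark, using a hypothetical stand-in builder class: referencing a method without parentheses yields the method object itself rather than its return value, so the subsequent attribute access raises AttributeError.

```python
class FakeBuilder:
    """Hypothetical stand-in for a builder; not part of PySpark."""
    def getOrCreate(self):
        return "a session object"

builder = FakeBuilder()

# Forgetting the parentheses returns the bound method, not a session.
obj = builder.getOrCreate
print(type(obj).__name__)  # method

# Attribute access on the method then fails, mirroring the quiz snippet.
try:
    obj.sparkContext
except AttributeError:
    print("AttributeError raised")
```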

🧠 Conceptual
advanced
Difference between SparkSession and SparkContext

Which statement correctly describes the difference between SparkSession and SparkContext?

A. SparkSession is the entry point for the DataFrame API, while SparkContext is the entry point for the RDD API.
B. SparkContext manages SQL queries, SparkSession manages cluster resources.
C. SparkSession is deprecated and replaced by SparkContext.
D. SparkContext is used only for streaming, SparkSession only for batch jobs.
💡 Hint

Think about the APIs each object supports.

🚀 Application
expert
Result of caching DataFrame and counting partitions

Consider the following code. What will be the output?

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('CacheTest').getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')], ['id', 'value'])
df_cached = df.cache()
count = df_cached.count()
partitions = df_cached.rdd.getNumPartitions()
print(count, partitions)
A. 0 1
B. 4 1
C. SyntaxError
D. 4 4
💡 Hint

Check the default number of partitions for a DataFrame created from local data.