What will be the output of the following Spark code snippet?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TestApp').getOrCreate()
sc = spark.sparkContext
print(sc.version)
Check the version of SparkContext obtained from SparkSession.
The SparkContext version reflects the Spark version installed. The code prints the version string, e.g., "3.4.0".
Given the following code, what is the number of partitions in the RDD?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PartitionTest').getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(10), 4)
print(rdd.getNumPartitions())
Look at the second argument of parallelize, which sets the number of partitions.
The parallelize method's second argument sets the number of partitions. Here it is 4.
What error will the following code produce?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('App').getOrCreate
print(spark.sparkContext.appName)
Check if getOrCreate is called correctly.
The code is missing the parentheses after getOrCreate, so spark is bound to the method itself rather than to a SparkSession. Accessing spark.sparkContext on that method object raises an AttributeError.
Which statement correctly describes the difference between SparkSession and SparkContext?
Think about the APIs each object supports.
SparkSession (introduced in Spark 2.0) is the unified entry point for the DataFrame and SQL APIs. SparkContext is the older, lower-level entry point for RDDs, still reachable via spark.sparkContext.
Consider the following code. What will be the output?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('CacheTest').getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')], ['id', 'value'])
df_cached = df.cache()
count = df_cached.count()
partitions = df_cached.rdd.getNumPartitions()
print(count, partitions)
Check default number of partitions for DataFrame created from local data.
Counting the four rows returns 4, and cache() does not change partitioning. The partition count, however, is not fixed at 1: a DataFrame built from a local collection is parallelized using spark.sparkContext.defaultParallelism (e.g. the number of cores in local mode), so the second number printed depends on the environment.