Which statement best describes lazy evaluation in Apache Spark?
Think about when Spark actually runs the computations.
Lazy evaluation means Spark waits to run transformations until an action triggers execution.
What will be the output of the following Spark code snippet?
rdd = sc.parallelize([1, 2, 3, 4])
mapped_rdd = rdd.map(lambda x: x * 2)
print(mapped_rdd.collect())
Consider what triggers execution in Spark.
The collect() action triggers execution, so the doubled values [2, 4, 6, 8] are returned.
Given the following Spark code, how many Spark jobs will be triggered?
rdd = sc.parallelize([1, 2, 3, 4])
mapped = rdd.map(lambda x: x + 1)
filtered = mapped.filter(lambda x: x % 2 == 0)
count = filtered.count()
collected = filtered.collect()
Each action triggers a job. How many actions are there?
There are two actions: count() and collect(), so two jobs run.
Why does the following Spark code produce no output?
rdd = sc.parallelize([10, 20, 30])
rdd.map(lambda x: x * 3)
Think about what triggers Spark to run transformations.
Transformations like map() are lazy: Spark only records the lineage, and nothing executes until an action such as collect() is called. This snippet never calls an action, so no output is produced.
You have a Spark job with multiple transformations and two actions on the same RDD. How can you optimize to avoid running the same transformations twice?
Think about how Spark can reuse data between actions.
Caching or persisting stores the RDD in memory or on disk, so its transformations are computed only once and the result is reused by both actions instead of being recomputed.