Spark Pipeline Mastery
Challenge: 5 Problems
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual
Intermediate · 2:00
Why are Spark transformations lazy?
In Apache Spark, transformations like map() and filter() are called lazy. What is the main reason for this laziness?
💡 Hint
Think about how Spark plans work before running jobs.
✅ Explanation
Spark delays execution of transformations to build a pipeline of operations. This allows it to optimize the entire workflow before running any actual computation.
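The idea can be sketched without Spark at all: Python generators are similarly lazy, so chaining them (a stand-in for Spark's transformation DAG, not Spark's actual machinery) does no work until the result is consumed, just as transformations do nothing until an action runs.

```python
# Hypothetical stand-in for Spark's lazy pipeline using Python generators.
log = []

def numbers():
    for x in [1, 2, 3, 4, 5]:
        log.append(x)          # record when each element is actually processed
        yield x

# "Transformations": nothing runs yet, just like rdd.filter(...).map(...).
pipeline = (x * 10 for x in numbers() if x % 2 == 0)
assert log == []               # no computation has happened yet

# "Action": consuming the generator triggers the whole chain.
result = list(pipeline)
assert result == [20, 40]
assert log == [1, 2, 3, 4, 5]  # the work happened only now
```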
❓ Predict Output
Intermediate · 2:00
Output of chained transformations in Spark
Given the following Spark code, what is the output when result.collect() is called?
Apache Spark
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
💡 Hint
Filter keeps even numbers, then map multiplies by 10.
✅ Explanation
The filter keeps only the even numbers [2, 4]; map then multiplies each by 10, giving [20, 40].
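The arithmetic can be checked with plain Python built-ins (no Spark cluster needed), since `filter` and `map` here behave the same element-wise:

```python
data = [1, 2, 3, 4, 5]

# Same chain as the Spark snippet, with Python built-ins:
evens = filter(lambda x: x % 2 == 0, data)    # keeps 2 and 4
result = list(map(lambda x: x * 10, evens))   # multiplies each by 10
print(result)  # [20, 40]
```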
❓ Predict Output
Advanced · 2:00
Number of stages in a Spark pipeline
Consider this Spark code snippet. How many stages will Spark create when result.count() is called?
Apache Spark
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x + 1).filter(lambda x: x > 3).map(lambda x: x * 2)
💡 Hint
All transformations are narrow dependencies.
✅ Explanation
Since map and filter are both narrow transformations, Spark pipelines them into a single stage: each element flows through the whole chain in one pass, with no shuffle.
🔧 Debug
Advanced · 2:00
Why does this Spark job run slowly?
This Spark code runs slowly. What is the main reason?
Apache Spark
rdd = sc.textFile('data.txt')
words = rdd.flatMap(lambda line: line.split())
words_filtered = words.filter(lambda w: len(w) > 3)
words_filtered.cache()
count = words_filtered.count()
print(count)
💡 Hint
Caching happens only after an action triggers computation.
✅ Explanation
Calling cache() only marks the RDD for caching; the data is actually materialized when an action such as count() first computes it. Caching pays off when later actions reuse the same RDD; if only one action ever runs, cache() just adds overhead.
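The semantics can be modeled with a simple memoizing wrapper (an illustrative analogy, not Spark's storage layer): marking is free, the first "action" materializes the result, and only repeated actions avoid recomputation.

```python
# Sketch of cache() semantics: marking is cheap, materialization happens
# on the first action, and only repeated actions benefit.
compute_calls = 0

def expensive_transform(data):
    global compute_calls
    compute_calls += 1                      # count real computations
    return [w for w in data if len(w) > 3]

cache = {}  # stands in for Spark's block store

def count_action(data):
    # The first action computes and fills the cache (like the first count()).
    if 'filtered' not in cache:
        cache['filtered'] = expensive_transform(data)
    return len(cache['filtered'])

words = ['spark', 'is', 'lazy', 'fast']
count_action(words)          # triggers the computation
count_action(words)          # served from the cache
assert compute_calls == 1    # the transform ran once despite two actions
```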
🚀 Application
Expert · 3:00
Optimizing a Spark pipeline with wide and narrow dependencies
You have this Spark pipeline. Which option correctly describes how Spark will execute it?
Apache Spark
rdd = sc.parallelize(range(10))
step1 = rdd.map(lambda x: x + 1)
step2 = step1.groupBy(lambda x: x % 3)
step3 = step2.mapValues(sum)
result = step3.collect()
💡 Hint
Wide dependencies like groupBy cause shuffle and new stages.
✅ Explanation
The map is a narrow transformation and runs in the first stage. The groupBy causes a shuffle (wide dependency), so Spark starts a new stage at that boundary. mapValues is narrow and runs in the post-shuffle stage, on the grouped output.
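The values the pipeline produces can be verified with a plain-Python equivalent of the shuffle (bucketing by key stands in for Spark's repartitioning, it is not how Spark moves data across the network):

```python
from collections import defaultdict

# Stage 1 (narrow map): 0..9 becomes 1..10.
step1 = [x + 1 for x in range(10)]

# The "shuffle": bucket every value under its key, as groupBy(x % 3) would.
groups = defaultdict(list)
for x in step1:
    groups[x % 3].append(x)

# Stage 2 (narrow mapValues on the shuffled output): sum each group.
result = {k: sum(v) for k, v in groups.items()}
print(sorted(result.items()))  # [(0, 18), (1, 22), (2, 15)]
```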