Apache Spark · data · ~20 min

Why transformations build processing pipelines in Apache Spark - Challenge Your Understanding

Challenge: 5 Problems
🧠 Conceptual · intermediate
Why are Spark transformations lazy?
In Apache Spark, transformations like map() and filter() are called lazy. What is the main reason for this laziness?
A. To build a processing pipeline that optimizes execution before running any computation
B. To immediately execute each transformation and store intermediate results
C. To prevent any data from being processed at all
D. To automatically cache all data in memory after each transformation
💡 Hint
Think about how Spark plans work before running jobs.
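The same idea can be seen without a Spark cluster. Below is a hedged plain-Python sketch (not Spark code): a generator expression, like a Spark transformation, records work without performing it, and nothing runs until something consumes the result — the analogue of an action.

```python
# Plain-Python analogue of lazy transformations (no Spark required).
log = []

def tracked(x):
    log.append(x)       # side effect: records that x was actually processed
    return x * 2

nums = [1, 2, 3]
lazy = (tracked(x) for x in nums)   # "transformation": nothing runs yet
assert log == []                    # no element has been processed

result = list(lazy)                 # "action": triggers the whole pipeline
assert result == [2, 4, 6]
assert log == [1, 2, 3]             # only now has every element been touched
```

Because the full chain is known before anything executes, an engine like Spark can inspect and optimize the plan first — which is the point of the question.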
Predict Output · intermediate
Output of chained transformations in Spark
Given the following Spark code, what is the output when result.collect() is called?
PySpark:
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
A. [2, 4]
B. [20, 40]
C. [10, 20, 30, 40, 50]
D. []
💡 Hint
Filter keeps even numbers, then map multiplies by 10.
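To check the hint's reasoning, the same filter-then-map chain can be traced in plain Python (a sketch, not Spark — built-in filter and map stand in for the RDD transformations):

```python
data = [1, 2, 3, 4, 5]
evens = filter(lambda x: x % 2 == 0, data)    # keeps 2 and 4
result = list(map(lambda x: x * 10, evens))   # 2*10 and 4*10
assert result == [20, 40]
```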
Predict Output · advanced
Number of stages in a Spark pipeline
Consider this Spark code snippet. How many stages will Spark create when result.count() is called?
PySpark:
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x + 1).filter(lambda x: x > 3).map(lambda x: x * 2)
A. 2
B. 4
C. 3
D. 1
💡 Hint
All transformations are narrow dependencies.
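The hint can be made concrete with a hedged plain-Python sketch: because map and filter are narrow dependencies, Spark pipelines them into a single pass over each partition rather than materializing intermediate collections. One pass corresponds to one stage:

```python
# Sketch of how narrow transformations fuse into one pass (one "stage").
partition = [1, 2, 3, 4, 5]
out = []
for x in partition:          # a single traversal of the data
    y = x + 1                # map(lambda x: x + 1)
    if y > 3:                # filter(lambda x: x > 3)
        out.append(y * 2)    # map(lambda x: x * 2)
count = len(out)
assert out == [8, 10, 12]
assert count == 3
```

No shuffle boundary appears anywhere in the chain, so no extra stage is needed.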
🔧 Debug · advanced
Why does this Spark job run slowly?
This Spark code runs slowly. What is the main reason?
PySpark:
rdd = sc.textFile('data.txt')
words = rdd.flatMap(lambda line: line.split())
words_filtered = words.filter(lambda w: len(w) > 3)
words_filtered.cache()
count = words_filtered.count()
print(count)
A. The cache() is called but no action triggers caching before count()
B. The count() action is missing, so no computation happens
C. The RDD is too small to benefit from caching
D. The filter transformation is executed before flatMap causing errors
💡 Hint
Caching happens only after an action triggers computation.
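The hint's point — that marking data for caching does no work by itself — has a simple plain-Python analogue (a sketch, not Spark's cache implementation): the first access both computes and stores, so it pays the full cost; only later accesses are cheap.

```python
# Lazy caching sketch: declaring a cache entry costs nothing;
# the first "action" computes AND fills the cache.
cache = {}
calls = []

def expensive():
    calls.append(1)   # records each real computation
    return 42

def cached_compute(key, fn):
    if key not in cache:
        cache[key] = fn()   # first access: slow path, computes and stores
    return cache[key]

first = cached_compute('words', expensive)    # computes (the slow count())
second = cached_compute('words', expensive)   # served from cache
assert first == second == 42
assert len(calls) == 1                        # only one real computation
```

Likewise in Spark, the first count() after cache() still reads and processes everything; only subsequent actions benefit.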
🚀 Application · expert
Optimizing a Spark pipeline with wide and narrow dependencies
You have this Spark pipeline. Which option correctly describes how Spark will execute it?
PySpark:
rdd = sc.parallelize(range(10))
step1 = rdd.map(lambda x: x + 1)
step2 = step1.groupBy(lambda x: x % 3)
step3 = step2.mapValues(sum)
result = step3.collect()
A. Spark will execute all transformations immediately without stages
B. Spark will create one stage combining map and groupBy
C. Spark will create two stages: one for map (narrow) and one for groupBy (wide) with shuffle
D. Spark will create three stages: map, groupBy, and mapValues separately
💡 Hint
Wide dependencies like groupBy cause shuffle and new stages.
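The shuffle boundary in the hint can be traced with a hedged plain-Python sketch (not Spark code): the narrow map runs as one pass, then the groupBy forces a redistribution of records by key — the analogue of a shuffle that starts a new stage — after which mapValues(sum) runs within the new grouping.

```python
from collections import defaultdict

data = list(range(10))
step1 = [x + 1 for x in data]            # stage 1: narrow map, one pass

groups = defaultdict(list)               # "shuffle": records move to their key
for x in step1:
    groups[x % 3].append(x)

step3 = {k: sum(v) for k, v in groups.items()}  # stage 2: aggregate per key
assert step3 == {0: 18, 1: 22, 2: 15}    # e.g. key 0 collects [3, 6, 9]
```

Everything before the groupBy pipelines into one stage; everything after it forms the second — which is why the wide dependency, not each transformation, determines the stage count.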