Spark Pipeline Mastery
Challenge: 5 Problems
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual
Intermediate · 2:00
Why are Spark transformations lazy?
In Apache Spark, transformations like map() and filter() are called lazy. What is the main reason for this laziness?
💡 Hint
Think about how Spark plans work before running jobs.
✅ Explanation
Spark delays execution of transformations to build a pipeline of operations. This allows it to optimize the entire workflow before running any actual computation.
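The idea can be sketched without Spark at all: Python generators are similarly lazy, so chaining them (a stand-in for Spark's transformation DAG, not Spark's actual machinery) does no work until the result is consumed, just as transformations do nothing until an action runs.

```python
# Hypothetical stand-in for Spark's lazy pipeline using Python generators.
log = []

def numbers():
    for x in [1, 2, 3, 4, 5]:
        log.append(x)          # record when each element is actually processed
        yield x

# "Transformations": nothing runs yet, just like rdd.filter(...).map(...).
pipeline = (x * 10 for x in numbers() if x % 2 == 0)
assert log == []               # no computation has happened yet

# "Action": consuming the generator triggers the whole chain.
result = list(pipeline)
assert result == [20, 40]
assert log == [1, 2, 3, 4, 5]  # the work happened only now
```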
❓ Predict Output
Intermediate · 2:00
Output of chained transformations in Spark
Given the following Spark code, what is the output when result.collect() is called?
Apache Spark
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
💡 Hint
Filter keeps even numbers, then map multiplies by 10.
✅ Explanation
The filter keeps only the even numbers [2, 4]; map then multiplies each by 10, giving [20, 40].
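The arithmetic can be checked with plain Python built-ins (no Spark cluster needed), since `filter` and `map` here behave the same element-wise:

```python
data = [1, 2, 3, 4, 5]

# Same chain as the Spark snippet, with Python built-ins:
evens = filter(lambda x: x % 2 == 0, data)    # keeps 2 and 4
result = list(map(lambda x: x * 10, evens))   # multiplies each by 10
print(result)  # [20, 40]
```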
❓ Predict Output
Advanced · 2:00
Number of stages in a Spark pipeline
Consider this Spark code snippet. How many stages will Spark create when result.count() is called?
Apache Spark
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x + 1).filter(lambda x: x > 3).map(lambda x: x * 2)
💡 Hint
All transformations are narrow dependencies.
✅ Explanation
Since map and filter are both narrow transformations, Spark pipelines them into a single stage: each element flows through the whole chain in one pass, with no shuffle.
🔧 Debug
Advanced · 2:00
Why does this Spark job run slowly?
This Spark code runs slowly. What is the main reason?
Apache Spark
rdd = sc.textFile('data.txt')
words = rdd.flatMap(lambda line: line.split())
words_filtered = words.filter(lambda w: len(w) > 3)
words_filtered.cache()
count = words_filtered.count()
print(count)
💡 Hint
Caching happens only after an action triggers computation.
✅ Explanation
Calling cache() only marks the RDD for caching; the data is actually materialized when an action such as count() first computes it. Caching pays off when later actions reuse the same RDD; if only one action ever runs, cache() just adds overhead.
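The semantics can be modeled with a simple memoizing wrapper (an illustrative analogy, not Spark's storage layer): marking is free, the first "action" materializes the result, and only repeated actions avoid recomputation.

```python
# Sketch of cache() semantics: marking is cheap, materialization happens
# on the first action, and only repeated actions benefit.
compute_calls = 0

def expensive_transform(data):
    global compute_calls
    compute_calls += 1                      # count real computations
    return [w for w in data if len(w) > 3]

cache = {}  # stands in for Spark's block store

def count_action(data):
    # The first action computes and fills the cache (like the first count()).
    if 'filtered' not in cache:
        cache['filtered'] = expensive_transform(data)
    return len(cache['filtered'])

words = ['spark', 'is', 'lazy', 'fast']
count_action(words)          # triggers the computation
count_action(words)          # served from the cache
assert compute_calls == 1    # the transform ran once despite two actions
```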
🚀 Application
Expert · 3:00
Optimizing a Spark pipeline with wide and narrow dependencies
You have this Spark pipeline. Which option correctly describes how Spark will execute it?
Apache Spark
rdd = sc.parallelize(range(10))
step1 = rdd.map(lambda x: x + 1)
step2 = step1.groupBy(lambda x: x % 3)
step3 = step2.mapValues(sum)
result = step3.collect()
💡 Hint
Wide dependencies like groupBy cause shuffle and new stages.
✅ Explanation
The map is a narrow transformation and runs in the first stage. The groupBy causes a shuffle (wide dependency), so Spark starts a new stage at that boundary. mapValues is narrow and runs in the post-shuffle stage, on the grouped output.
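The values the pipeline produces can be verified with a plain-Python equivalent of the shuffle (bucketing by key stands in for Spark's repartitioning, it is not how Spark moves data across the network):

```python
from collections import defaultdict

# Stage 1 (narrow map): 0..9 becomes 1..10.
step1 = [x + 1 for x in range(10)]

# The "shuffle": bucket every value under its key, as groupBy(x % 3) would.
groups = defaultdict(list)
for x in step1:
    groups[x % 3].append(x)

# Stage 2 (narrow mapValues on the shuffled output): sum each group.
result = {k: sum(v) for k, v in groups.items()}
print(sorted(result.items()))  # [(0, 18), (1, 22), (2, 15)]
```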