val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val mapped = rdd.map(x => x * 2) // No action called yet
mapped.count()
Remember that transformations are lazy and actions trigger computation.
The map is a transformation and does not execute immediately. The count() is an action that triggers the computation and returns the number of elements, which is 4.
What is the result of the collect() action?
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val filtered = rdd.filter(_ % 2 == 0)
val mapped = filtered.map(_ * 10)
mapped.collect()
Filter keeps even numbers, then map multiplies by 10.
The filter keeps 2 and 4, then map multiplies each by 10, so collect() returns Array(20, 40).
val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.map(_ * 2)
Think about what triggers Spark jobs to run.
Transformations like map are lazy and do not execute until an action is called.
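Since the snippet above never calls an action, Spark builds the plan but runs nothing. A minimal sketch of what actually triggers execution, assuming a live SparkContext sc:

```scala
val rdd = sc.parallelize(Seq(1, 2, 3))
val doubled = rdd.map(_ * 2) // lazy: only records the transformation in the lineage
doubled.collect()            // action: triggers a job, returns Array(2, 4, 6)
```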
Recall which operations cause Spark to run jobs.
Transformations build a computation plan and are lazy. Actions trigger execution and return results.
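The split can be seen in a short pipeline, assuming a live SparkContext sc: filter and map only extend the lineage, while each action launches its own job.

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val evens = rdd.filter(_ % 2 == 0) // transformation: lazy, nothing runs yet
val bigger = evens.map(_ + 10)     // transformation: lazy, extends the plan
bigger.count()   // action: runs a job, returns 2
bigger.collect() // action: runs another job, returns Array(12, 14)
```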
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val mapped = rdd.map(_ * 2)
val filtered = mapped.filter(_ > 5)
val countResult = filtered.count()
val collectResult = filtered.collect()
Think about how caching affects repeated actions on the same RDD.
Caching the filtered RDD stores its partitions in memory after the first action computes them, so the second action reads the cached data instead of recomputing the map and filter transformations. Note that both actions still run as separate Spark jobs; caching reduces the recomputation inside the second job, not the number of jobs.
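Applied to the snippet above, one call to cache() before the first action is enough; a sketch assuming a live SparkContext sc:

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val filtered = rdd.map(_ * 2).filter(_ > 5).cache() // mark for in-memory storage
filtered.count()   // first action: runs map and filter, populates the cache
filtered.collect() // second action: reads cached partitions, returns Array(6, 8, 10)
```

cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); it is lazy itself and takes effect only when the first action materializes the RDD.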