Apache Spark · ~20 mins

Creating RDDs from collections and files in Apache Spark - Practice Exercises

Challenge - 5 Problems
Predict Output (intermediate)
Output of RDD count after filtering
What is the output of the following Spark code snippet?
Apache Spark
val sc = spark.sparkContext
val data = List(1, 2, 3, 4, 5, 6)
val rdd = sc.parallelize(data)
val filteredRDD = rdd.filter(x => x % 2 == 0)
val count = filteredRDD.count()
println(count)
A. 4
B. 6
C. 2
D. 3
💡 Hint: Count how many numbers in the list are even.
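To check the prediction without a cluster, the same filter can be run on the plain Scala collection; this is a sketch of the logic the RDD version distributes:

```scala
val data = List(1, 2, 3, 4, 5, 6)
// filter keeps elements where the predicate is true; 2, 4 and 6 are even
val evens = data.filter(x => x % 2 == 0)
println(evens.size) // prints 3, matching filteredRDD.count()
```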
Data Output (intermediate)
Result of reading a text file into an RDD
Given a text file with 4 lines, what will be the number of elements in the RDD after reading it with sc.textFile?
Apache Spark
val rdd = sc.textFile("/path/to/file.txt")
rdd.count()
A. 1
B. 4
C. 0
D. Depends on file size
💡 Hint: Each line in the file becomes one element in the RDD.
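As a sketch (assuming `sc` is an active SparkContext and the file at that path holds exactly 4 lines), the count reflects the number of lines, not the file's size in bytes:

```scala
// textFile produces an RDD[String] with one element per line of the file
val rdd = sc.textFile("/path/to/file.txt")
// With a 4-line file, count() returns 4 regardless of how many bytes each line holds
println(rdd.count())
```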
🔧 Debug (advanced)
Identify the error in RDD creation from a collection
What error will this Spark code produce?
Apache Spark
val data = 1 to 5
val rdd = sc.parallelize(data.toString())
rdd.collect().foreach(println)
A. The RDD contains the characters of the string 'Range(1,2,3,4,5)'
B. Compilation error: toString cannot be used here
C. Runtime error: Unsupported operation on Range
D. The RDD contains the numbers 1 to 5 as expected
💡 Hint: Check what data.toString returns and what parallelize expects.
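A sketch of the bug and its fix, assuming `sc` is an active SparkContext (note the exact `Range` string varies by Scala version; Scala 2.13 prints "Range 1 to 5" rather than "Range(1, 2, 3, 4, 5)"):

```scala
val data = 1 to 5
// data.toString yields a String such as "Range(1, 2, 3, 4, 5)".
// parallelize takes a Seq[T], and a String is implicitly a Seq[Char],
// so this compiles but the RDD holds individual characters, not numbers.
val wrong = sc.parallelize(data.toString) // RDD[Char]
// Fix: pass the collection itself
val fixed = sc.parallelize(data)          // RDD[Int] with elements 1 to 5
```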
🧠 Conceptual (advanced)
Difference between sc.parallelize and sc.textFile
Which statement correctly describes the difference between sc.parallelize and sc.textFile?
A. Both create RDDs from collections, but sc.textFile supports filtering
B. sc.parallelize reads data from a file; sc.textFile creates an RDD from a collection
C. sc.parallelize creates an RDD from an existing collection in memory; sc.textFile reads data from a file into an RDD
D. Both create RDDs from files, but sc.parallelize is faster
💡 Hint: Think about the source of data for each method.
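The two creation paths can be sketched side by side (assuming `sc` is an active SparkContext; the file path is illustrative):

```scala
// From an in-memory collection: the driver already holds the data
val fromMemory = sc.parallelize(Seq(1, 2, 3))   // RDD[Int]

// From external storage: the data lives in a file, read lazily on demand
val fromFile = sc.textFile("/path/to/file.txt") // RDD[String], one element per line
```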
Visualization (expert)
Visualizing partitions of an RDD created from a file
You create an RDD from a text file with 100 lines using sc.textFile with 4 partitions. Which visualization best represents the distribution of lines across partitions?
A. A bar chart with 4 bars, each showing approximately 25 lines
B. A pie chart with 100 equal slices, one per line
C. A line chart showing the number of lines per partition increasing from 1 to 4
D. A scatter plot with 100 randomly scattered points
💡 Hint: Partitions try to split data evenly.
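The per-partition distribution can be inspected directly with glom(), which collects each partition into an array (a sketch assuming `sc` is active and the file holds 100 lines):

```scala
val rdd = sc.textFile("/path/to/file.txt", minPartitions = 4)
// glom() turns each partition into an Array, exposing per-partition line counts
val linesPerPartition = rdd.glom().map(_.length).collect()
// Typically close to Array(25, 25, 25, 25); exact sizes depend on byte-based
// input splits, so partitions are only approximately equal.
println(linesPerPartition.mkString(", "))
```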