
What is Apache Spark - Practice Questions & Exercises

Challenge: 5 Problems
🧠 Conceptual · intermediate
Understanding Apache Spark's Core Purpose

What is the main purpose of Apache Spark in data processing?

A. To provide a fast and general engine for large-scale data processing
B. To store data permanently, like a database
C. To create visual dashboards for data analysis
D. To replace all programming languages with a new one
💡 Hint

Think about what Spark is designed to do with large volumes of data, and how quickly.

Predict Output · intermediate
Output of a Spark DataFrame Operation

What will be the output of the following Spark code snippet?

Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])
df_filtered = df.filter(df.id > 1)
df_filtered.show()
A.
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  2|banana|
|  3|cherry|
+---+------+
B.
+---+------+
| id| fruit|
+---+------+
|  2|banana|
|  3|cherry|
+---+------+
C. SyntaxError: invalid syntax
D. RuntimeError: filter condition invalid
💡 Hint

Filter keeps rows where id is greater than 1.
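The filter semantics in the snippet can be checked without Spark at all; a minimal plain-Python sketch using the same rows and the same `id > 1` predicate from the question:

```python
# Rows as (id, fruit) tuples, mirroring the DataFrame in the question.
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]

# df.filter(df.id > 1) keeps only the rows whose id column is greater than 1.
filtered = [(row_id, fruit) for row_id, fruit in data if row_id > 1]

print(filtered)  # [(2, 'banana'), (3, 'cherry')]
```

The row with id 1 is dropped, which matches how `filter` (a.k.a. `where`) behaves on a Spark DataFrame.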

Data Output · advanced
Result of a Spark Aggregation

Given the following Spark DataFrame, what is the output of the aggregation?

Python (PySpark)
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30), (4, 40)]
df = spark.createDataFrame(data, ['id', 'value'])
avg_df = df.agg(avg('value').alias('average_value'))
avg_df.show()
A.
+-------------+
|average_value|
+-------------+
|         25.0|
+-------------+
B.
+-----+-----+
|   id|value|
+-----+-----+
|    1|   10|
|    2|   20|
|    3|   30|
|    4|   40|
+-----+-----+
C. TypeError: avg() missing 1 required positional argument
D. RuntimeError: aggregation failed
💡 Hint

Average of 10, 20, 30, 40 is (10+20+30+40)/4.
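The `agg(avg('value'))` call collapses all rows into a single mean; the same arithmetic in plain Python, using only the `value` column from the question's data:

```python
values = [10, 20, 30, 40]                  # the 'value' column from the question
average_value = sum(values) / len(values)  # what avg('value') computes

print(average_value)  # 25.0
```

Note that the result is a float (`25.0`), which is why Spark's output row shows `25.0` rather than `25`.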

Visualization · advanced
Visualizing Data with Spark and Matplotlib

Which option correctly shows how to collect Spark DataFrame data and plot a bar chart of 'fruit' counts using matplotlib?

Python (PySpark)
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [('apple', 4), ('banana', 2), ('cherry', 5)]
df = spark.createDataFrame(data, ['fruit', 'count'])
A.
data = df.toPandas()
plt.bar(data.fruit, data['count'])
plt.show()
B.
plt.bar(df['fruit'], df['count'])
plt.show()
C.
data = df.collect()
fruits = [row['fruit'] for row in data]
counts = [row['count'] for row in data]
plt.bar(fruits, counts)
plt.show()
D.
df.plot(kind='bar')
plt.show()
💡 Hint

You need to convert Spark DataFrame rows to Python lists before plotting.
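`df.collect()` returns a list of Spark `Row` objects, each of which supports `row['column']` access; that is what makes list comprehensions over the collected rows work. A minimal stand-in using plain dicts (no Spark or matplotlib required) to show the conversion step:

```python
# Stand-in for df.collect(): Spark Row objects support row['col'] access,
# so plain dicts model them closely enough for this sketch.
rows = [{'fruit': 'apple', 'count': 4},
        {'fruit': 'banana', 'count': 2},
        {'fruit': 'cherry', 'count': 5}]

# Pull each column out into a parallel Python list.
fruits = [row['fruit'] for row in rows]
counts = [row['count'] for row in rows]

print(fruits)  # ['apple', 'banana', 'cherry']
print(counts)  # [4, 2, 5]
# plt.bar(fruits, counts) would then plot these two parallel lists.
```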

🔧 Debug · expert
Identifying the Error in Spark Code

What error will this Spark code raise?

Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'a'), (2, 'b')]
df = spark.createDataFrame(data, ['id', 'letter'])
result = df.groupBy('id').sum('letter').show()
A. No error; outputs the sum of the letters
B.
+---+-----------+
| id|sum(letter)|
+---+-----------+
|  1|          a|
|  2|          b|
+---+-----------+
C. TypeError: sum() missing 1 required positional argument
D. AnalysisException: cannot resolve 'sum(letter)' due to data type mismatch
💡 Hint

Sum aggregation expects numeric columns, but 'letter' is text.
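Plain Python exhibits the same type problem: summing a list of strings raises a `TypeError`, analogous to Spark rejecting `sum` on a string column at query-analysis time. A minimal sketch (not Spark's actual error path, just the underlying type mismatch):

```python
letters = ['a', 'b']       # the non-numeric 'letter' column from the question
error = None
try:
    sum(letters)           # sum() starts at 0 and cannot add str values to int
except TypeError as exc:
    error = exc
print(type(error).__name__)  # TypeError

numbers = [1, 2]
print(sum(numbers))  # 3: summing a numeric column works fine
```

In Spark the mismatch is caught before execution, which is why the code raises an `AnalysisException` rather than a Python `TypeError`.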