
What is Apache Spark - Practice Questions & Exercises

Challenge: 5 Problems
🧠 Conceptual · intermediate
Understanding Apache Spark's Core Purpose

What is the main purpose of Apache Spark in data processing?

A. To provide a fast and general engine for large-scale data processing
B. To store data permanently, like a database
C. To create visual dashboards for data analysis
D. To replace all programming languages with a new one
💡 Hint

Think about what Spark is designed to do with large volumes of data, and how quickly.

Predict Output · intermediate
Output of a Spark DataFrame Operation

What will be the output of the following Spark code snippet?

Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
df = spark.createDataFrame(data, ['id', 'fruit'])
df_filtered = df.filter(df.id > 1)
df_filtered.show()
A.
+---+------+
| id| fruit|
+---+------+
|  1| apple|
|  2|banana|
|  3|cherry|
+---+------+
B.
+---+------+
| id| fruit|
+---+------+
|  2|banana|
|  3|cherry|
+---+------+
C. SyntaxError: invalid syntax
D. RuntimeError: filter condition invalid
💡 Hint

Filter keeps rows where id is greater than 1.
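The filter semantics in the snippet can be checked without Spark at all; a minimal plain-Python sketch using the same rows and the same `id > 1` predicate from the question:

```python
# Rows as (id, fruit) tuples, mirroring the DataFrame in the question.
data = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]

# df.filter(df.id > 1) keeps only the rows whose id column is greater than 1.
filtered = [(row_id, fruit) for row_id, fruit in data if row_id > 1]

print(filtered)  # [(2, 'banana'), (3, 'cherry')]
```

The row with id 1 is dropped, which matches how `filter` (a.k.a. `where`) behaves on a Spark DataFrame.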

Data Output · advanced
Result of a Spark Aggregation

Given the following Spark DataFrame, what is the output of the aggregation?

Python (PySpark)
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30), (4, 40)]
df = spark.createDataFrame(data, ['id', 'value'])
avg_df = df.agg(avg('value').alias('average_value'))
avg_df.show()
A.
+-------------+
|average_value|
+-------------+
|         25.0|
+-------------+
B.
+-----+-----+
|   id|value|
+-----+-----+
|    1|   10|
|    2|   20|
|    3|   30|
|    4|   40|
+-----+-----+
C. TypeError: avg() missing 1 required positional argument
D. RuntimeError: aggregation failed
💡 Hint

Average of 10, 20, 30, 40 is (10+20+30+40)/4.
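The `agg(avg('value'))` call collapses all rows into a single mean; the same arithmetic in plain Python, using only the `value` column from the question's data:

```python
values = [10, 20, 30, 40]                  # the 'value' column from the question
average_value = sum(values) / len(values)  # what avg('value') computes

print(average_value)  # 25.0
```

Note that the result is a float (`25.0`), which is why Spark's output row shows `25.0` rather than `25`.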

Visualization · advanced
Visualizing Data with Spark and Matplotlib

Which option correctly shows how to collect Spark DataFrame data and plot a bar chart of 'fruit' counts using matplotlib?

Python (PySpark)
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [('apple', 4), ('banana', 2), ('cherry', 5)]
df = spark.createDataFrame(data, ['fruit', 'count'])
A.
data = df.toPandas()
plt.bar(data.fruit, data['count'])
plt.show()
B.
plt.bar(df['fruit'], df['count'])
plt.show()
C.
data = df.collect()
fruits = [row['fruit'] for row in data]
counts = [row['count'] for row in data]
plt.bar(fruits, counts)
plt.show()
D.
df.plot(kind='bar')
plt.show()
💡 Hint

You need to convert Spark DataFrame rows to Python lists before plotting.
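`df.collect()` returns a list of Spark `Row` objects, each of which supports `row['column']` access; that is what makes list comprehensions over the collected rows work. A minimal stand-in using plain dicts (no Spark or matplotlib required) to show the conversion step:

```python
# Stand-in for df.collect(): Spark Row objects support row['col'] access,
# so plain dicts model them closely enough for this sketch.
rows = [{'fruit': 'apple', 'count': 4},
        {'fruit': 'banana', 'count': 2},
        {'fruit': 'cherry', 'count': 5}]

# Pull each column out into a parallel Python list.
fruits = [row['fruit'] for row in rows]
counts = [row['count'] for row in rows]

print(fruits)  # ['apple', 'banana', 'cherry']
print(counts)  # [4, 2, 5]
# plt.bar(fruits, counts) would then plot these two parallel lists.
```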

🔧 Debug · expert
Identifying the Error in Spark Code

What error will this Spark code raise?

Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
data = [(1, 'a'), (2, 'b')]
df = spark.createDataFrame(data, ['id', 'letter'])
result = df.groupBy('id').sum('letter').show()
A. No error; outputs the sum of the letters
B.
+---+-----------+
| id|sum(letter)|
+---+-----------+
|  1|          a|
|  2|          b|
+---+-----------+
C. TypeError: sum() missing 1 required positional argument
D. AnalysisException: cannot resolve 'sum(letter)' due to data type mismatch
💡 Hint

Sum aggregation expects numeric columns, but 'letter' is text.
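Plain Python exhibits the same type problem: summing a list of strings raises a `TypeError`, analogous to Spark rejecting `sum` on a string column at query-analysis time. A minimal sketch (not Spark's actual error path, just the underlying type mismatch):

```python
letters = ['a', 'b']       # the non-numeric 'letter' column from the question
error = None
try:
    sum(letters)           # sum() starts at 0 and cannot add str values to int
except TypeError as exc:
    error = exc
print(type(error).__name__)  # TypeError

numbers = [1, 2]
print(sum(numbers))  # 3: summing a numeric column works fine
```

In Spark the mismatch is caught before execution, which is why the code raises an `AnalysisException` rather than a Python `TypeError`.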