Apache Spark · ~20 mins

SQL queries on DataFrames in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
What is the output of this Spark SQL query on a DataFrame?

Given a Spark DataFrame df with columns id and value:

+---+-----+
|id |value|
+---+-----+
|1  |10   |
|2  |20   |
|3  |30   |
|4  |40   |
+---+-----+

What will be the result of this SQL query?

df.createOrReplaceTempView("table1")
spark.sql("SELECT id, value * 2 AS double_value FROM table1 WHERE id > 2 ORDER BY id").show()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1,10),(2,20),(3,30),(4,40)]
df = spark.createDataFrame(data, ["id", "value"])
df.createOrReplaceTempView("table1")
result = spark.sql("SELECT id, value * 2 AS double_value FROM table1 WHERE id > 2 ORDER BY id")
result.show()
A) [Row(id=3, double_value=60), Row(id=4, double_value=80)]
B) [Row(id=1, double_value=20), Row(id=2, double_value=40)]
C) [Row(id=3, double_value=30), Row(id=4, double_value=40)]
D) [Row(id=2, double_value=40), Row(id=3, double_value=60)]
💡 Hint

Remember the WHERE clause filters rows where id is greater than 2.
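
To reason about the evaluation order without a Spark cluster, the same semantics can be mimicked in plain Python on a deliberately different toy table (hypothetical data, not the problem's):

```python
# Plain-Python sketch of the SQL evaluation order: WHERE filters rows,
# ORDER BY sorts the survivors, then SELECT computes the projected columns.
rows = [(5, 7), (1, 3), (4, 2)]  # (id, value) -- toy data, not the quiz table

filtered = [r for r in rows if r[0] > 2]         # WHERE id > 2
ordered = sorted(filtered, key=lambda r: r[0])   # ORDER BY id
projected = [(i, v * 2) for i, v in ordered]     # SELECT id, value * 2 AS double_value

print(projected)  # [(4, 4), (5, 14)]
```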

Data Output (intermediate)
How many rows are returned by this SQL query on a DataFrame?

Consider a Spark DataFrame df with the following data:

+---+--------+
|id |status  |
+---+--------+
|1  |active  |
|2  |inactive|
|3  |active  |
|4  |inactive|
|5  |active  |
+---+--------+

What is the number of rows returned by this SQL query?

df.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users WHERE status = 'active'").count()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1,"active"),(2,"inactive"),(3,"active"),(4,"inactive"),(5,"active")]
df = spark.createDataFrame(data, ["id", "status"])
df.createOrReplaceTempView("users")
count_active = spark.sql("SELECT * FROM users WHERE status = 'active'").count()
A) 3
B) 2
C) 5
D) 4
💡 Hint

Count only rows where status is exactly 'active'.
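
The count is just a filter followed by a tally, and the string match is exact and case-sensitive. A plain-Python check on made-up data (not the quiz table):

```python
# Equality in SQL's WHERE is exact: 'ACTIVE' or 'active ' would not match.
rows = [("a", "active"), ("b", "ACTIVE"), ("c", "active"), ("d", "done")]
count_active = sum(1 for _, status in rows if status == "active")
print(count_active)  # 2
```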

🔧 Debug (advanced)
What error does this Spark SQL code raise?

Given a Spark DataFrame df with columns name and age, what error will this code produce?

df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > '30'").show()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [("Alice", 25), ("Bob", 35), ("Charlie", 30)]
df = spark.createDataFrame(data, ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > '30'").show()
A) AnalysisException: cannot resolve 'age > '30'' due to data type mismatch
B) No error, outputs rows with age > 30
C) RuntimeError: invalid comparison between int and string
D) SyntaxError: invalid syntax in SQL query
💡 Hint

Check how Spark SQL handles comparisons between integers and strings.
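
Whatever coercion rule a given Spark version applies (behavior differs with ANSI mode in Spark 3.x), an explicit cast removes the ambiguity. A hedged sketch, assuming the `people` temp view from the problem exists:

```python
# Explicit CAST states the intended comparison type instead of relying on
# implicit string-to-integer coercion, whose behavior is version/mode dependent.
query = "SELECT name, age FROM people WHERE age > CAST('30' AS INT)"
# spark.sql(query).show()  # uncomment with a live SparkSession
```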

Visualization (advanced)
Which option shows the correct bar chart output from this Spark SQL aggregation?

Given a Spark DataFrame sales with columns product and amount:

+---------+-------+
|product  |amount |
+---------+-------+
|A        |100    |
|B        |200    |
|A        |150    |
|B        |50     |
|C        |300    |
+---------+-------+

After running this SQL query and plotting a bar chart of total sales per product, which chart matches the data?

sales.createOrReplaceTempView("sales_table")
result = spark.sql("SELECT product, SUM(amount) AS total FROM sales_table GROUP BY product ORDER BY product")
result.show()
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
spark = SparkSession.builder.getOrCreate()
data = [("A",100),("B",200),("A",150),("B",50),("C",300)]
sales = spark.createDataFrame(data, ["product", "amount"])
sales.createOrReplaceTempView("sales_table")
result = spark.sql("SELECT product, SUM(amount) AS total FROM sales_table GROUP BY product ORDER BY product")
pandas_df = result.toPandas()
plt.bar(pandas_df['product'], pandas_df['total'])
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.title('Total Sales per Product')
plt.show()
A) Bar heights: A=250, B=300, C=200
B) Bar heights: A=100, B=200, C=300
C) Bar heights: A=150, B=50, C=300
D) Bar heights: A=250, B=250, C=300
💡 Hint

Sum amounts per product: add all amounts for each product.
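
The aggregation can be rehearsed in plain Python: group rows by key and accumulate. The toy data below is deliberately different from the problem's table:

```python
from collections import defaultdict

# Plain-Python analogue of GROUP BY product, SUM(amount) ... ORDER BY product.
rows = [("X", 10), ("Y", 5), ("X", 7)]  # (product, amount) -- toy data
totals = defaultdict(int)
for product, amount in rows:
    totals[product] += amount

print(sorted(totals.items()))  # [('X', 17), ('Y', 5)]
```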

🧠 Conceptual (expert)
Which option best describes the behavior of Spark SQL when joining two DataFrames with duplicate column names?

Suppose you join two Spark DataFrames df1 and df2 on a key column, and both have a column named value. What happens to the value columns in the resulting DataFrame?

A) Spark SQL requires manual aliasing to avoid duplicate column names.
B) Spark SQL throws an AnalysisException due to ambiguous column names.
C) Spark SQL automatically renames duplicate columns by appending suffixes like value and value_1.
D) Spark SQL keeps only one value column, dropping the other silently.
💡 Hint

Think about how Spark handles columns with the same name after join.
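
Whichever behavior you settle on, a common defensive pattern is to alias each side of the join and rename the clashing column in the projection. A hedged sketch with hypothetical view names df1/df2 and a key column named key:

```python
# Aliasing both sides and renaming the duplicated column in the SELECT list
# guarantees unambiguous column names in the join result.
query = """
SELECT t1.key,
       t1.value AS value_left,
       t2.value AS value_right
FROM df1 AS t1
JOIN df2 AS t2 ON t1.key = t2.key
"""
# spark.sql(query)  # assumes df1/df2 are registered temp views
```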