Apache Spark · ~20 mins

SQL queries on DataFrames in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output (intermediate)
What is the output of this Spark SQL query on a DataFrame?

Given a Spark DataFrame df with columns id and value:

+---+-----+
|id |value|
+---+-----+
|1  |10   |
|2  |20   |
|3  |30   |
|4  |40   |
+---+-----+

What will be the result of this SQL query?

df.createOrReplaceTempView("table1")
spark.sql("SELECT id, value * 2 AS double_value FROM table1 WHERE id > 2 ORDER BY id").show()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1,10),(2,20),(3,30),(4,40)]
df = spark.createDataFrame(data, ["id", "value"])
df.createOrReplaceTempView("table1")
result = spark.sql("SELECT id, value * 2 AS double_value FROM table1 WHERE id > 2 ORDER BY id")
result.show()
A) [Row(id=3, double_value=60), Row(id=4, double_value=80)]
B) [Row(id=1, double_value=20), Row(id=2, double_value=40)]
C) [Row(id=3, double_value=30), Row(id=4, double_value=40)]
D) [Row(id=2, double_value=40), Row(id=3, double_value=60)]
💡 Hint

Remember the WHERE clause filters rows where id is greater than 2.
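
To reason about the evaluation order without a Spark cluster, the same semantics can be mimicked in plain Python on a deliberately different toy table (hypothetical data, not the problem's):

```python
# Plain-Python sketch of the SQL evaluation order: WHERE filters rows,
# ORDER BY sorts the survivors, then SELECT computes the projected columns.
rows = [(5, 7), (1, 3), (4, 2)]  # (id, value) -- toy data, not the quiz table

filtered = [r for r in rows if r[0] > 2]         # WHERE id > 2
ordered = sorted(filtered, key=lambda r: r[0])   # ORDER BY id
projected = [(i, v * 2) for i, v in ordered]     # SELECT id, value * 2 AS double_value

print(projected)  # [(4, 4), (5, 14)]
```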

Data Output (intermediate)
How many rows are returned by this SQL query on a DataFrame?

Consider a Spark DataFrame df with the following data:

+---+--------+
|id |status  |
+---+--------+
|1  |active  |
|2  |inactive|
|3  |active  |
|4  |inactive|
|5  |active  |
+---+--------+

What is the number of rows returned by this SQL query?

df.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users WHERE status = 'active'").count()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(1,"active"),(2,"inactive"),(3,"active"),(4,"inactive"),(5,"active")]
df = spark.createDataFrame(data, ["id", "status"])
df.createOrReplaceTempView("users")
count_active = spark.sql("SELECT * FROM users WHERE status = 'active'").count()
A) 3
B) 2
C) 5
D) 4
💡 Hint

Count only rows where status is exactly 'active'.
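
The count is just a filter followed by a tally, and the string match is exact and case-sensitive. A plain-Python check on made-up data (not the quiz table):

```python
# Equality in SQL's WHERE is exact: 'ACTIVE' or 'active ' would not match.
rows = [("a", "active"), ("b", "ACTIVE"), ("c", "active"), ("d", "done")]
count_active = sum(1 for _, status in rows if status == "active")
print(count_active)  # 2
```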

🔧 Debug (advanced)
What error does this Spark SQL code raise?

Given a Spark DataFrame df with columns name and age, what error will this code produce?

df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > '30'").show()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [("Alice", 25), ("Bob", 35), ("Charlie", 30)]
df = spark.createDataFrame(data, ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > '30'").show()
A) AnalysisException: cannot resolve 'age > '30'' due to data type mismatch
B) No error, outputs rows with age > 30
C) RuntimeError: invalid comparison between int and string
D) SyntaxError: invalid syntax in SQL query
💡 Hint

Check how Spark SQL handles comparisons between integers and strings.
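
Whatever coercion rule a given Spark version applies (behavior differs with ANSI mode in Spark 3.x), an explicit cast removes the ambiguity. A hedged sketch, assuming the `people` temp view from the problem exists:

```python
# Explicit CAST states the intended comparison type instead of relying on
# implicit string-to-integer coercion, whose behavior is version/mode dependent.
query = "SELECT name, age FROM people WHERE age > CAST('30' AS INT)"
# spark.sql(query).show()  # uncomment with a live SparkSession
```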

Visualization (advanced)
Which option shows the correct bar chart output from this Spark SQL aggregation?

Given a Spark DataFrame sales with columns product and amount:

+---------+-------+
|product  |amount |
+---------+-------+
|A        |100    |
|B        |200    |
|A        |150    |
|B        |50     |
|C        |300    |
+---------+-------+

After running this SQL query and plotting a bar chart of total sales per product, which chart matches the data?

sales.createOrReplaceTempView("sales_table")
result = spark.sql("SELECT product, SUM(amount) AS total FROM sales_table GROUP BY product ORDER BY product")
result.show()
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
spark = SparkSession.builder.getOrCreate()
data = [("A",100),("B",200),("A",150),("B",50),("C",300)]
sales = spark.createDataFrame(data, ["product", "amount"])
sales.createOrReplaceTempView("sales_table")
result = spark.sql("SELECT product, SUM(amount) AS total FROM sales_table GROUP BY product ORDER BY product")
pandas_df = result.toPandas()
plt.bar(pandas_df['product'], pandas_df['total'])
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.title('Total Sales per Product')
plt.show()
A) Bar heights: A=250, B=300, C=200
B) Bar heights: A=100, B=200, C=300
C) Bar heights: A=150, B=50, C=300
D) Bar heights: A=250, B=250, C=300
💡 Hint

Sum amounts per product: add all amounts for each product.
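
The aggregation can be rehearsed in plain Python: group rows by key and accumulate. The toy data below is deliberately different from the problem's table:

```python
from collections import defaultdict

# Plain-Python analogue of GROUP BY product, SUM(amount) ... ORDER BY product.
rows = [("X", 10), ("Y", 5), ("X", 7)]  # (product, amount) -- toy data
totals = defaultdict(int)
for product, amount in rows:
    totals[product] += amount

print(sorted(totals.items()))  # [('X', 17), ('Y', 5)]
```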

🧠 Conceptual (expert)
Which option best describes the behavior of Spark SQL when joining two DataFrames with duplicate column names?

Suppose you join two Spark DataFrames df1 and df2 on a key column, and both have a column named value. What happens to the value columns in the resulting DataFrame?

A) Spark SQL requires manual aliasing to avoid duplicate column names.
B) Spark SQL throws an AnalysisException due to ambiguous column names.
C) Spark SQL automatically renames duplicate columns by appending suffixes like value and value_1.
D) Spark SQL keeps only one value column, dropping the other silently.
💡 Hint

Think about how Spark handles columns with the same name after join.
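
Whichever behavior you settle on, a common defensive pattern is to alias each side of the join and rename the clashing column in the projection. A hedged sketch with hypothetical view names df1/df2 and a key column named key:

```python
# Aliasing both sides and renaming the duplicated column in the SELECT list
# guarantees unambiguous column names in the join result.
query = """
SELECT t1.key,
       t1.value AS value_left,
       t2.value AS value_right
FROM df1 AS t1
JOIN df2 AS t2 ON t1.key = t2.key
"""
# spark.sql(query)  # assumes df1/df2 are registered temp views
```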