Apache Spark · ~20 mins

GroupBy and aggregations in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1: Predict Output (intermediate)
Output of GroupBy with sum aggregation
What is the output of the following Apache Spark code snippet?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, 'A', 10), (2, 'B', 20), (3, 'A', 30), (4, 'B', 40)]
columns = ['id', 'category', 'value']
df = spark.createDataFrame(data, columns)
result = df.groupBy('category').agg(sum('value').alias('total_value')).orderBy('category')
print(result.collect())
A) [Row(category='A', total_value=10), Row(category='B', total_value=20)]
B) [Row(category='A', total_value=30), Row(category='B', total_value=40)]
C) [Row(category='A', total_value=40), Row(category='B', total_value=60)]
D) [Row(category='A', total_value=70), Row(category='B', total_value=60)]
💡 Hint: Sum the 'value' column grouped by 'category'.
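If you don't have a Spark session handy, you can check your reasoning offline: this sketch emulates `groupBy('category')` with a `sum('value')` aggregation using plain Python dictionaries. It mirrors the quiz data but is ordinary Python, not Spark.

```python
from collections import defaultdict

# Same rows as the quiz: (id, category, value)
data = [(1, 'A', 10), (2, 'B', 20), (3, 'A', 30), (4, 'B', 40)]

# Emulate df.groupBy('category').agg(sum('value')) with running totals
totals = defaultdict(int)
for _, category, value in data:
    totals[category] += value

# Emulate orderBy('category')
result = sorted(totals.items())
print(result)
```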
Problem 2: Data Output (intermediate)
Count distinct values per group
Given the DataFrame below, what is the output of counting distinct 'value' per 'category'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, 'A', 10), (2, 'B', 20), (3, 'A', 10), (4, 'B', 40), (5, 'A', 30)]
columns = ['id', 'category', 'value']
df = spark.createDataFrame(data, columns)
result = df.groupBy('category').agg(countDistinct('value').alias('distinct_values')).orderBy('category')
print(result.collect())
A) [Row(category='A', distinct_values=2), Row(category='B', distinct_values=3)]
B) [Row(category='A', distinct_values=2), Row(category='B', distinct_values=2)]
C) [Row(category='A', distinct_values=3), Row(category='B', distinct_values=3)]
D) [Row(category='A', distinct_values=3), Row(category='B', distinct_values=2)]
💡 Hint: Count unique 'value' entries per 'category'.
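As an offline check, `countDistinct` per group can be emulated with plain Python sets (a set keeps only unique values). This is a sketch on the quiz data, not Spark itself.

```python
# Emulate df.groupBy('category').agg(countDistinct('value')) with sets
data = [(1, 'A', 10), (2, 'B', 20), (3, 'A', 10), (4, 'B', 40), (5, 'A', 30)]

distinct = {}
for _, category, value in data:
    distinct.setdefault(category, set()).add(value)  # sets drop duplicates

# Emulate orderBy('category') and the final count
result = sorted((cat, len(vals)) for cat, vals in distinct.items())
print(result)
```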
Problem 3: Visualization (advanced)
Visualizing average values per group
Which option correctly describes the bar chart produced by the following code?
Apache Spark
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, 'X', 5), (2, 'Y', 15), (3, 'X', 10), (4, 'Y', 20), (5, 'Z', 25)]
columns = ['id', 'group', 'score']
df = spark.createDataFrame(data, columns)
avg_df = df.groupBy('group').agg(avg('score').alias('avg_score')).orderBy('group')
pandas_df = avg_df.toPandas()
plt.bar(pandas_df['group'], pandas_df['avg_score'])
plt.xlabel('Group')
plt.ylabel('Average Score')
plt.title('Average Score per Group')
plt.show()
A) Bar chart with groups X, Y, Z on x-axis and average scores 7.5, 17.5, 25 respectively on y-axis.
B) Bar chart with groups X, Y, Z on x-axis and total scores 15, 35, 25 respectively on y-axis.
C) Line chart with groups X, Y, Z on x-axis and average scores 7.5, 17.5, 25 respectively on y-axis.
D) Bar chart with groups X, Y, Z on x-axis and median scores 7.5, 17.5, 25 respectively on y-axis.
💡 Hint: The code uses an avg aggregation and a bar chart.
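To verify the numbers that would feed the bar chart, the per-group averages can be computed in plain Python, standing in for Spark's `avg` aggregation; no matplotlib needed for the check.

```python
# Same rows as the quiz: (id, group, score)
data = [(1, 'X', 5), (2, 'Y', 15), (3, 'X', 10), (4, 'Y', 20), (5, 'Z', 25)]

# Collect scores per group
scores = {}
for _, group, score in data:
    scores.setdefault(group, []).append(score)

# Emulate avg('score') per group, ordered by group
averages = {group: sum(vals) / len(vals) for group, vals in sorted(scores.items())}
print(averages)
```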
Problem 4: Debug (advanced)
Identify the error in aggregation code
What error will the following Apache Spark code raise?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, 'A', 10), (2, 'B', 20)]
columns = ['id', 'category', 'value']
df = spark.createDataFrame(data, columns)
result = df.groupBy('category').agg(mean('values').alias('avg_value'))
result.show()
A) AnalysisException: cannot resolve '`values`' given input columns: [id, category, value]
B) AttributeError: 'DataFrame' object has no attribute 'mean'
C) No error, outputs average values per category
D) TypeError: mean() missing 1 required positional argument
💡 Hint: Check the column name passed to mean().
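Spark resolves column names against the DataFrame's schema at analysis time. A toy Python lookup (an illustration only, not Spark's actual resolver) shows the same failure mode for a misspelled name; `get_column` and the `KeyError` here are stand-ins for the analyzer and its exception.

```python
columns = ['id', 'category', 'value']
record = dict(zip(columns, (1, 'A', 10)))

def get_column(record, name):
    # Toy stand-in for Spark resolving a column name against the schema
    if name not in record:
        raise KeyError(f"cannot resolve '{name}' given input columns: {columns}")
    return record[name]

assert get_column(record, 'value') == 10  # the correct name resolves

try:
    get_column(record, 'values')  # the quiz code's typo: 'values' vs 'value'
    resolved = True
except KeyError:
    resolved = False
```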
Problem 5: Application (expert)
Calculate weighted average per group
You have a DataFrame with columns 'group', 'score', and 'weight'. Which code snippet correctly calculates the weighted average score per group?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, col

spark = SparkSession.builder.appName('test').getOrCreate()
data = [('A', 80, 0.5), ('A', 90, 0.5), ('B', 70, 0.3), ('B', 60, 0.7)]
columns = ['group', 'score', 'weight']
df = spark.createDataFrame(data, columns)
A) df.groupBy('group').agg((sum('score' * 'weight') / sum('weight')).alias('weighted_avg')).show()
B) df.groupBy('group').agg((sum(col('score') * col('weight')) / sum('weight')).alias('weighted_avg')).show()
C) df.groupBy('group').agg((sum(col('score') * col('weight')) / sum('weight')).alias('weighted_avg')).collect()
D) df.groupBy('group').agg((sum(col('score') * col('weight')) / sum(col('weight'))).alias('weighted_avg')).show()
💡 Hint: Use col() for column arithmetic inside sum().
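The target quantity, sum(score × weight) / sum(weight) within each group, can be checked offline in plain Python on the quiz data. This is only a sketch of the arithmetic, not the Spark answer itself.

```python
# Same rows as the quiz: (group, score, weight)
data = [('A', 80, 0.5), ('A', 90, 0.5), ('B', 70, 0.3), ('B', 60, 0.7)]

# Accumulate (sum of score*weight, sum of weight) per group
acc = {}
for group, score, weight in data:
    num, den = acc.get(group, (0.0, 0.0))
    acc[group] = (num + score * weight, den + weight)

weighted_avg = {group: num / den for group, (num, den) in sorted(acc.items())}
print(weighted_avg)
```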