Challenge - 5 Problems
GroupBy Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of GroupBy with sum aggregation
What is the output of the following Apache Spark code snippet?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, 'A', 10), (2, 'B', 20), (3, 'A', 30), (4, 'B', 40)]
columns = ['id', 'category', 'value']
df = spark.createDataFrame(data, columns)
result = df.groupBy('category').agg(sum('value').alias('total_value')).orderBy('category')
result.show()
Attempts: 2 left
💡 Hint
Sum the 'value' column grouped by 'category'.
✗ Incorrect
The sum aggregation adds the 'value' entries within each category: 'A' gets 10 + 30 = 40 and 'B' gets 20 + 40 = 60, so the output shows two rows, (A, 40) and (B, 60), ordered by category.
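To sanity-check that arithmetic without a Spark session, the same groupBy-sum can be sketched in plain Python over the same tuples (a sketch for verification only, not the Spark API):

```python
# Plain-Python check of the expected groupBy('category') + sum('value') output.
data = [(1, 'A', 10), (2, 'B', 20), (3, 'A', 30), (4, 'B', 40)]

totals = {}
for _id, category, value in data:
    # accumulate the running sum of 'value' per category
    totals[category] = totals.get(category, 0) + value

print(sorted(totals.items()))  # [('A', 40), ('B', 60)]
```

This matches the two rows `result.show()` prints: total_value 40 for 'A' and 60 for 'B'.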
❓ Data Output
Intermediate · 2:00 remaining
Count distinct values per group
Given the DataFrame below, what is the output of counting distinct 'value' per 'category'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, 'A', 10), (2, 'B', 20), (3, 'A', 10), (4, 'B', 40), (5, 'A', 30)]
columns = ['id', 'category', 'value']
df = spark.createDataFrame(data, columns)
result = df.groupBy('category').agg(countDistinct('value').alias('distinct_values')).orderBy('category')
result.show()
Attempts: 2 left
💡 Hint
Count unique 'value' entries per 'category'.
✗ Incorrect
Category 'A' has values 10, 10, 30, giving 2 distinct values ({10, 30}); category 'B' has 20, 40, also 2 distinct values. Both rows therefore show distinct_values = 2.
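The distinct count can be verified in plain Python by collecting each category's values into a set (a verification sketch, independent of Spark):

```python
# Plain-Python check of countDistinct('value') per category.
data = [(1, 'A', 10), (2, 'B', 20), (3, 'A', 10), (4, 'B', 40), (5, 'A', 30)]

distinct = {}
for _id, category, value in data:
    # a set keeps only unique 'value' entries per category
    distinct.setdefault(category, set()).add(value)

print({category: len(values) for category, values in sorted(distinct.items())})  # {'A': 2, 'B': 2}
```

The duplicate 10 in category 'A' collapses in the set, which is exactly what countDistinct does.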
❓ Visualization
Advanced · 3:00 remaining
Visualizing average values per group
Which option correctly describes the bar chart produced by the following code?
Apache Spark
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, 'X', 5), (2, 'Y', 15), (3, 'X', 10), (4, 'Y', 20), (5, 'Z', 25)]
columns = ['id', 'group', 'score']
df = spark.createDataFrame(data, columns)
avg_df = df.groupBy('group').agg(avg('score').alias('avg_score')).orderBy('group')
pandas_df = avg_df.toPandas()
plt.bar(pandas_df['group'], pandas_df['avg_score'])
plt.xlabel('Group')
plt.ylabel('Average Score')
plt.title('Average Score per Group')
plt.show()
Attempts: 2 left
💡 Hint
The code computes an avg aggregation per group and renders the result as a bar chart.
✗ Incorrect
The average scores are (5 + 10) / 2 = 7.5 for X, (15 + 20) / 2 = 17.5 for Y, and 25 for Z; the bar chart shows one bar per group at those heights, ordered X, Y, Z.
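The bar heights can be double-checked in plain Python by accumulating a sum and a count per group (a verification sketch; the chart itself still requires matplotlib):

```python
# Plain-Python check of avg('score') per group, i.e. the bar heights.
data = [(1, 'X', 5), (2, 'Y', 15), (3, 'X', 10), (4, 'Y', 20), (5, 'Z', 25)]

sums = {}
for _id, group, score in data:
    # track (running sum, count) for each group
    s, n = sums.get(group, (0, 0))
    sums[group] = (s + score, n + 1)

print({group: s / n for group, (s, n) in sorted(sums.items())})  # {'X': 7.5, 'Y': 17.5, 'Z': 25.0}
```

These three values are exactly what `plt.bar` plots on the y-axis for groups X, Y, and Z.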
🔧 Debug
Advanced · 2:00 remaining
Identify the error in aggregation code
What error will the following Apache Spark code raise?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean

spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, 'A', 10), (2, 'B', 20)]
columns = ['id', 'category', 'value']
df = spark.createDataFrame(data, columns)
result = df.groupBy('category').agg(mean('values').alias('avg_value'))
result.show()
Attempts: 2 left
💡 Hint
Check the column name passed to the mean function against the DataFrame's columns.
✗ Incorrect
The code passes 'values' instead of 'value' to mean(). That column does not exist in the DataFrame, so Spark raises an AnalysisException when the query is analyzed; the fix is to reference the existing column, mean('value').
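The failure mode can be mimicked without Spark: referencing a non-existent field raises an error just as the bad column name does. A plain-Python analogue (the `mean_by_category` helper is hypothetical, for illustration only; Python raises KeyError where Spark raises AnalysisException):

```python
# Plain-Python analogue of the column-name bug: group rows by 'category'
# and average a named field, failing if the field name is wrong.
rows = [
    {'id': 1, 'category': 'A', 'value': 10},
    {'id': 2, 'category': 'B', 'value': 20},
]

def mean_by_category(rows, field):
    totals = {}
    for r in rows:
        s, n = totals.get(r['category'], (0, 0))
        totals[r['category']] = (s + r[field], n + 1)  # KeyError if field is misspelled
    return {k: s / n for k, (s, n) in totals.items()}

print(mean_by_category(rows, 'value'))   # {'A': 10.0, 'B': 20.0}
try:
    mean_by_category(rows, 'values')     # wrong name, like mean('values') in the Spark code
except KeyError as e:
    print('KeyError:', e)
```

The parallel is the point: both engines reject the query because 'values' simply is not a column of the data.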
🚀 Application
Expert · 3:00 remaining
Calculate weighted average per group
You have a DataFrame with columns 'group', 'score', and 'weight'. Which code snippet correctly calculates the weighted average score per group?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, col

spark = SparkSession.builder.appName('test').getOrCreate()
data = [('A', 80, 0.5), ('A', 90, 0.5), ('B', 70, 0.3), ('B', 60, 0.7)]
columns = ['group', 'score', 'weight']
df = spark.createDataFrame(data, columns)
Attempts: 2 left
💡 Hint
Use col() for column operations inside sum().
✗ Incorrect
The correct option uses col() for both the numerator sum(col('score') * col('weight')) and the denominator sum(col('weight')), then calls show() on the result. The incorrect options variously omit col() in the denominator (causing an error), pass raw strings where column expressions are required, or collect the result without displaying it.
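The expected numbers from that weighted-average formula, sum(score × weight) / sum(weight) per group, can be checked in plain Python (a verification sketch of the arithmetic, not the Spark API):

```python
# Plain-Python check of the weighted average per group:
# weighted_avg = sum(score * weight) / sum(weight) within each group.
data = [('A', 80, 0.5), ('A', 90, 0.5), ('B', 70, 0.3), ('B', 60, 0.7)]

acc = {}
for group, score, weight in data:
    # track (running weighted sum, running weight total) per group
    num, den = acc.get(group, (0.0, 0.0))
    acc[group] = (num + score * weight, den + weight)

print({group: num / den for group, (num, den) in sorted(acc.items())})
```

This prints a weighted average of 85.0 for 'A' ((80·0.5 + 90·0.5) / 1.0) and 63.0 for 'B' ((70·0.3 + 60·0.7) / 1.0), up to float rounding, which is what the correct Spark snippet would show.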