Challenge - 5 Problems
Windowed Aggregations Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of a sliding window sum
Given the following Spark code, what is the result of the final collect() call?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
df = spark.createDataFrame(data, ['id', 'value'])
windowSpec = Window.orderBy('id').rowsBetween(-1, 1)
result = df.withColumn('window_sum', sum('value').over(windowSpec))
result.orderBy('id').select('id', 'window_sum').collect()
Attempts: 2 left
💡 Hint
Think about how the window frame includes the current row and one row before and after.
✗ Incorrect
The window frame includes the current row and one row before and after, so for id=1, sum is 10+20=30; for id=2, sum is 10+20+30=60; and so on.
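As a sanity check, the same frame logic can be reproduced in plain Python without a Spark session (an illustrative sketch of the rowsBetween(-1, 1) semantics, not the PySpark API):

```python
# Sliding-window sum over the frame (one row before, one row after),
# mirroring Window.orderBy('id').rowsBetween(-1, 1).
def sliding_sum(vals, start=-1, end=1):
    result = []
    for i in range(len(vals)):
        lo = max(0, i + start)            # Spark clamps the frame at the edges,
        hi = min(len(vals), i + end + 1)  # so boundary rows see smaller frames
        result.append(sum(vals[lo:hi]))
    return result

print(sliding_sum([10, 20, 30, 40, 50]))  # [30, 60, 90, 120, 90]
```

Note how the first and last rows sum only two values, because the frame shrinks where no neighbor exists.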
❓ Data Output
Intermediate · 2:00 remaining
Count of rows in a partitioned window
What is the output of the following Spark code snippet that counts rows per group using window functions?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import count

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [('A', 1), ('A', 2), ('B', 3), ('B', 4), ('B', 5)]
df = spark.createDataFrame(data, ['group', 'value'])
windowSpec = Window.partitionBy('group').orderBy('value').rowsBetween(Window.unboundedPreceding, Window.currentRow)
result = df.withColumn('count_so_far', count('value').over(windowSpec))
result.orderBy('group', 'value').select('group', 'value', 'count_so_far').collect()
Attempts: 2 left
💡 Hint
Count accumulates rows from the start of the partition up to the current row.
✗ Incorrect
The window counts rows from the first row in each group up to the current row, so counts increase by 1 for each row in the group.
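The running count can be mimicked in plain Python to see the expected rows (a sketch of the per-partition frame semantics, not Spark itself):

```python
from itertools import groupby

# Running count per group, mirroring
# Window.partitionBy('group').orderBy('value')
#       .rowsBetween(Window.unboundedPreceding, Window.currentRow).
data = [('A', 1), ('A', 2), ('B', 3), ('B', 4), ('B', 5)]

rows = []
for key, grp in groupby(data, key=lambda r: r[0]):  # data is already sorted
    for i, (g, v) in enumerate(grp, start=1):
        rows.append((g, v, i))  # i = rows seen so far in this partition

print(rows)  # [('A', 1, 1), ('A', 2, 2), ('B', 3, 1), ('B', 4, 2), ('B', 5, 3)]
```

The counter restarts at 1 for group 'B' because partitionBy starts a fresh frame in each partition.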
❓ Visualization
Advanced · 2:30 remaining
Visualizing moving average with window functions
You have a Spark DataFrame with daily sales data. You want to create a new column showing the 3-day moving average of sales. Which code snippet correctly computes this moving average?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 100), (2, 200), (3, 300), (4, 400), (5, 500)]
df = spark.createDataFrame(data, ['day', 'sales'])
windowSpec = Window.orderBy('day').rowsBetween(-2, 0)
result = df.withColumn('moving_avg', avg('sales').over(windowSpec))
result.orderBy('day').select('day', 'moving_avg').collect()
Attempts: 2 left
💡 Hint
The window frame includes the current day and the two previous days.
✗ Incorrect
The 3-day moving average is calculated by averaging sales of the current day and the two previous days. For example, day 3 average is (100+200+300)/3 = 200.
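The same arithmetic can be checked in plain Python (an illustrative sketch of the rowsBetween(-2, 0) frame, not the PySpark API):

```python
# 3-day moving average, mirroring Window.orderBy('day').rowsBetween(-2, 0):
# each frame holds the current day and up to two previous days.
sales = [100, 200, 300, 400, 500]

moving_avg = []
for i in range(len(sales)):
    frame = sales[max(0, i - 2): i + 1]  # Spark shrinks the frame at the edges
    moving_avg.append(sum(frame) / len(frame))

print(moving_avg)  # [100.0, 150.0, 200.0, 300.0, 400.0]
```

Days 1 and 2 average fewer than three values, since no earlier rows exist; from day 3 onward the frame is a full three days.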
🔧 Debug
Advanced · 2:00 remaining
Identify the error in window frame specification
What error will the following Spark code raise when executed?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ['id', 'value'])
windowSpec = Window.orderBy('id').rowsBetween(1, -1)
result = df.withColumn('window_sum', sum('value').over(windowSpec))
result.show()
Attempts: 2 left
💡 Hint
Check the order of the frame boundaries in rowsBetween.
✗ Incorrect
The rowsBetween method requires the start boundary to be less than or equal to the end boundary. Here, 1 > -1 causes an AnalysisException.
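The constraint behind the error can be sketched as a simple bounds check (illustrative only; `validate_frame` is a hypothetical helper, and Spark performs this validation in its analyzer, not via this code):

```python
def validate_frame(start, end):
    """Reject frames whose lower bound exceeds the upper bound,
    as Spark's analyzer does for rowsBetween(start, end)."""
    if start > end:
        raise ValueError(
            f"Invalid window frame: lower bound {start} exceeds upper bound {end}")
    return (start, end)

validate_frame(-1, 1)      # valid: one row before through one row after
try:
    validate_frame(1, -1)  # the buggy frame from the snippet above
except ValueError as e:
    print(e)
```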
🚀 Application
Expert · 3:00 remaining
Calculate cumulative max per group with window functions
You have a Spark DataFrame with columns category and score. You want to add a column cum_max that shows the highest score seen so far within each category, ordered by score ascending. Which code snippet correctly achieves this?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import max

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [('A', 10), ('A', 20), ('A', 15), ('B', 5), ('B', 25)]
df = spark.createDataFrame(data, ['category', 'score'])
windowSpec = Window.partitionBy('category').orderBy('score').rowsBetween(Window.unboundedPreceding, Window.currentRow)
result = df.withColumn('cum_max', max('score').over(windowSpec))
result.orderBy('category', 'score').select('category', 'score', 'cum_max').collect()
Attempts: 2 left
💡 Hint
Cumulative max updates as you move through the ordered scores within each category.
✗ Incorrect
The window frame includes all rows from the start of the partition up to the current row, so max accumulates the highest score seen so far in ascending order.
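The cumulative max can be traced in plain Python (a sketch of the ordered, per-partition frame, not the PySpark API; note that orderBy('score') means the rows are processed in ascending score order, not input order):

```python
from itertools import groupby

# Running max per category, mirroring
# Window.partitionBy('category').orderBy('score')
#       .rowsBetween(Window.unboundedPreceding, Window.currentRow).
data = [('A', 10), ('A', 20), ('A', 15), ('B', 5), ('B', 25)]

rows = []
# sorted() orders by (category, score), matching partitionBy + orderBy
for key, grp in groupby(sorted(data), key=lambda r: r[0]):
    running = None
    for cat, score in grp:
        running = score if running is None else max(running, score)
        rows.append((cat, score, running))

print(rows)  # [('A', 10, 10), ('A', 15, 15), ('A', 20, 20), ('B', 5, 5), ('B', 25, 25)]
```

Because the scores are ascending within each category, the cumulative max here always equals the current score; with a different ordering the two would diverge.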