
Windowed aggregations in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1: Predict Output (Intermediate)
Output of a sliding window sum
Given the following Spark code, what does the final collect() call return?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
df = spark.createDataFrame(data, ['id', 'value'])

windowSpec = Window.orderBy('id').rowsBetween(-1, 1)

result = df.withColumn('window_sum', sum('value').over(windowSpec))
result.orderBy('id').select('id', 'window_sum').collect()
A. [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
B. [(1, 10), (2, 30), (3, 60), (4, 90), (5, 140)]
C. [(1, 30), (2, 60), (3, 90), (4, 120), (5, 140)]
D. [(1, 30), (2, 60), (3, 90), (4, 120), (5, 90)]
💡 Hint
Think about how the window frame includes the current row and one row before and after.
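The hint can be made concrete without Spark. Below is a plain-Python sketch (an illustration, not the problem's code) of how a rowsBetween(-1, 1) frame behaves, run on toy numbers distinct from the problem data:

```python
def sliding_sum(values, start=-1, end=1):
    """Emulate a rowsBetween(start, end) sliding sum over an ordered list."""
    result = []
    for i in range(len(values)):
        # Clamp the frame to the rows that actually exist, as Spark
        # does at the edges of the ordered data.
        lo = max(0, i + start)
        hi = min(len(values), i + end + 1)
        result.append(sum(values[lo:hi]))
    return result

print(sliding_sum([1, 2, 3, 4]))  # [3, 6, 9, 7]
```

Note how the first and last rows have smaller frames because there is no row before or after them.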
Problem 2: Data Output (Intermediate)
Count of rows in a partitioned window
What is the output of the following Spark code snippet that counts rows per group using window functions?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import count

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [('A', 1), ('A', 2), ('B', 3), ('B', 4), ('B', 5)]
df = spark.createDataFrame(data, ['group', 'value'])

windowSpec = Window.partitionBy('group').orderBy('value').rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = df.withColumn('count_so_far', count('value').over(windowSpec))
result.orderBy('group', 'value').select('group', 'value', 'count_so_far').collect()
A. [('A', 1, 1), ('A', 2, 2), ('B', 3, 1), ('B', 4, 2), ('B', 5, 3)]
B. [('A', 1, 2), ('A', 2, 2), ('B', 3, 3), ('B', 4, 3), ('B', 5, 3)]
C. [('A', 1, 1), ('A', 2, 1), ('B', 3, 1), ('B', 4, 1), ('B', 5, 1)]
D. [('A', 1, 0), ('A', 2, 1), ('B', 3, 0), ('B', 4, 1), ('B', 5, 2)]
💡 Hint
Count accumulates rows from the start of the partition up to the current row.
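To see the accumulation pattern without Spark, here is a plain-Python sketch of a running count over an unboundedPreceding-to-currentRow frame within each partition, using toy groups distinct from the problem data:

```python
from itertools import groupby

# Rows are (group, value); groupby requires them pre-sorted by group.
rows = [('X', 7), ('X', 9), ('Y', 1)]

out = []
for key, grp in groupby(rows, key=lambda r: r[0]):
    # Within each partition, order by value and count rows seen so far.
    for n, (g, v) in enumerate(sorted(grp, key=lambda r: r[1]), start=1):
        out.append((g, v, n))

print(out)  # [('X', 7, 1), ('X', 9, 2), ('Y', 1, 1)]
```

The counter restarts at 1 for each partition, just as a partitioned window frame does.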
Problem 3: Visualization (Advanced)
Visualizing moving average with window functions
You have a Spark DataFrame with daily sales data and want to add a column showing the 3-day moving average of sales. What does the following code return?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 100), (2, 200), (3, 300), (4, 400), (5, 500)]
df = spark.createDataFrame(data, ['day', 'sales'])

windowSpec = Window.orderBy('day').rowsBetween(-2, 0)

result = df.withColumn('moving_avg', avg('sales').over(windowSpec))
result.orderBy('day').select('day', 'moving_avg').collect()
A. [(1, 100.0), (2, 150.0), (3, 250.0), (4, 350.0), (5, 450.0)]
B. [(1, 100.0), (2, 150.0), (3, 200.0), (4, 300.0), (5, 400.0)]
C. [(1, 100.0), (2, 150.0), (3, 200.0), (4, 300.0), (5, 500.0)]
D. [(1, 100.0), (2, 200.0), (3, 300.0), (4, 400.0), (5, 500.0)]
💡 Hint
The window frame includes the current day and the two previous days.
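A trailing frame like rowsBetween(-2, 0) shrinks at the start of the data, which is the key to this problem. A plain-Python sketch on toy numbers distinct from the problem data:

```python
def trailing_avg(values, width=3):
    """Emulate a trailing `width`-row average (rowsBetween(-width+1, 0))."""
    out = []
    for i in range(len(values)):
        # The frame holds at most `width` rows ending at the current row;
        # early rows have fewer rows available, so the frame is smaller.
        frame = values[max(0, i - width + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

print(trailing_avg([10, 20, 60]))  # [10.0, 15.0, 30.0]
```

The divisor is the number of rows actually in the frame, not a fixed 3, which is why the first averages are not "missing" or padded.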
Problem 4: 🔧 Debug (Advanced)
Identify the error in window frame specification
What error will the following Spark code raise when executed?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ['id', 'value'])

windowSpec = Window.orderBy('id').rowsBetween(1, -1)

result = df.withColumn('window_sum', sum('value').over(windowSpec))
result.show()
A. AnalysisException: Window frame start must be less than or equal to end
B. No error; outputs sums over a reversed window
C. TypeError: rowsBetween expects integers
D. ValueError: orderBy column missing
💡 Hint
Check the order of the frame boundaries in rowsBetween.
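The invariant at issue can be sketched in plain Python. This is only an illustration of the rule a frame specification must satisfy; the exception type and message a real engine raises will differ:

```python
def validate_frame(start, end):
    """Illustrative check: a window frame's start must not exceed its end."""
    if start > end:
        raise ValueError(f"frame start {start} must be <= frame end {end}")
    return (start, end)

validate_frame(-1, 1)   # valid: one row before through one row after
try:
    validate_frame(1, -1)  # invalid: start is after end
except ValueError as e:
    print(e)
```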
Problem 5: 🚀 Application (Expert)
Calculate cumulative max per group with window functions
You have a Spark DataFrame with columns category and score and want to add a column cum_max showing the highest score seen so far within each category, ordered by score ascending. What does the following code return?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import max

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [('A', 10), ('A', 20), ('A', 15), ('B', 5), ('B', 25)]
df = spark.createDataFrame(data, ['category', 'score'])

windowSpec = Window.partitionBy('category').orderBy('score').rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = df.withColumn('cum_max', max('score').over(windowSpec))
result.orderBy('category', 'score').select('category', 'score', 'cum_max').collect()
A. [('A', 10, 20), ('A', 15, 20), ('A', 20, 20), ('B', 5, 25), ('B', 25, 25)]
B. [('A', 10, 10), ('A', 15, 20), ('A', 20, 20), ('B', 5, 5), ('B', 25, 25)]
C. [('A', 10, 10), ('A', 15, 15), ('A', 20, 20), ('B', 5, 5), ('B', 25, 25)]
D. [('A', 10, 10), ('A', 15, 15), ('A', 20, 15), ('B', 5, 5), ('B', 25, 25)]
💡 Hint
Cumulative max updates as you move through the ordered scores within each category.
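The general behavior of a cumulative max is easy to see without Spark. A plain-Python sketch on toy scores distinct from the problem data (deliberately not sorted, to show the "highest seen so far" carry-forward):

```python
def cumulative_max(values):
    """Running maximum over a sequence, in its given order."""
    out, best = [], float('-inf')
    for v in values:
        best = max(best, v)  # highest value encountered up to this row
        out.append(best)
    return out

print(cumulative_max([3, 1, 4, 2]))  # [3, 3, 4, 4]
```

Then consider what this collapses to when the input is already sorted ascending, as the problem's orderBy('score') guarantees within each partition.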