
Windowed aggregations in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1: Predict Output (Intermediate)
Output of a sliding window sum
Given the following Spark code, what does the final collect() call return?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
df = spark.createDataFrame(data, ['id', 'value'])

windowSpec = Window.orderBy('id').rowsBetween(-1, 1)

result = df.withColumn('window_sum', sum('value').over(windowSpec))
result.orderBy('id').select('id', 'window_sum').collect()
A. [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
B. [(1, 10), (2, 30), (3, 60), (4, 90), (5, 140)]
C. [(1, 30), (2, 60), (3, 90), (4, 120), (5, 140)]
D. [(1, 30), (2, 60), (3, 90), (4, 120), (5, 90)]
💡 Hint
Think about how the window frame includes the current row and one row before and after.
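The hint can be made concrete without Spark. Below is a plain-Python sketch (an illustration, not the problem's code) of how a rowsBetween(-1, 1) frame behaves, run on toy numbers distinct from the problem data:

```python
def sliding_sum(values, start=-1, end=1):
    """Emulate a rowsBetween(start, end) sliding sum over an ordered list."""
    result = []
    for i in range(len(values)):
        # Clamp the frame to the rows that actually exist, as Spark
        # does at the edges of the ordered data.
        lo = max(0, i + start)
        hi = min(len(values), i + end + 1)
        result.append(sum(values[lo:hi]))
    return result

print(sliding_sum([1, 2, 3, 4]))  # [3, 6, 9, 7]
```

Note how the first and last rows have smaller frames because there is no row before or after them.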
Problem 2: Data Output (Intermediate)
Count of rows in a partitioned window
What is the output of the following Spark code snippet that counts rows per group using window functions?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import count

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [('A', 1), ('A', 2), ('B', 3), ('B', 4), ('B', 5)]
df = spark.createDataFrame(data, ['group', 'value'])

windowSpec = Window.partitionBy('group').orderBy('value').rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = df.withColumn('count_so_far', count('value').over(windowSpec))
result.orderBy('group', 'value').select('group', 'value', 'count_so_far').collect()
A. [('A', 1, 1), ('A', 2, 2), ('B', 3, 1), ('B', 4, 2), ('B', 5, 3)]
B. [('A', 1, 2), ('A', 2, 2), ('B', 3, 3), ('B', 4, 3), ('B', 5, 3)]
C. [('A', 1, 1), ('A', 2, 1), ('B', 3, 1), ('B', 4, 1), ('B', 5, 1)]
D. [('A', 1, 0), ('A', 2, 1), ('B', 3, 0), ('B', 4, 1), ('B', 5, 2)]
💡 Hint
Count accumulates rows from the start of the partition up to the current row.
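To see the accumulation pattern without Spark, here is a plain-Python sketch of a running count over an unboundedPreceding-to-currentRow frame within each partition, using toy groups distinct from the problem data:

```python
from itertools import groupby

# Rows are (group, value); groupby requires them pre-sorted by group.
rows = [('X', 7), ('X', 9), ('Y', 1)]

out = []
for key, grp in groupby(rows, key=lambda r: r[0]):
    # Within each partition, order by value and count rows seen so far.
    for n, (g, v) in enumerate(sorted(grp, key=lambda r: r[1]), start=1):
        out.append((g, v, n))

print(out)  # [('X', 7, 1), ('X', 9, 2), ('Y', 1, 1)]
```

The counter restarts at 1 for each partition, just as a partitioned window frame does.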
Problem 3: Visualization (Advanced)
Visualizing moving average with window functions
You have a Spark DataFrame with daily sales data and want to add a column showing the 3-day moving average of sales. What does the following code return?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 100), (2, 200), (3, 300), (4, 400), (5, 500)]
df = spark.createDataFrame(data, ['day', 'sales'])

windowSpec = Window.orderBy('day').rowsBetween(-2, 0)

result = df.withColumn('moving_avg', avg('sales').over(windowSpec))
result.orderBy('day').select('day', 'moving_avg').collect()
A. [(1, 100.0), (2, 150.0), (3, 250.0), (4, 350.0), (5, 450.0)]
B. [(1, 100.0), (2, 150.0), (3, 200.0), (4, 300.0), (5, 400.0)]
C. [(1, 100.0), (2, 150.0), (3, 200.0), (4, 300.0), (5, 500.0)]
D. [(1, 100.0), (2, 200.0), (3, 300.0), (4, 400.0), (5, 500.0)]
💡 Hint
The window frame includes the current day and the two previous days.
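A trailing frame like rowsBetween(-2, 0) shrinks at the start of the data, which is the key to this problem. A plain-Python sketch on toy numbers distinct from the problem data:

```python
def trailing_avg(values, width=3):
    """Emulate a trailing `width`-row average (rowsBetween(-width+1, 0))."""
    out = []
    for i in range(len(values)):
        # The frame holds at most `width` rows ending at the current row;
        # early rows have fewer rows available, so the frame is smaller.
        frame = values[max(0, i - width + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

print(trailing_avg([10, 20, 60]))  # [10.0, 15.0, 30.0]
```

The divisor is the number of rows actually in the frame, not a fixed 3, which is why the first averages are not "missing" or padded.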
Problem 4: 🔧 Debug (Advanced)
Identify the error in window frame specification
What error will the following Spark code raise when executed?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ['id', 'value'])

windowSpec = Window.orderBy('id').rowsBetween(1, -1)

result = df.withColumn('window_sum', sum('value').over(windowSpec))
result.show()
A. AnalysisException: Window frame start must be less than or equal to end
B. No error; outputs sums over a reversed window
C. TypeError: rowsBetween expects integers
D. ValueError: orderBy column missing
💡 Hint
Check the order of the frame boundaries in rowsBetween.
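The invariant at issue can be sketched in plain Python. This is only an illustration of the rule a frame specification must satisfy; the exception type and message a real engine raises will differ:

```python
def validate_frame(start, end):
    """Illustrative check: a window frame's start must not exceed its end."""
    if start > end:
        raise ValueError(f"frame start {start} must be <= frame end {end}")
    return (start, end)

validate_frame(-1, 1)   # valid: one row before through one row after
try:
    validate_frame(1, -1)  # invalid: start is after end
except ValueError as e:
    print(e)
```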
Problem 5: 🚀 Application (Expert)
Calculate cumulative max per group with window functions
You have a Spark DataFrame with columns category and score and want to add a column cum_max showing the highest score seen so far within each category, ordered by score ascending. What does the following code return?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import max

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
data = [('A', 10), ('A', 20), ('A', 15), ('B', 5), ('B', 25)]
df = spark.createDataFrame(data, ['category', 'score'])

windowSpec = Window.partitionBy('category').orderBy('score').rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = df.withColumn('cum_max', max('score').over(windowSpec))
result.orderBy('category', 'score').select('category', 'score', 'cum_max').collect()
A. [('A', 10, 20), ('A', 15, 20), ('A', 20, 20), ('B', 5, 25), ('B', 25, 25)]
B. [('A', 10, 10), ('A', 15, 20), ('A', 20, 20), ('B', 5, 5), ('B', 25, 25)]
C. [('A', 10, 10), ('A', 15, 15), ('A', 20, 20), ('B', 5, 5), ('B', 25, 25)]
D. [('A', 10, 10), ('A', 15, 15), ('A', 20, 15), ('B', 5, 5), ('B', 25, 25)]
💡 Hint
Cumulative max updates as you move through the ordered scores within each category.
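The general behavior of a cumulative max is easy to see without Spark. A plain-Python sketch on toy scores distinct from the problem data (deliberately not sorted, to show the "highest seen so far" carry-forward):

```python
def cumulative_max(values):
    """Running maximum over a sequence, in its given order."""
    out, best = [], float('-inf')
    for v in values:
        best = max(best, v)  # highest value encountered up to this row
        out.append(best)
    return out

print(cumulative_max([3, 1, 4, 2]))  # [3, 3, 4, 4]
```

Then consider what this collapses to when the input is already sorted ascending, as the problem's orderBy('score') guarantees within each partition.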