Challenge - 5 Problems
Window Function Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00
Output of rank() window function
What is the output of the `rank` column after running this Spark code?

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

spark = SparkSession.builder.getOrCreate()
data = [("A", 100), ("B", 200), ("A", 200), ("B", 100), ("A", 100)]
df = spark.createDataFrame(data, ["category", "value"])

windowSpec = Window.partitionBy("category").orderBy("value")
df = df.withColumn("rank", rank().over(windowSpec))
df.orderBy("category", "value").select("category", "value", "rank").collect()
```
💡 Hint
Remember that rank() assigns the same rank to ties and skips ranks after ties.
📝 Explanation
The rank() function assigns the same rank to rows with the same value within each partition. For category 'A', the two rows with value 100 get rank 1, and the next value 200 gets rank 3 (rank 2 is skipped). For category 'B', values 100 and 200 get ranks 1 and 2 respectively.
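The ranking rule described above can be checked with a plain-Python sketch of the same `rank()` semantics (illustrative pure Python, not PySpark; the function name is ours):

```python
# Plain-Python sketch of SQL rank() semantics: ties share a rank,
# and the next distinct value skips ranks (1, 1, 3, ...).
from collections import defaultdict

def rank_within_partitions(rows):
    """rows: list of (category, value); returns sorted (category, value, rank)."""
    parts = defaultdict(list)
    for cat, val in rows:
        parts[cat].append(val)
    out = []
    for cat in sorted(parts):
        vals = sorted(parts[cat])
        for v in vals:
            # rank = 1 + number of strictly smaller values in the partition
            out.append((cat, v, 1 + sum(1 for w in vals if w < v)))
    return out

data = [("A", 100), ("B", 200), ("A", 200), ("B", 100), ("A", 100)]
print(rank_within_partitions(data))
# [('A', 100, 1), ('A', 100, 1), ('A', 200, 3), ('B', 100, 1), ('B', 200, 2)]
```

The two ties in category 'A' both get rank 1, and 200 jumps to rank 3, matching the explanation.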
❓ Data Output
Intermediate · 2:00
Count rows per group using window function
What is the output of the `count` column after running this Spark code?

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import count

spark = SparkSession.builder.getOrCreate()
data = [("X", 1), ("X", 2), ("Y", 3), ("Y", 4), ("Y", 5)]
df = spark.createDataFrame(data, ["group", "value"])

windowSpec = Window.partitionBy("group")
df = df.withColumn("count", count("value").over(windowSpec))
df.orderBy("group", "value").select("group", "value", "count").collect()
```
💡 Hint
Count over a partition counts all rows in that group for each row.
📝 Explanation
The count() window function counts the number of rows in each partition. For group 'X' there are 2 rows, so each row shows count 2. For group 'Y' there are 3 rows, so each row shows count 3.
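A quick pure-Python sketch (not PySpark) shows the same per-partition counting, where every row carries its group's total row count:

```python
# Plain-Python sketch of count("value").over(Window.partitionBy("group")):
# each row is annotated with the total number of rows in its group.
from collections import Counter

data = [("X", 1), ("X", 2), ("Y", 3), ("Y", 4), ("Y", 5)]
sizes = Counter(g for g, _ in data)          # {'X': 2, 'Y': 3}
result = sorted((g, v, sizes[g]) for g, v in data)
print(result)
# [('X', 1, 2), ('X', 2, 2), ('Y', 3, 3), ('Y', 4, 3), ('Y', 5, 3)]
```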
❓ Visualization
Advanced · 2:30
Visualize cumulative sum with window function
Which option shows the correct cumulative sum of `value` within each category when using this Spark code?

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.getOrCreate()
data = [("A", 10), ("A", 20), ("B", 5), ("B", 15), ("A", 30)]
df = spark.createDataFrame(data, ["category", "value"])

windowSpec = (Window.partitionBy("category").orderBy("value")
              .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn("cumulative_sum", sum("value").over(windowSpec))
df.orderBy("category", "value").select("category", "value", "cumulative_sum").collect()
```
💡 Hint
Cumulative sum adds all previous values including the current one in order.
📝 Explanation
The cumulative sum adds values in order within each category. For 'A', ordered values are 10, 20, 30, sums are 10, 30, 60. For 'B', ordered values 5, 15, sums 5, 20.
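The running sums above can be reproduced with a plain-Python sketch (not PySpark) that mirrors `rowsBetween(unboundedPreceding, currentRow)` per category:

```python
# Plain-Python sketch of a cumulative sum per category, ordered by value,
# mirroring rowsBetween(Window.unboundedPreceding, Window.currentRow).
from itertools import groupby, accumulate

data = [("A", 10), ("A", 20), ("B", 5), ("B", 15), ("A", 30)]
result = []
for cat, grp in groupby(sorted(data), key=lambda r: r[0]):
    vals = [v for _, v in grp]               # sorted(data) orders by value too
    result += [(cat, v, s) for v, s in zip(vals, accumulate(vals))]
print(result)
# [('A', 10, 10), ('A', 20, 30), ('A', 30, 60), ('B', 5, 5), ('B', 15, 20)]
```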
🔧 Debug
Advanced · 2:00
Identify error in window function usage
What error does this Spark code raise?

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import lag

spark = SparkSession.builder.getOrCreate()
data = [(1, 100), (2, 200), (3, 300)]
df = spark.createDataFrame(data, ["id", "value"])

windowSpec = Window.orderBy("id")
df = df.withColumn("prev_value", lag("value", 2).over(windowSpec))
df.show()
```
💡 Hint
lag() can be used with a window specification that has only orderBy.
📝 Explanation
The code runs without error: partitionBy is optional in a window specification. lag() with offset 2 returns null for the first two rows because no row exists two positions earlier in the ordering.
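The lag-by-2 behavior can be sketched in plain Python (not PySpark), with `None` standing in for Spark's null:

```python
# Plain-Python sketch of lag("value", 2) over orderBy("id"):
# each row looks two rows back; rows without one get None (null in Spark).
data = [(1, 100), (2, 200), (3, 300)]
rows = sorted(data)                          # order by id
vals = [v for _, v in rows]
result = [(i, v, vals[k - 2] if k >= 2 else None)
          for k, (i, v) in enumerate(rows)]
print(result)
# [(1, 100, None), (2, 200, None), (3, 300, 100)]
```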
🚀 Application
Expert · 3:00
Calculate moving average with sliding window
Given this Spark code, what is the output of the `moving_avg` column?

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
data = [("2023-01-01", 10), ("2023-01-02", 20), ("2023-01-03", 30),
        ("2023-01-04", 40), ("2023-01-05", 50)]
df = spark.createDataFrame(data, ["date", "value"])

windowSpec = Window.orderBy("date").rowsBetween(-2, 0)
df = df.withColumn("moving_avg", avg("value").over(windowSpec))
df.orderBy("date").select("date", "value", "moving_avg").collect()
```
💡 Hint
Moving average with rowsBetween(-2, 0) averages current row and two previous rows.
📝 Explanation
For each row, the moving average is the average of the current value and the two previous values (if they exist). For example, on 2023-01-03, average of values 10, 20, 30 is 20.0.
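The full `moving_avg` column can be computed with a plain-Python sketch (not PySpark) of the `rowsBetween(-2, 0)` frame:

```python
# Plain-Python sketch of avg("value").over(rowsBetween(-2, 0)):
# average of the current row and up to two preceding rows, in date order.
data = [("2023-01-01", 10), ("2023-01-02", 20), ("2023-01-03", 30),
        ("2023-01-04", 40), ("2023-01-05", 50)]
rows = sorted(data)
result = []
for k, (d, v) in enumerate(rows):
    frame = [val for _, val in rows[max(0, k - 2): k + 1]]
    result.append((d, v, sum(frame) / len(frame)))
print([m for _, _, m in result])
# [10.0, 15.0, 20.0, 30.0, 40.0]
```

The early rows average over a shorter frame (1 then 2 rows) because fewer preceding rows exist.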