Recall & Review
beginner
What is a window function in Apache Spark?
A window function performs calculations across a set of rows related to the current row, defined by a window specification, without collapsing the rows into a single output row.
intermediate
How does a window specification define the rows for aggregation?
It defines partitioning (grouping rows), ordering (sorting rows), and frame boundaries (range of rows around the current row) to control which rows are included in the window.
beginner
What is the difference between windowed aggregation and groupBy aggregation?
GroupBy aggregates collapse multiple rows into one per group, while windowed aggregation keeps all rows and adds aggregated values as new columns based on the window.
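As a sketch of the difference (plain Python standing in for Spark, with made-up sample rows): a groupBy-style aggregate returns one row per group, while a window-style aggregate keeps every input row and attaches the group total as a new column.

```python
from collections import defaultdict

rows = [
    {"category": "a", "sales": 10},
    {"category": "a", "sales": 20},
    {"category": "b", "sales": 5},
]

# groupBy-style: collapse to one row per group
totals = defaultdict(int)
for r in rows:
    totals[r["category"]] += r["sales"]
grouped = [{"category": c, "total": t} for c, t in sorted(totals.items())]

# window-style: keep every row, add the group aggregate as a new column
windowed = [{**r, "total": totals[r["category"]]} for r in rows]

print(len(grouped))   # one row per category
print(len(windowed))  # original row count preserved
```

The row counts make the contrast concrete: `grouped` shrinks to one row per category, while `windowed` has the same three rows as the input, each carrying its category's total.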
intermediate
Example: What does this Spark code do?
<pre>from pyspark.sql import Window
from pyspark.sql.functions import sum as sum_  # alias to avoid shadowing Python's built-in sum

window_spec = Window.partitionBy('category').orderBy('date').rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('running_sum', sum_('sales').over(window_spec))</pre>
It calculates a running total of sales within each category, ordered by date, summing all sales from the start of the partition up to and including the current row.
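The running total above can be mimicked in plain Python (an analogy for the semantics, not Spark itself, using invented sample rows): group by category, sort by date within each partition, then accumulate sales from the start of the partition to the current row.

```python
from itertools import accumulate, groupby
from operator import itemgetter

rows = [
    {"category": "a", "date": "2024-01-02", "sales": 20},
    {"category": "a", "date": "2024-01-01", "sales": 10},
    {"category": "b", "date": "2024-01-01", "sales": 5},
]

result = []
# partitionBy('category'): process each category's rows together
for _, part in groupby(sorted(rows, key=itemgetter("category", "date")),
                       key=itemgetter("category")):
    part = list(part)  # orderBy('date'): already sorted within the partition
    # rowsBetween(unboundedPreceding, 0): cumulative sum up to the current row
    for row, running in zip(part, accumulate(r["sales"] for r in part)):
        result.append({**row, "running_sum": running})

for r in result:
    print(r["category"], r["date"], r["running_sum"])
```

Note that the accumulator restarts at the partition boundary: category `a` climbs 10 → 30, while category `b` starts fresh at 5.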
intermediate
What are frame boundaries in window functions?
Frame boundaries define the subset of rows relative to the current row included in the window, such as all previous rows, a fixed number before and after, or the entire partition.
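A plain-Python sketch of a sliding frame (an analogy with invented values, not Spark code): a frame like `rowsBetween(-1, 1)` averages the previous row, the current row, and the next row, and is clipped at the partition edges, just as Spark clips frames at partition boundaries.

```python
values = [10, 20, 30, 40]  # one ordered partition of 'sales'

# frame rowsBetween(-1, 1): one row before through one row after the current row
moving_avg = []
for i in range(len(values)):
    frame = values[max(0, i - 1): i + 2]  # clipped at the partition edges
    moving_avg.append(sum(frame) / len(frame))

print(moving_avg)  # [15.0, 20.0, 30.0, 35.0]
```

The first and last rows average only two values because part of their frame falls outside the partition.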
Which Spark function is used to define the window frame for aggregation?
Window.rowsBetween() sets the frame boundaries for the window, specifying which rows relative to the current row are included.
What does a windowed aggregation return compared to groupBy aggregation?
Windowed aggregation keeps all original rows and adds new columns with aggregated values computed over the window.
In Spark, which method partitions data for window functions?
partitionBy() divides data into groups for window calculations.
What does the following frame boundary mean? rowsBetween(Window.unboundedPreceding, 0)
It includes all rows from the start of the partition up to the current row.
Which of these is NOT a valid window function in Spark?
groupBy() is a grouping method, not a window function.
Explain how windowed aggregations differ from regular groupBy aggregations in Spark.
Think about how many rows you get back and what extra data is added.
Describe the role of partitioning, ordering, and frame boundaries in defining a window specification.
Consider how these control which rows are included in calculations.