Apache Spark · data · ~5 mins

Windowed aggregations in Apache Spark - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is a window function in Apache Spark?
A window function performs calculations across a set of rows related to the current row, defined by a window specification, without collapsing the rows into a single output row.
intermediate
How does a window specification define the rows for aggregation?
It defines partitioning (grouping rows), ordering (sorting rows), and frame boundaries (range of rows around the current row) to control which rows are included in the window.
beginner
What is the difference between windowed aggregation and groupBy aggregation?
GroupBy aggregates collapse multiple rows into one per group, while windowed aggregation keeps all rows and adds aggregated values as new columns based on the window.
intermediate
Example: What does this Spark code do?
<pre>from pyspark.sql import Window
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

# Cumulative frame: from the start of the partition up to the current row (0)
windowSpec = Window.partitionBy('category').orderBy('date').rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('running_sum', F.sum('sales').over(windowSpec))</pre>
It calculates a running total of sales for each category, ordered by date, summing all sales from the start up to the current row.
intermediate
What are frame boundaries in window functions?
Frame boundaries define the subset of rows relative to the current row included in the window, such as all previous rows, a fixed number before and after, or the entire partition.
Which Spark function is used to define the window frame for aggregation?
A. Window.rowsBetween()
B. Window.orderBy()
C. Window.groupBy()
D. Window.partitionBy()
What does a windowed aggregation return compared to groupBy aggregation?
A. Only aggregated values
B. One row per group
C. All original rows with extra columns
D. No rows, just summary
In Spark, which method partitions data for window functions?
A. partitionBy()
B. orderBy()
C. rowsBetween()
D. groupBy()
What does the following frame boundary mean? rowsBetween(Window.unboundedPreceding, 0)
A. Only current row
B. All rows from start to current row
C. All rows after current row
D. All rows in partition
Which of these is NOT a valid window function in Spark?
A. sum()
B. avg()
C. count()
D. groupBy()
Explain how windowed aggregations differ from regular groupBy aggregations in Spark.
Think about how many rows you get back and what extra data is added.
Describe the role of partitioning, ordering, and frame boundaries in defining a window specification.
Consider how these control which rows are included in calculations.