Recall & Review
beginner
What is a window function in Apache Spark?
A window function performs calculations across a set of rows related to the current row, defined by a window specification, without collapsing the rows into a single output row.
intermediate
How does a window specification define the rows for aggregation?
It defines partitioning (grouping rows), ordering (sorting rows), and frame boundaries (range of rows around the current row) to control which rows are included in the window.
beginner
What is the difference between windowed aggregation and groupBy aggregation?
GroupBy aggregates collapse multiple rows into one per group, while windowed aggregation keeps all rows and adds aggregated values as new columns based on the window.
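As a sketch of the difference (plain Python standing in for Spark, with made-up sample rows): a groupBy-style aggregate returns one row per group, while a window-style aggregate keeps every input row and attaches the group total as a new column.

```python
from collections import defaultdict

rows = [
    {"category": "a", "sales": 10},
    {"category": "a", "sales": 20},
    {"category": "b", "sales": 5},
]

# groupBy-style: collapse to one row per group
totals = defaultdict(int)
for r in rows:
    totals[r["category"]] += r["sales"]
grouped = [{"category": c, "total": t} for c, t in sorted(totals.items())]

# window-style: keep every row, add the group aggregate as a new column
windowed = [{**r, "total": totals[r["category"]]} for r in rows]

print(len(grouped))   # one row per category
print(len(windowed))  # original row count preserved
```

The row counts make the contrast concrete: `grouped` shrinks to one row per category, while `windowed` has the same three rows as the input, each carrying its category's total.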
intermediate
Example: What does this Spark code do?
<pre>from pyspark.sql import Window
from pyspark.sql.functions import sum as sum_  # alias to avoid shadowing Python's built-in sum

window_spec = Window.partitionBy('category').orderBy('date').rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('running_sum', sum_('sales').over(window_spec))</pre>
It calculates a running total of sales within each category, ordered by date, summing all sales from the start of the partition up to and including the current row.
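The running total above can be mimicked in plain Python (an analogy for the semantics, not Spark itself, using invented sample rows): group by category, sort by date within each partition, then accumulate sales from the start of the partition to the current row.

```python
from itertools import accumulate, groupby
from operator import itemgetter

rows = [
    {"category": "a", "date": "2024-01-02", "sales": 20},
    {"category": "a", "date": "2024-01-01", "sales": 10},
    {"category": "b", "date": "2024-01-01", "sales": 5},
]

result = []
# partitionBy('category'): process each category's rows together
for _, part in groupby(sorted(rows, key=itemgetter("category", "date")),
                       key=itemgetter("category")):
    part = list(part)  # orderBy('date'): already sorted within the partition
    # rowsBetween(unboundedPreceding, 0): cumulative sum up to the current row
    for row, running in zip(part, accumulate(r["sales"] for r in part)):
        result.append({**row, "running_sum": running})

for r in result:
    print(r["category"], r["date"], r["running_sum"])
```

Note that the accumulator restarts at the partition boundary: category `a` climbs 10 → 30, while category `b` starts fresh at 5.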
intermediate
What are frame boundaries in window functions?
Frame boundaries define the subset of rows relative to the current row included in the window, such as all previous rows, a fixed number before and after, or the entire partition.
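A plain-Python sketch of a sliding frame (an analogy with invented values, not Spark code): a frame like `rowsBetween(-1, 1)` averages the previous row, the current row, and the next row, and is clipped at the partition edges, just as Spark clips frames at partition boundaries.

```python
values = [10, 20, 30, 40]  # one ordered partition of 'sales'

# frame rowsBetween(-1, 1): one row before through one row after the current row
moving_avg = []
for i in range(len(values)):
    frame = values[max(0, i - 1): i + 2]  # clipped at the partition edges
    moving_avg.append(sum(frame) / len(frame))

print(moving_avg)  # [15.0, 20.0, 30.0, 35.0]
```

The first and last rows average only two values because part of their frame falls outside the partition.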
Which Spark function is used to define the window frame for aggregation?
Window.rowsBetween() sets the frame boundaries for the window, specifying which rows relative to the current row are included.
What does a windowed aggregation return compared to groupBy aggregation?
Windowed aggregation keeps all original rows and adds new columns with aggregated values computed over the window.
In Spark, which method partitions data for window functions?
partitionBy() divides data into groups for window calculations.
What does the following frame boundary mean? rowsBetween(Window.unboundedPreceding, 0)
It includes all rows from the start of the partition up to the current row.
Which of these is NOT a valid window function in Spark?
groupBy() is a grouping method, not a window function.
Explain how windowed aggregations differ from regular groupBy aggregations in Spark.
Think about how many rows you get back and what extra data is added.
Describe the role of partitioning, ordering, and frame boundaries in defining a window specification.
Consider how these control which rows are included in calculations.